HOME    COMMUNITY    BLOGS & FORUMS    Breaking The Three Laws
Breaking The Three Laws
  • About

    Breaking the Three Laws is dedicated to discussing technically challenging ASIC prototyping problems and sharing solutions.
  • About the Author

    Michael (Mick) Posner joined Synopsys in 1994 and is currently Director of Product Marketing for Synopsys' DesignWare USB Solutions. Previously, he was the Director of Product Marketing for Physical (FPGA-based) Prototyping and has held various product marketing, technical marketing manager and application consultant positions at Synopsys. He holds a Bachelor Degree in Electronic and Computer Engineering from the University of Brighton, England.

The Secret Ninja-Fu for Higher Performance Prototype Operation

Posted by Michael Posner on April 5th, 2014

Not many people know this but I am a FPGA-based prototyping Ninja-Fu master. What super power do I have you ask? I have the power to enable higher performance prototype operation and in this week’s blog I am sharing this ancient secret power with you. Wow, the start of this blog sounds like the bio from a really bad “B” movie, it definitely seemed funnier in my mind, then again everything seems funnier in my mind. There is actually some seriousness to this blog as I really am going to share the not so secret method to enable higher performance in your FPGA-based prototypes.

First, let’s study a typical SoC, this case it’s ~40-50 Million ASIC gates. I chose this design as it’s easier to explain but the principle for higher performance operation is even more important for larger more complex SoC’s. Our example SoC includes a CPU with tightly coupled GPU and DDR3-based memory subsystem, PCIe high performance interface, SRAM scratch pad storage, global bus, custom logic block (your SoC’s special sauce for instance) and a number of lower performance peripherals

When you model such an SoC in an FPGA-based prototype, even with the largest FPGA’s, you need to partition the design. Partition is to split up the design across multiple FPGA’s. The challenge is that the SoC design blocks have more signals than you have FPGA pins (Hey that’s one of the three laws of the breaking the three laws blog). We all know that when you partition such a design you need to insert pin multiplexing to manage the many signals over the limited FPGA pins. As I am writing this I suddenly realized I have shared the secret of higher performance prototyping before, here, anyway, this blog is way cooler so I’ll continue writing.

The challenge of partitioning this design is that due to the tightly coupled CPU/GPU you end up with many signals spanning out from a small number of design blocks. Lets assume the CPU and GPU are partitioned across two FPGA’s. If all you are prototyping is these two blocks then with the use of pin multiplexing you can connect the two blocks together. The challenge of this prototyping project is that you are also modeling the other design blocks as you want to validate the software and use that to validate your RTL design blocks. This means you end up with the SoC partitioned across four FPGA’s which forces even more connections between FPGA’s.

The picture above is a representation of the partition, the raw IO interconnect usage and the number of external IO’s required for daughter boards. Suddenly you see not only the sheer volume of interconnect needed but also the number of individual connectors required to create such an partition. Just look at FPGA 2, it’s packed with IO and daughter boards. You could try and partition the design in a different way but it’s sure to tank the performance as the GPU needs to be tightly coupled to the DDR3 memory and the CPU requires a tight link to the PCIe interface. If you sacrifice physical IO between FPGA 1 and FPGA 2 you will end up with very high pin mux ratios resulting in very low system performance.

If you were to try and model this SoC on a board with a fixed interconnect between FPGA’s or forced to use a board with great big IO connectors you would physically not be able to support SoC designs like this. With the fixed interconnect board, even if you could work out a partition, you will have to force fit your SoC interconnect topology across a fixed number of IO’s resulting in high mux ratio’s thus low performance. In addition it’s unlikely that the board would have the number of available external connector IO to support the SoC’s external interfaces for daughter boards. It’s similarly bad on a board with high pin count connectors.  Using our typical SoC as the example, if the FPGA-based prototyping board has FMC like connectors, ~150 IO’s per connector, you would need ten connectors to support the required interconnect and daughter boards for the tightly connected CPU/GPU. Whoops, I know of no board that has this many connectors. Again you would be forced to use very high pin multiplexing tanking the performance and making the platform worthless.

Now look at the HAPS-70 S48, the Synopsys four FPGA FPGA-based prototyping system. This type of typical SoC design is the reason why the HAPS-70 systems expose all the FPGA’s pins to HapsTrak 3 (HT3) connectors. HT3 granularity is 50 FPGA IO’s per connector and are bank matched to the Xilinx Virtex-7 2000T banks and Super Logic Regions (SLR’s). This granularity is the “not so secret” enabler for SoC prototyping and the key to higher performance operation.

Now you can see that not only do you have the connector granularity to tailor the interconnect to the requirements of the SoC design but you also have ample connectors to support the external IO daughter boards. You can create a very dense interconnect between FPGA 1 and FPGA 2 supporting the tightly coupled CPU/GPU and you don’t need pin multiplexing as you have the physical number of IO’s needed. At the same time you can support all the other interconnect requirements to the other FPGA’s and the required daughter boards.

Hold on, there’s more….

You have ample connectors to setup the prototype with the needed JTAG debugger daughter board connecting your software debugger to the CPU. You have ample connectors to add real time debug to the platform. Real time debug is when you extract signals from the design and route them to a debugger daughter board which you connect a logic analyzer to. Oh and you can also add on some HAPS Deep Trace Debug memory so you can capture seconds of debug visibility. So not only is the HAPS system higher performance but the hardware architecture is the enabler for prototyping typical SoC’s. If you are smart you will also understand that as the SoC grows in size and requires more FPGA partitions that the HAPS flexible interconnect architecture becomes even more important. Below you can see a picture of the HAPS-70 S96, eight FPGA system, deployed for SoC prototyping enabling earlier software development and system validation

Now you have the secret Ninja-Fu.

I’ve pretty much finished my large tracked vehicle project which I featured last week. I plan to add a controllable shovel and other attachments to the front but I got side tracked building a new project.

This is a very small shovel dozer, you can see how small it is as it’s sitting on top of my home-built tracked vehicle. (Or maybe my tracked vehicle is just very big). You this little shovel dozer come from a kit but rather than using the supplied hard wired connection I’m going to retro fit this model with some tiny radio controlled electronics. I have not built the RC control yet and with upcoming business travel I’m not sure when I am going to get a chance to. I’ll be sure to post an update when this latest project is finished.

  • Print
  • Digg
  • StumbleUpon
  • del.icio.us
  • Facebook
  • Twitter
  • Google Bookmarks
  • LinkedIn