Breaking The Three Laws


How To Run Your Multi-FPGA Prototype at Ludicrous Speed

Ludicrous Speed

In this blog I’m going to explain how you can run a multi-FPGA HAPS prototype at up to 100 MHz!

I’ve blogged a couple of times now on how the HAPS flexible interconnect routing architecture, combined with HAPS High Speed Time Domain Multiplexing (HSTDM, i.e. pin multiplexing), increases the performance of a multi-FPGA HAPS prototype.

This week I was asked how fast you can run a multi-FPGA HAPS system when NO pin multiplexing is needed. Interesting: I sometimes forget that there are designs which do not exceed the number of physical I/O pins on the HAPS systems. So let’s assume you have a design which can be partitioned without the need for pin multiplexing. On a HAPS system this means that the number of signals between FPGAs does not exceed ~1100 per FPGA. Did you know that Synopsys ProtoCompiler’s first goal is to find an automated partition that requires no pin multiplexing at all? If ProtoCompiler cannot find such a solution, it deploys HAPS High Speed Pin Multiplexing, aiming for the lowest pin-mux ratio to deliver the highest performance. Many designs require multiple thousands of signals between FPGAs, so in those cases ProtoCompiler has to automatically deploy one of the available multiplexing schemes.
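As a rough way to picture that decision, here is a minimal Python sketch. It checks whether each FPGA's inter-FPGA signal count fits within the ~1100 pins mentioned above and, if not, estimates the smallest multiplexing ratio as the ceiling of signals over pins. This is a deliberate simplification for illustration, not how ProtoCompiler works internally: the function name, the example partition, and the whole-number ratio model are all assumptions, and the tool's real ratios and pin accounting are not modeled.

```python
import math

PINS_PER_FPGA = 1100  # approximate inter-FPGA signal limit quoted above


def min_mux_ratio(cut_signals, pins=PINS_PER_FPGA):
    """Smallest whole-number signals-per-pin ratio that fits; 1 means no pin multiplexing."""
    return max(1, math.ceil(cut_signals / pins))


# Hypothetical partition: inter-FPGA signal count seen by each FPGA.
partition = {"FPGA_A": 900, "FPGA_B": 1100, "FPGA_C": 3400}

for fpga, signals in partition.items():
    ratio = min_mux_ratio(signals)
    if ratio == 1:
        verdict = "no pin multiplexing needed"
    else:
        verdict = f"needs at least {ratio}:1 multiplexing"
    print(f"{fpga}: {signals} signals -> {verdict}")
```

In this made-up example only FPGA_C exceeds the pin budget, so it alone would force multiplexing on its links.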

I was also asked if the intelligent interconnect affects performance, as in, are PCB traces faster? The answer is that, in the prototyping space, the effect is negligible. I blogged on this case, with the technical data explaining why, in an earlier post.

HT3 interconnect routes raw performance

The HT3 interconnect is capable of operating far faster than you can run the prototype.

Combine this with the fact that you can tailor the HAPS flexible interconnect to match the requirements of the SoC (DUT), and the result is an optimized configuration.

HAPS Flexible interconnect routing architecture

OK, you are probably getting bored by this point, so I’ll finish with the data needed to calculate the performance of a HAPS prototype when no pin multiplexing is implemented. The equation is: System path delay = Total FPGA delay + Inter-FPGA delay, where the system path is the link from a pin on one FPGA to a pin on a second FPGA.

The inter-FPGA delay for the HAPS-70 and the new generation of Xilinx UltraScale VU440 based systems is actually very close, ~6 ns for both. The difference comes in the FPGA fabric: Virtex-7 2000T (logic delay + I/O) vs. UltraScale (user logic delay + I/O). For the best-case logic delay, we can assume clk-to-Q + LVCMOS_18 I/O delay.

For UltraScale VU440 (very preliminary because Xilinx datasheets are not complete, so I extracted what I could):

• LVCMOS_18 output delay for -1 parts: 1.19 ns

• LVCMOS_18 input delay for -1 parts: 0.54 ns

• Global clock input with MMCM to clock output: 1.79 ns

• Total FPGA delay = 1.19 ns + 0.54 ns + 1.79 ns = 3.52 ns

• System path = 3.52 ns + 6 ns = 9.52 ns

Result: 1 / 9.52 ns ≈ 105 MHz

For V7 2000T:

• (2.36 ns + 0.9 ns + 0.49 ns) + 6 ns = 9.75 ns (Result: ~102.6 MHz)

Ergo, performance with no pin multiplexing of a little over 100 MHz on HAPS systems. There is no real difference between the two platforms: the architectures are the same, and the FPGA’s own speed improvement is too small to have much effect on the total path delay.
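If you want to re-run this arithmetic with your own datasheet numbers, here is a small Python sketch. The function and variable names are made up for illustration (they are not from any Synopsys or Xilinx tool), and the delay values are simply the ones quoted above.

```python
# Illustrative helper names; not part of any Synopsys or Xilinx tool.

INTER_FPGA_DELAY_NS = 6.0  # approximate inter-FPGA delay quoted above for both platforms


def system_path_delay_ns(fpga_delays_ns, inter_fpga_delay_ns=INTER_FPGA_DELAY_NS):
    """System path delay = total FPGA delay + inter-FPGA delay (all in ns)."""
    return sum(fpga_delays_ns) + inter_fpga_delay_ns


def max_frequency_mhz(path_delay_ns):
    """Highest clock frequency whose period still covers the path delay."""
    return 1000.0 / path_delay_ns


# Best-case (clk-to-Q only) FPGA delays from the figures above, in ns.
platforms = {
    "UltraScale VU440": [1.19, 0.54, 1.79],
    "Virtex-7 2000T": [2.36, 0.90, 0.49],
}

for name, delays in platforms.items():
    path = system_path_delay_ns(delays)
    print(f"{name}: {path:.2f} ns -> {max_frequency_mhz(path):.1f} MHz")
```

Swapping in a different inter-FPGA delay or I/O standard only means changing the numbers in the table; the relationship between path delay and clock frequency stays the same.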

The funny thing is I have never come across a customer running a multi-FPGA prototype at this performance, even when they are not using any pin multiplexing. The reason is the USER LOGIC delay. Above we assume the best case, clk-to-Q, but in reality users’ designs end up adding ~15 ns of user logic delay. So now, rather than a delay of ~10 ns, you have a total path delay of ~25 ns, or 40 MHz. This is a more realistic performance number to expect from a multi-FPGA prototype when no pin multiplexing is used. If ProtoCompiler is able to find the perfect clk-to-Q partition point on all signals that have to cross FPGAs then yes, the resulting HAPS prototype is capable of running at up to 100 MHz. Even if ProtoCompiler is forced to deploy HAPS High Speed Pin Multiplexing (HSTDM), at the lowest ratio you can get up to 30 MHz. Again, the biggest performance limiter is user logic delay, not the technology itself.
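To see how quickly user logic delay eats into the clock rate, here is a minimal Python sketch that sweeps the delay and also works backwards from a target frequency. The names are hypothetical, and the ~10 ns best-case path is just the rounded figure from the paragraph above.

```python
# Illustrative sketch; names are hypothetical, numbers are the round figures quoted above.

BEST_CASE_PATH_NS = 10.0  # ~clk-to-Q + I/O + inter-FPGA delay, rounded as in the text


def prototype_fmax_mhz(user_logic_delay_ns, base_path_ns=BEST_CASE_PATH_NS):
    """Achievable clock in MHz once user logic delay is added to the best-case path."""
    return 1000.0 / (base_path_ns + user_logic_delay_ns)


def user_logic_budget_ns(target_mhz, base_path_ns=BEST_CASE_PATH_NS):
    """User logic delay (ns) that a crossing path can tolerate at a target clock."""
    return 1000.0 / target_mhz - base_path_ns


for delay_ns in (0, 5, 10, 15, 20):
    print(f"user logic {delay_ns:>2} ns -> {prototype_fmax_mhz(delay_ns):.0f} MHz")

print(f"budget at 50 MHz: {user_logic_budget_ns(50):.0f} ns of user logic")
```

With 15 ns of user logic the sweep lands on the 40 MHz figure above, and the budget line shows the other view: to hit 50 MHz, paths that cross FPGAs can afford only about 10 ns of user logic.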

I took a day off this week so I could go to the track. I volunteer and coach race car driving, with the side benefit of getting a lot of time on the track myself, either in a coaches-only session or out with the students doing lead/follow. I got in over 200 miles this weekend at the Ridge MotorSports Park, great fun. I also enjoyed making use of the tent on top of my truck so I didn’t have to get a motel room. The moon was amazing on Friday night.

I’m not sure if you can see my tent in this picture 🙂

Tent on top of Toyota FJ truck with Race Car Trailer behind it

How fast do you run your prototypes? Make a comment and let me know.

