Powering Circuit Simulation Software with NVIDIA GPUs 

Samad Parekh

Feb 08, 2022 / 4 min read

Whether you’re playing the latest interactive game or watching a movie on your tablet, you’re reaping the benefits of graphics processing units (GPUs). Over the last decade, GPU technology has experienced phenomenal advancements. Initially used to render graphics and video, GPUs are increasingly used for high-performance computing (HPC) tasks such as deep learning, artificial intelligence (AI), and more. Indeed, the HPC industry is moving towards an accelerated computing model where intensive calculations are carried out on GPUs to achieve faster real-world execution times.

Continuing advances in semiconductor process technology, along with increasing circuit complexity, present a growing challenge for circuit simulation, specifically when it comes to simulation cost, quality, and time to results. To address these challenges and ensure that today’s chips are thoroughly verified, you need a unified flow with advanced GPU performance scaling.

GPUs are an Attractive Option to Accelerate Circuit Simulation and Sign Off

With CPU performance gains beginning to plateau, GPUs are an attractive option to accelerate circuit simulation and sign off. Across a variety of circuit types (PLLs, SerDes, SRAMs, PHYs) with device counts in the tens or hundreds of millions of elements, GPUs can deliver up to 10x improvement in simulation runtime, as shown in Figure 1.

Figure 1: Performance gain with V100 GPU

Synopsys PrimeSim Continuum Now with NVIDIA Ampere Tensor Core A100 GPU

Synopsys PrimeSim™ Continuum, released originally on NVIDIA V100 GPUs, offers a unique next-generation CPU-GPU hybrid architecture that delivers significant performance improvements while meeting signoff accuracy requirements for today’s advanced applications.

The latest version of the PrimeSim simulator (version 2021.09) supports the NVIDIA A100 Tensor Core GPU. Launched in 2020, the Ampere A100 is NVIDIA’s most recent data center GPU. Traditional HPC workloads, such as circuit simulation, continue to demand more double-precision compute performance and memory bandwidth. Leveraging architecture concepts for GEMM (General Matrix-Matrix Multiplication) acceleration, the A100 adds Tensor Core support for the double-precision FP64 data type, boosting peak FP64 performance to 19.5 TFLOPS. Table 1 compares key attributes of the Ampere A100 (2020) to its data center predecessor, the Volta V100 (2017).

 

 

| Attribute        | Volta V100 | Ampere A100 | Increase |
|------------------|------------|-------------|----------|
| FP64             | 7.8 TFLOPS | 19.5 TFLOPS | 2.5x     |
| DRAM Bandwidth   | 900 GB/s   | 2,000 GB/s  | 2.2x     |
| NVLink Bandwidth | 300 GB/s   | 600 GB/s    | 2x       |
| L2 Capacity      | 6 MB       | 40 MB       | 6.7x     |
| DRAM Capacity    | 32 GB      | 80 GB       | 2.5x     |

 

Table 1: Comparison of key attributes between V100 and A100 GPUs
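
As a back-of-envelope check of the FP64 figures above, the sketch below estimates how long an ideal double-precision GEMM would take at each GPU's quoted peak rate. The matrix size is an arbitrary illustration, and real kernels reach only a fraction of peak, but the 2.5x ratio between the two peak rates falls out directly.

```python
# Illustrative estimate: time for a double-precision GEMM at the peak
# rates quoted in Table 1 (V100: 7.8 TFLOPS, A100 FP64 Tensor Core:
# 19.5 TFLOPS). Real kernels achieve only a fraction of peak.

def gemm_flops(m, n, k):
    # C = A (m x k) @ B (k x n): each output element needs
    # k multiplies and k adds, hence 2*m*n*k FLOPs total.
    return 2 * m * n * k

def seconds_at_peak(flops, tflops):
    return flops / (tflops * 1e12)

n = 8192                       # square matrices, illustrative size
flops = gemm_flops(n, n, n)    # ~1.1e12 FLOPs
t_v100 = seconds_at_peak(flops, 7.8)
t_a100 = seconds_at_peak(flops, 19.5)
print(f"{flops:.3e} FLOPs: V100 {t_v100 * 1e3:.1f} ms, "
      f"A100 {t_a100 * 1e3:.1f} ms, ratio {t_v100 / t_a100:.1f}x")
```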

 

Ampere dramatically increases each of these key hardware features, including 5x greater FP16 throughput, 2.2x more DRAM bandwidth, and 6.7x more on-chip L2 cache. In addition to massive parallel computing throughput and memory bandwidth, the Ampere architecture includes hardware support for accelerating machine learning and HPC applications, such as structured sparsity in the Tensor Cores. In the memory system, the A100 provides a range of features for better control over data movement and placement. It supports data transfers directly from the memory hierarchy into shared memory, without requiring the data to transit through the register file. The A100 also provides a new set of L2 cache control operations that allow the programmer to influence cache replacement policies and effectively dictate which data structures stay resident in the cache. Finally, the L2 cache includes hardware-supported data compression: data is kept compressed in both DRAM and L2 (saving bandwidth and capacity) and is decompressed or recompressed as it is transferred to and from a Streaming Multiprocessor (SM).

Because it supports the Ampere A100 architecture, PrimeSim can exploit the following benefits:

  • 35% more Streaming Multiprocessors (SMs), from 80 to 108
  • Tensor Cores capable of FP64 operations
  • 2x larger L1 cache and 6.7x larger L2 cache
  • 2x higher memory bandwidth, from 900 GB/s to 2 TB/s

With modern process nodes giving rise to larger device counts, the two most computationally significant tasks for a SPICE simulator are model evaluation and matrix solution. Having more SMs directly benefits large netlists with high transistor counts: each streaming multiprocessor is a double-precision computing unit capable of running thousands of threads in parallel, enabling a massive number of device evaluations at once. The much larger L1 and L2 caches also mean far less data swapping, which further reduces simulation time.
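
To make the model-evaluation phase concrete, here is a minimal sketch of what a simulator evaluates per device per Newton iteration. The device model (an ideal diode, I = Is·(exp(V/Vt) − 1)) and all constants are illustrative assumptions, not PrimeSim internals; on a GPU, thousands of such independent evaluations run in parallel across the SMs, while here a simple loop stands in for that parallelism.

```python
import math

# Hypothetical diode model constants (not from PrimeSim):
IS = 1e-14     # saturation current (A), assumed
VT = 0.02585   # thermal voltage at ~300 K (V)

def eval_diode(v):
    """Evaluate one device: current and small-signal conductance at voltage v."""
    e = math.exp(v / VT)
    i = IS * (e - 1.0)      # device current I = Is*(exp(V/Vt) - 1)
    g = (IS / VT) * e       # conductance g = dI/dV, used to stamp the matrix
    return i, g

# One "parallel" evaluation pass over many devices; each iteration is
# independent, which is exactly what maps well onto GPU threads.
voltages = [0.55, 0.60, 0.65, 0.70]
results = [eval_diode(v) for v in voltages]
for v, (i, g) in zip(voltages, results):
    print(f"V={v:.2f} V -> I={i:.3e} A, g={g:.3e} S")
```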

Larger numbers of parasitics in the netlist often give rise to denser matrices, and solving these is computationally expensive because it requires large numbers of double-precision floating-point operations. This is where the Tensor Cores in the streaming multiprocessors offer a performance boost: with up to 19.5 TFLOPS, the A100 is extremely efficient at solving dense matrices. With the above enhancements, and using an optimal combination of CPU and GPU across the same selection of cases, the A100-40GB GPU provides an additional 50% improvement on average over the V100 GPU, as shown in Figure 2.

Figure 2: Additional performance gain with A100 GPU vs. V100 GPU
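
The matrix-solution step described above can be sketched with a toy dense solver. This is plain Gaussian elimination with partial pivoting, not the simulator's actual algorithm; the point is that the cost grows as roughly (2/3)·n³ double-precision FLOPs with matrix dimension n, which is the work the A100's FP64 Tensor Cores accelerate. The 3x3 conductance-style system at the end is an invented example.

```python
# Toy dense solve (Gaussian elimination with partial pivoting) to show why
# matrix solution dominates: ~(2/3)*n^3 FP64 FLOPs per solve. Illustrative
# only; production SPICE solvers are far more sophisticated.

def solve(A, b):
    n = len(A)
    A = [row[:] for row in A]   # work on copies
    b = b[:]
    for k in range(n):
        # Partial pivoting: bring the largest |pivot| in column k to row k.
        p = max(range(k, n), key=lambda r: abs(A[r][k]))
        A[k], A[p] = A[p], A[k]
        b[k], b[p] = b[p], b[k]
        # Eliminate column k below the pivot (the O(n^3) inner loops).
        for i in range(k + 1, n):
            f = A[i][k] / A[k][k]
            for j in range(k, n):
                A[i][j] -= f * A[k][j]
            b[i] -= f * b[k]
    # Back-substitution on the upper-triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(A[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (b[i] - s) / A[i][i]
    return x

# Invented 3x3 conductance-matrix-style system: G @ v = i
G = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
i = [1.0, 0.0, 0.0]
print(solve(G, i))  # node voltages solving the system
```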

Leveraging the Power of the GPU for SPICE-Level Accuracy

You have an ever-growing need to simulate larger circuits with SPICE-level accuracy. These analog and mixed-signal simulations are often too time-consuming and, in many cases, simply impossible to run at the accuracy you need. With PrimeSim Continuum, you have an alternative. Leveraging the power of the GPU, its heterogeneous, accelerated compute architecture lets you simulate these challenging circuits to signoff with SPICE-level accuracy, ultimately reducing your runtimes from days or weeks to just hours. It’s a practical way to characterize the performance of your designs without compromising on accuracy.
