by Samad Parekh, Senior Product Manager for Custom Design and Physical Verification Group, Synopsys, and Srinivas Kodiyalam, Senior Developer Relations Manager, Industrial HPC and AI, NVIDIA
Whether you’re playing the latest interactive game or watching a movie on your tablet, you’re reaping the benefits of graphics processing units (GPUs). Over the last decade, GPU technology has experienced phenomenal advancements. Initially used to render graphics and video, GPUs are increasingly used for high-performance computing (HPC) tasks such as deep learning, artificial intelligence (AI), and more. Indeed, the HPC industry is moving towards an accelerated computing model where intensive calculations are carried out on GPUs to achieve faster real-world execution times.
Continuing advances in semiconductor process technology, along with increasing circuit complexity, present a growing challenge for circuit simulation, specifically in simulation cost, quality, and time to results. To solve these challenges and ensure that today’s chips are thoroughly verified, you need a unified flow with advanced GPU performance scaling.
With CPU performance gains beginning to plateau, GPUs are an attractive option for accelerating circuit simulation and signoff. Across a variety of circuit types (PLLs, SerDes, SRAMs, PHYs) with tens or hundreds of millions of elements, GPUs can deliver up to 10x improvement in simulation runtime, as shown in Figure 1.
Synopsys PrimeSim™ Continuum, released originally on NVIDIA V100 GPUs, offers a unique next-generation CPU-GPU hybrid architecture that delivers significant performance improvements while meeting signoff accuracy requirements for today’s advanced applications.
The latest version of the PrimeSim simulator (version 2021.09) supports the NVIDIA A100 Tensor Core GPU architecture. The Ampere A100, launched in 2020, is NVIDIA’s most recent data center GPU. Traditional HPC workloads, such as circuit simulation, continue to demand more double-precision compute performance and memory bandwidth. Leveraging architectural concepts from GEMM (General Matrix-Matrix Multiplication) acceleration, the A100 adds Tensor Core support for the double-precision FP64 data type, boosting peak FP64 performance to 19.5 TFLOPS. Table 1 compares key attributes of the Ampere A100 (2020) and its data center predecessor, the Volta V100 (2017) GPU.
| Attribute | Volta V100 | Ampere A100 | Increase |
|---|---|---|---|
| FP64 | 7.8 TFLOPS | 19.5 TFLOPS | 2.5x |
| DRAM Bandwidth | 900 GB/s | 2,000 GB/s | 2.2x |
| NVLink Bandwidth | 300 GB/s | 600 GB/s | 2x |
| L2 Capacity | 6 MB | 40 MB | 6.7x |
| DRAM Capacity | 32 GB | 80 GB | 2.5x |
Table 1: Comparison of key attributes between V100 and A100 GPUs
Ampere dramatically increases each of these key hardware features, including 5x greater FP16 throughput, 2.2x more DRAM bandwidth, and 6.7x more on-chip L2 cache. In addition to massive parallel computing throughput and memory bandwidth, the Ampere architecture includes hardware support for accelerating machine learning and HPC applications, such as structured sparsity in the Tensor Cores. In the memory system, A100 provides a range of features for better control over data movement and placement. A100 supports data transfers directly from the memory hierarchy into shared memory, without requiring the data to transit through the register file. A100 also provides a new set of L2 cache control operations that let the programmer influence cache replacement policies and effectively dictate which data structures stay resident in the cache. Finally, the L2 cache includes hardware-supported data compression: data is kept compressed in both DRAM and L2 (saving bandwidth and capacity) and is decompressed and recompressed as it is transferred to and from a Streaming Multiprocessor (SM).
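The bandwidth figures above matter because many simulation kernels are memory-bound rather than compute-bound. A minimal sketch of the standard roofline model illustrates this; the peak and bandwidth numbers come from Table 1, while the 0.25 flop/byte arithmetic intensity for a sparse-matrix-style kernel is an illustrative assumption, not a measured PrimeSim figure:

```python
# Simple roofline model: attainable throughput is capped either by peak
# compute or by how fast DRAM can feed the cores. Peak/bandwidth values
# are from Table 1; the 0.25 flop/byte intensity is an illustrative guess.

def roofline_gflops(intensity_flops_per_byte, peak_tflops, dram_gb_s):
    """Attainable FP64 throughput (GFLOPS) under the roofline model."""
    compute_roof = peak_tflops * 1e3                    # TFLOPS -> GFLOPS
    bandwidth_roof = intensity_flops_per_byte * dram_gb_s
    return min(compute_roof, bandwidth_roof)

for name, peak, bw in [("V100", 7.8, 900), ("A100", 19.5, 2000)]:
    rate = roofline_gflops(0.25, peak, bw)
    print(f"{name}: sparse-like kernel limited to ~{rate:.0f} GFLOPS")
```

At this intensity both GPUs are bandwidth-bound, so such kernels scale with the 2.2x DRAM bandwidth increase rather than the 2.5x FP64 peak, which is why the memory-system improvements above are as important as raw TFLOPS.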
Because it supports the Ampere A100 architecture, PrimeSim can exploit the following benefits:
With modern process nodes giving rise to larger device counts, the two most significant tasks for a SPICE simulator are model evaluation and matrix solution. Having more SMs directly benefits large netlists with high transistor counts: each streaming multiprocessor is a double-precision computing unit capable of running thousands of threads in parallel, allowing a massive number of device evaluations to proceed simultaneously. The much larger L1 and L2 caches also mean far less data swapping, which further reduces simulation time.
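The two phases can be sketched in miniature. The toy Newton-Raphson loop below separates per-device model evaluation (independent per device, which is why it maps well onto thousands of GPU threads) from the matrix solution; the simplified diode model and all function names are illustrative assumptions, not PrimeSim internals:

```python
import math

def eval_diode(v, i_s=1e-14, vt=0.02585):
    """Companion (linearized) model of one diode at bias v: returns
    conductance g = dI/dV and Norton equivalent current source i_eq.
    Each call is independent, so millions can run in parallel on a GPU."""
    e = math.exp(min(v / vt, 40.0))   # clamp the exponent to avoid overflow
    g = (i_s / vt) * e
    i_eq = i_s * (e - 1.0) - g * v
    return g, i_eq

def gaussian_solve(a, b):
    """Tiny dense solver with partial pivoting, standing in for the
    GPU-accelerated matrix-solution phase. Mutates a and b."""
    n = len(b)
    for k in range(n):
        p = max(range(k, n), key=lambda r: abs(a[r][k]))
        a[k], a[p] = a[p], a[k]
        b[k], b[p] = b[p], b[k]
        for r in range(k + 1, n):
            f = a[r][k] / a[k][k]
            b[r] -= f * b[k]
            for c in range(k, n):
                a[r][c] -= f * a[k][c]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):
        s = sum(a[k][c] * x[c] for c in range(k + 1, n))
        x[k] = (b[k] - s) / a[k][k]
    return x

def solve_node(i_src=1e-3, iters=50):
    """Newton-Raphson on one node: a 1 mA source driving a diode.
    Each iteration = one model-evaluation phase + one matrix solve."""
    v = 0.6
    for _ in range(iters):
        g, i_eq = eval_diode(v)                        # phase 1: evaluate
        v = gaussian_solve([[g]], [i_src - i_eq])[0]   # phase 2: solve
    return v
```

In a production simulator the evaluation phase sweeps millions of transistor model instances per iteration, which is exactly the workload the A100's SM count and larger caches accelerate.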
Larger numbers of parasitics in the netlist often produce denser matrices, which are computationally expensive to solve because they require large numbers of double-precision floating-point operations. This is where the Tensor Cores in the streaming multiprocessors offer a performance enhancement: with up to 19.5 TFLOPS, the A100 is extremely efficient at solving dense matrices. With these enhancements, and using an optimal combination of CPU and GPU across the same selection of cases, the A100-40GB GPU provides an additional 50% improvement on average over the V100 GPU, as shown in Figure 2.
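A back-of-envelope estimate shows why the FP64 rate dominates this phase. Dense LU factorization costs roughly (2/3)n³ floating-point operations; the 50,000-unknown example below is an arbitrary illustration, and real post-layout matrices are sparse, so treat these as idealized lower bounds rather than PrimeSim measurements:

```python
# Idealized cost of one dense LU factorization at the peak FP64 rates
# from Table 1. Assumes perfect sustained throughput, which real
# solvers never reach; n = 50,000 is an arbitrary example size.

def lu_flops(n):
    """~(2/3) n^3 floating-point operations for dense LU factorization."""
    return (2.0 / 3.0) * n ** 3

def ideal_seconds(n, tflops):
    """Lower-bound solve time at a given sustained FP64 rate."""
    return lu_flops(n) / (tflops * 1e12)

n = 50_000
for name, tflops in [("V100", 7.8), ("A100", 19.5)]:
    print(f"{name}: {ideal_seconds(n, tflops):.1f} s minimum per factorization")
```

The 2.5x gap between the two ideal times mirrors the FP64 row of Table 1, and with many factorizations per transient simulation the savings compound.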
You have an ever-growing need to simulate larger circuits with SPICE-level accuracy. These analog and mixed-signal simulations are often too time-consuming and, in many cases, simply not possible to run at the accuracy you need. With PrimeSim Continuum, you have an alternative. Leveraging the power of the GPU, its heterogeneous, accelerated compute architecture lets you simulate these challenging circuits to signoff with SPICE-level accuracy, reducing runtimes from days or weeks to just hours. It’s a practical way to characterize the performance of your designs without compromising on accuracy.