How Silicon Lifecycle Management Strengthens HPC and Data Center Reliability

Guy Cortez, Randy Fish

Jun 20, 2023 / 5 min read

Beyond the hyper-connected, AI-driven, answers-at-your-fingertips convenience, the need for high-performance computing (HPC) and hyperscale levels of storage can be existential. Supercomputers are helping to improve the outcomes in everything from mathematical models to climate predictions, and cloud data centers house the infrastructure that keeps our digital lives humming. There is more data today than has ever existed before. It moves at high speeds across vast distances. Silicon process nodes are shrinking, pushing the reticle boundaries of manufacturing, giving rise to multi-die systems that are forging new possibilities in performance.

With all this advanced complexity in electronic systems, you might ask, what can go wrong? Simply put: a lot. Silent data corruption (SDC), in which errors go undetected below the surface, is real, as are device aging, thermal and power challenges, and more. These challenges can be a headache and may well culminate in catastrophe if they aren’t handled well, especially if you are dealing with them at scale.


For SoC designers, greater complexity is a forcing function for employing a silicon lifecycle management (SLM) strategy to ensure the reliability, availability, and serviceability (RAS) of your devices. In fact, knowing what is happening inside your final product, along with understanding the long-term RAS implications, is essential for design success.

What Does a Silicon Lifecycle Management Strategy Look Like?

It’s no longer just about making sure your chip works when you produce and ship it. SoC and multi-die designs need to be monitored and tested throughout their lives and repaired (when possible) before you have a problem or even a failure. To do this, you need control of and access to the elements inside your chips to debug and read out data and perform the appropriate analytics to determine if there are issues. This information will enable you to service your systems before it’s too late.

An SLM strategy can help you take specific action to ensure RAS throughout a device’s life:

  • In-Design: Identify the optimal design component candidates in your device for monitoring. Insert monitor IP directly into the infrastructure of the design.
  • In-Ramp: Prioritize your highest yield limiter candidates, perform accurate failure analysis and course correct the design and/or fab process to satisfy high yield requirements.
  • In-Production: Identify yield and quality outliers through automated insights, perform root-cause analysis across various stages of high-volume manufacturing and course correct in the semiconductor supply chain as necessary.
  • In-Field: Assess silicon health through predictive maintenance and optimize performance metrics such as power and throughput (if possible), especially as the device ages.
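As an illustration of the In-Production step, one common way to flag outliers in monitor or test data is a robust z-score built on the median and the median absolute deviation (MAD). The sketch below is only that: the function name, the sample ring-oscillator readings, and the 3.5 cutoff are assumptions chosen for the example, not part of any Synopsys product.

```python
from statistics import median

def find_outliers(readings, cutoff=3.5):
    """Return the indices of readings whose robust z-score exceeds cutoff."""
    med = median(readings)
    mad = median([abs(x - med) for x in readings])
    if mad == 0:
        # MAD is zero when more than half the readings are identical;
        # this sketch simply flags nothing in that degenerate case.
        return []
    # 0.6745 rescales MAD so the score is comparable to a standard z-score
    return [i for i, x in enumerate(readings)
            if abs(0.6745 * (x - med) / mad) > cutoff]

# Hypothetical ring-oscillator frequencies (MHz) for dies on one wafer
freqs = [501.2, 499.8, 500.5, 498.9, 500.1, 472.3, 500.9, 499.5]
print(find_outliers(freqs))  # prints [5]
```

A median-based score is used here because a single bad die shifts the mean and standard deviation far more than it shifts the median, which could otherwise hide the very outlier you are looking for.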

 

An Example: Modeling Strategies for Better Thermal and Power Management

Managing thermal complexities and optimizing power are major priorities in SoC systems. While this is true across a single die, it is far more difficult to do across multiple dies in a system, especially as the system ages. Inserting the right monitors into your design is foundational to mitigating heat and voltage problems and to achieving long-term device success in both HPC and the data center.

Process, voltage, and temperature (PVT) monitors have been used in-field for years for on-chip voltage and power management, commonly implemented as dynamic voltage and frequency scaling (DVFS). Sometimes these monitors are used simply to track temperature, enabling a shutoff when readings are trending toward a catastrophic outcome. In fact, nearly 100% of designs at 16nm and below, along with 100% of data center chips, use PVT monitors.

During your wafer sort testing, you’ll get first results from these monitors, and the data can be used immediately. At this point, you will understand your thermal profile and can apply more test sequences to monitor voltage values across the die. In addition, you can perform analytics based on test, PVT, and path margin monitor IP data, and go back into the design environment to understand the real margins that you’re seeing in your silicon and correlate them to your models. The better the modeling, the more you can strip down your margins to increase performance or reduce power without sacrificing RAS.
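To make the correlation step concrete, here is a minimal sketch that compares model-predicted path margins with margins measured by on-chip monitors. The function name, the picosecond values, and the choice of summary statistics are all hypothetical; a real flow would work on far more paths and validate any margin change statistically.

```python
def margin_correlation(predicted_ps, measured_ps):
    """Compare model-predicted path margins with measured ones (picoseconds)."""
    if len(predicted_ps) != len(measured_ps) or not predicted_ps:
        raise ValueError("need two non-empty lists of equal length")
    # A positive error means silicon shows more margin than the model predicted
    errors = [m - p for p, m in zip(predicted_ps, measured_ps)]
    return {
        "mean_error_ps": sum(errors) / len(errors),
        # Smallest extra margin seen on any monitored path
        "min_error_ps": min(errors),
    }

# Hypothetical data for four monitored paths
predicted = [40, 55, 32, 61]
measured = [52, 63, 41, 70]
print(margin_correlation(predicted, measured))
# prints {'mean_error_ps': 9.5, 'min_error_ps': 8}
```

In this made-up data set the model is pessimistic by at least 8 ps on every monitored path, which is the kind of evidence that could justify tightening margins to gain performance or save power.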

To help anticipate whether something will go wrong ahead of time, you can set thresholds. For temperature monitors, the threshold sets the point where you start managing the temperature back down. You can do this because thermal response is usually relatively slow. The more aggressively you set your thresholds, the earlier you can act. Voltage monitors can be used similarly, even though what you are monitoring is a bit different.
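A threshold policy for a temperature monitor can be sketched in a few lines. The example below adds a second, lower release threshold (hysteresis) so that throttling does not toggle on every sample; that detail, along with the function name and the 95 °C and 85 °C values, is an assumption for illustration rather than something prescribed here.

```python
def throttle_states(temps_c, throttle_at=95.0, release_at=85.0):
    """Yield True while throttling should be active, one value per sample."""
    throttling = False
    for t in temps_c:
        if not throttling and t >= throttle_at:
            throttling = True    # crossed the upper threshold: start cooling
        elif throttling and t <= release_at:
            throttling = False   # cooled to the release point: stop throttling
        yield throttling

# Hypothetical temperature samples in degrees Celsius
samples = [80, 90, 96, 93, 88, 84, 86]
print(list(throttle_states(samples)))
# prints [False, False, True, True, True, False, False]
```

Lowering `throttle_at` corresponds to a more aggressive threshold: the policy reacts earlier, at the cost of throttling more often.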

When you are in your early ramp phase, you’re producing a minimal number of chips, just to make sure that the chip is functional and to confirm that it is hitting target yields before launching into full production. You collect data from the test and diagnostic results that come off the fab early on, as well as all the data throughout product manufacturing. You may identify systematic issues to address during this time.

Once your device is deployed in the field, you’ll want to take advantage of the latest strategies to see how your device is functioning while in use as it ages. For this, new capabilities are emerging, including in-field scans with Intel’s Sapphire Rapids. You can also insert an SLM software agent into a device that’s local to the system for ongoing edge analytics, as well as problem mitigation. In-field silicon management is an area where much innovation is happening right now, and new capabilities are on the near-term horizon.
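One simple form of in-field analytics is trend extrapolation: fit a line to periodic margin readings and estimate when the margin would reach a minimum acceptable value. The sketch below assumes a linear trend, which is a simplification since real aging behavior may not be linear, and every name and number in it is hypothetical.

```python
def hours_until_floor(hours, margins_ps, floor_ps):
    """Least-squares line through (hours, margin); estimate the time remaining
    until the margin reaches floor_ps. Returns None if there is no downward trend.
    Assumes `hours` is sorted in ascending order."""
    n = len(hours)
    if n < 2 or n != len(margins_ps):
        raise ValueError("need at least two samples and equal-length inputs")
    mean_t = sum(hours) / n
    mean_m = sum(margins_ps) / n
    sxx = sum((t - mean_t) ** 2 for t in hours)
    if sxx == 0:
        raise ValueError("need at least two distinct timestamps")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in zip(hours, margins_ps))
    slope = sxy / sxx
    intercept = mean_m - slope * mean_t
    if slope >= 0:
        return None  # margin is flat or improving; nothing to predict
    crossing = (floor_ps - intercept) / slope  # time at which the fit hits the floor
    return max(0.0, crossing - hours[-1])      # hours left after the last sample

# Hypothetical path margin readings (ps) taken every 1000 operating hours
remaining = hours_until_floor([0, 1000, 2000, 3000], [50, 48, 46, 44], floor_ps=30)
print(f"about {remaining:.0f} hours of margin left")
```

With these made-up readings the margin drops 2 ps per 1000 hours, so the estimate comes out at roughly 7000 hours, which is the sort of lead time that makes scheduled service possible.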

How to Bring It All Together — A Holistic SLM Strategy

HPC and data center workloads require that you test, monitor, and repair devices throughout the life of your silicon. You can’t afford to be blind to what’s happening inside your chip. When you are dealing with massive amounts of data—design data, fab data, diagnostic data, product manufacturing test data, including valuable monitor data and beyond—it makes sense to ease your journey with a holistic, systemic approach to make that data meaningful and actionable, while ensuring streamlined productivity.

Synopsys is the only company offering a holistic, complete SLM solution: an integrated platform of tools supporting the SoC lifecycle from design through production, with in-field solutions coming soon. We help you cover your bases to make your products work now and throughout their lifetime. Our Synopsys PVT Monitors, Path Margin Monitors (PMMs), and Real-time High-Speed Access and Test (HSAT) IP are all part of the Synopsys SLM family of products. They provide the in-chip sensors you need to monitor data and to run manufacturing and in-field tests. Synopsys HSAT IP enables your device to use functional I/Os such as USB and PCI Express® (PCIe®) interfaces, so you won’t need to use a lot of test and interface pins and you can continue to perform scan and diagnostics once the device is deployed and in use.

Synopsys SLM goes far beyond IP monitors, though. It includes the analysis and insights of all the different types of silicon health data in one place. Our complete solution gives you the design phase support for identifying candidate paths to monitor. Once you implement the monitoring IP, our test infrastructure products, such as those in the Synopsys TestMax™ family, connect your devices to test infrastructure and generate the scan sequences for the monitors, getting the data in and out for next-level diagnosis of possible issues. With Synopsys SLM, you gain insight into your SoC to keep your chip’s RAS on track through time, even at scale.

If you would like to learn more about Synopsys SLM, contact us.
