Why Reliability, Availability & Serviceability Matter in HPC 

Charlie Matar, Pawini Mahajan, Rita Horner

Dec 14, 2022 / 5 min read

While once the domain of large data centers and supercomputers, high-performance computing (HPC) has become rather ubiquitous and, in some cases, essential in our everyday lives. Because of this, reliability, availability, and serviceability, or RAS, is a concept that more HPC SoC designers should familiarize themselves with.

RAS may sound like a self-explanatory term, but what does it really involve when it comes to HPC SoCs? Data center operators have long maintained service level agreements with customers to provide assurance of system uptime. RAS complements these agreements and can now be supported with new technology that ultimately yields actionable insights. Read on to learn more about why silicon lifecycle management (SLM), embedded monitoring IP, and the right design and verification tools can be your allies in the quest to achieve high levels of RAS for your HPC designs.

3 Key High-Performance Computing Components

The video footage captured by a home’s security doorbell or a building’s surveillance system. Financial and business operations modeling. Scientific and medical research. Augmented reality and virtual reality. The list goes on when it comes to application areas that rely on HPC. The proliferation of data collected by our devices and systems, AI-fueled analytics, the availability of massive amounts of compute resources, and the cloud are converging to make it possible to quickly derive useful, actionable insights, making HPC an integral part of a much wider array of applications than when the very first supercomputers emerged in the 1940s.

A typical HPC infrastructure today consists of three key elements: compute, network, and storage. Each requires a certain level of performance, latency, power efficiency, scalability, productivity, and security. Let’s take a closer look at each element:

  • Compute consists of CPUs and GPUs, accelerators, network on chip (NoC), and compute servers. This is where the high-performance data processing takes place. Complex multi-core or even multi-die system architectures, large memories with fast access, high-bandwidth I/O interfaces, power/cooling management, and security are key characteristics. In-chip monitoring and analysis can support RAS goals.
  • Network consists of switches and routers, adapters, bridges, repeaters, the network interface card (like a SmartNIC), and optical and electrical interconnects. This element delivers high-performance connectivity, ideally with high throughput, low latency, energy efficiency, configurability and scalability, real-time monitoring and reporting, and security. Debug capabilities, forward error correction (FEC), and IP can support RAS requirements.
  • Storage consists of a solid-state drive (SSD) or hard disk drive (HDD), storage area network (SAN), and network attached storage (NAS). Ideally, the storage element should provide high-bandwidth storage, reduced data transfer energy and latency, flexibility, scalability, reliability, and security. Capabilities like built-in self-test (BIST), error correction code (ECC), and redundancy can facilitate high levels of RAS.

There are two primary types of HPC systems: the homogeneous machines and the hybrid ones. Homogeneous machines only have CPUs. Hybrids, by contrast, have both GPUs and CPUs, with GPUs running the tasks and CPUs overseeing the computation.

HPC clusters can be composed of large numbers of servers, where the total physical size, energy use, or heat output of the computing cluster might become a serious issue. Furthermore, there are requirements for dedicated communications among the servers that are somewhat unique to clusters.

Because small design differences amount to large benefits when multiplied by the number of servers in the clusters, we are seeing the emergence of server designs that are optimized for HPC. Sometimes these are designs targeted at large, public Web operators, such as search engine firms, that deliver similar benefits in HPC clusters. However, they can also offer features only appropriate for HPC users. For example, if the system were designed to provide the cluster interconnect differently, there might be potential for significant cabling reductions.

Actionable Insights Through In-Chip Monitoring and Analysis

The utility of HPC lies in its ability to process an enormous amount of data—petabytes or even zettabytes—and to run complex models in real time (or near real time). It goes without saying that anytime an HPC system goes down, it leads to lost money and business discontinuity. The ramifications become steeper with mission-critical applications. At advanced nodes, with large monolithic dies or complex architectures such as multi-die, it becomes even more challenging to meet RAS requirements.

Depending on the criticality of the application at hand, systems can be built with backups that provide redundancy in the event of failures. Beyond redundancy, there’s much more you can do at the system and chip levels to meet RAS targets. This is where SLM plays a big role, providing intelligent, automated in-chip monitoring IP and methodologies to generate actionable insights through each phase of a system’s life.

Designers have been embedding monitors and sensors into their chips for a number of decades; however, the technology has evolved to a point where it can now provide much more accurate data with higher levels of granularity. This enables greater visibility into the device’s real-time environmental, structural, and functional conditions. Examples include the monitoring of thermal hotspots, process variation, and voltage supply, as well as accurate measurement of timing margins.

Thanks to embedded and cloud-based analytics, along with the availability of a unified SLM solution, design teams will be able to build up a continuous, real-time picture of the silicon health of their device, not only during the in-design, in-ramp and in-production phases but also during in-field operation. They can get a better understanding of root cause and debug and repair right away, reducing costs and potential harm. Issues that SLM can address include transistor aging and delay faults. To get a picture of the benefits this brings, consider a satellite that has a bug. Normally, it can take many weeks to get a repaired board back from the lab to install in the satellite, taking it out of commission for an extended period for troubleshooting and repair. By moving fault detection and fault repair in the field through SLM technologies, teams can keep their systems up and running with fewer and shorter disruptions.

Taking a look at a data center, we can see another example highlighting how SLM can facilitate meeting RAS requirements.

  • At the silicon level, the ability to remotely debug in the field is critical for teams striving to successfully hyperscale their data center. SLM provides the remote telemetry and monitoring to make this possible.
  • At the system level, precise clock throttling, another SLM capability, is essential for maximizing data throughput and CPU, GPU, and AI engine utilizations.
  • At the data center level, monitoring server performance, network congestion, and disk utilization with SLM tools are key to detecting and predicting data outages, which can increase uptime.
  • At the hyperscale level, teams can utilize SLM to minimize on-chip thermal and supply stress to extend reliability.
  • For die-to-die high-speed interfaces, there’s the signal integrity monitoring that SLM provides, which, along with redundancy for interface integrity, helps ensure the robustness of chiplet designs.

Summary

Rather than a disconnected set of point tools, an end-to-end solution that brings together everything from design calibration analytics to in-chip monitoring and system performance optimization can make the process of addressing RAS targets a more seamless one. Synopsys is unique in providing such an end-to-end flow, with our Silicon Lifecycle Management Family, complemented by a broad portfolio of low-latency, silicon-proven IP and design and verification technologies for HPC applications. Within this set of solutions are features such as physically aware silicon monitors, cloud analytics, and embedded analytics and optimization technologies. SoC sensor IP and process monitors, also part of the family, are used for in-design, in-ramp, in-production and in-field optimizations. Reliable silicon at the manufacturing phase and in field, where monitors can collect real-time data about the silicon, plus comprehensive test and debug solutions, ensure high RAS.

Given the increasingly wider range of applications that now rely on HPC, maintaining high levels of reliability, availability, and serviceability of these systems is a critical consideration across the board. Achieving optimal levels of RAS to support everything from streaming video to climate change modeling is another important element in keeping our digital, smart everything world running at top speed.

Continue Reading