From Silicon To Software

 

Why Processor Workloads Are Changing as Moore’s Law Slows

By Rich Collins, Product Marketing Director, and Markus Willems, Sr. Manager of Product Marketing; Synopsys Solutions Group

There’s plenty of healthy debate on how valid Moore’s law remains today. But there’s no denying that the predictable doubling of transistor density every couple of years, which has guided the semiconductor industry for so long and delivered steady performance gains along the way, is slowing down. Eventually, it will simply be physically impossible to build chips with ever smaller (and ever more) transistors. However, engineering ingenuity is alive and well, and chip designers continue to find ways to push the envelope toward better performance, power, and area (PPA).

Optimizing processor architectures for changing workloads is one outcome of this quest to extract more from Moore’s law. As it becomes futile to generate more performance simply by moving to smaller geometries and cranking up the clock frequency, design engineers are considering different SoC architectures that are more efficiently tailored to their demanding and ever-changing workloads. Algorithms, too, have become more innovative to address the needs of applications like artificial intelligence (AI) and sensor fusion; this impacts SoC architecture decisions as well. In this blog post, we’ll examine how changing processing workloads and increasingly innovative algorithms are driving the need for new processor IP—and how designers of embedded SoCs are addressing the never-ending demand for higher performance while staying within their area and power budgets.

Performance Demands Drive New Processor Architectures

High-performance embedded applications like solid-state drive (SSD) flash storage, sensor fusion, AI, and 5G wireless share challenges that are driving the need for architectural improvements on the processor side:

  • Logic speeds are increasing faster than embedded memory access times
  • Clock speeds for most embedded designs have topped out in the 1-GHz to 2-GHz range
  • Clock speeds also need moderation to manage power consumption
  • Demands for performance, functionality, and features continue to grow

To address these competing requirements, design engineers are implementing more heterogeneous processing elements to extract higher performance for varying types of workloads. Linux-capable CPUs, real-time controllers, digital signal processors (DSPs), GPUs, and neural network accelerators are examples of the types of processors commonly found in advanced SoCs today. In an SoC with a heterogeneous processor architecture, designers combine several of these processor types, with each core handling the workload it is specialized for.

The advantages of a heterogeneous approach, however, are not limited to applications with large workloads. A smart home automation hub can be designed in this way, in the interest of processing and power efficiency. These devices handle a variety of different functions: image processing, voice recognition, natural language processing, control functions, etc. A task like natural language processing calls for a heavy-duty processor, which also consumes a lot of power. At the same time, the device itself is always on, so it’s impractical to run the always-on tasks on that same heavy-duty processor. To save power, a small, low-power controller with built-in digital signal processing support can manage always-on tasks like keyword wake-up or face detection. Once the device is fully powered up, more compute-intensive tasks like natural language processing or face recognition can be shifted to a larger, more powerful core, like a high-performance DSP or an AI acceleration engine.
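
To make the hand-off concrete, here is a minimal Python sketch of the always-on versus heavy-duty split described above. It is an illustration only: the `Hub` class, the core names, and the `wake_event` flag are hypothetical, and the task names are simply borrowed from the examples in the text.

```python
from dataclasses import dataclass, field

# Task names follow the examples in the text; the grouping is an assumption.
ALWAYS_ON_TASKS = {"keyword_wake_up", "face_detection"}
HEAVY_TASKS = {"natural_language_processing", "face_recognition"}


@dataclass
class Hub:
    """Hypothetical model of a hub with one small and one large core."""
    big_core_powered: bool = False
    log: list = field(default_factory=list)

    def run(self, task: str, wake_event: bool = False) -> str:
        """Return the (hypothetical) core that a task is routed to."""
        if task in ALWAYS_ON_TASKS:
            core = "low_power_controller"
            # A detected keyword or face powers up the larger core.
            if wake_event:
                self.big_core_powered = True
        elif task in HEAVY_TASKS:
            if not self.big_core_powered:
                raise RuntimeError("large core is powered down; wake it first")
            core = "high_performance_core"
        else:
            raise ValueError(f"unknown task: {task}")
        self.log.append((task, core))
        return core

    def sleep(self) -> None:
        """Power the large core back down once the heavy work is done."""
        self.big_core_powered = False


hub = Hub()
hub.run("keyword_wake_up")                   # listening, large core stays off
hub.run("keyword_wake_up", wake_event=True)  # keyword heard, large core wakes
hub.run("natural_language_processing")       # heavy task runs on the large core
hub.sleep()
```

The point of the sketch is that the large core is only powered while a compute-intensive task actually needs it; everything that must run continuously stays on the small controller.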

Heterogeneous workloads became common in high-volume, small-geometry devices like mobile phones, which can easily contain more than 50 cores, each specialized for a specific set of tasks. Today, these types of workloads are making their way into a wider variety of application areas. Even modern cars are evolving from having distributed controllers throughout the vehicle to centralized, heterogeneous multi-core controllers. The emergence of the software-defined vehicle to support increased levels of automation and sophistication of in-vehicle applications is driving the automotive industry to a heterogeneous computing model.

Even within a homogeneous, multi-core processor implementation, designers are looking for the ability to run heterogeneous workloads. A large-scale, coherent multi-core cluster (more than eight cores) with a high-bandwidth interconnect and a customizable processor provides yet another architectural option for heterogeneous tasks, especially those that need to share coherent data. For example, in a computational storage application, a subset of the core cluster can be used as the host processor (running Linux) while another subset of cores running an RTOS can manage the SSD storage arrays. All of this can be managed with a single, multi-core cluster.

What’s Needed to Support Specialized Processor Architectures?

The specialized processor architectures we’re now seeing present a new set of challenges for programmability and tooling. After all, the complex, multi-processing architectures that are emerging to support these workloads are hardly useful if the chips are difficult to program. This calls attention to the need for a robust set of tools that are aligned to these types of architectures, easing the process for engineers to program and port their applications to the different processors.

A heterogeneous processing architecture raises the big question of how all the components should be connected to best support parallel processing. Certain cores may be placed in close proximity and, as a result, could share resources like cache memory. It’s useful, then, to take a data-centric view of the architecture and determine where the data is located, where it’s processed, and where it’s needed. Keeping data local to a processing cluster reduces power consumption and read latency while offloading the network-on-chip. To provide a flexible memory scheme for processing local data, shared SRAMs can be dynamically partitioned between the L2 cluster cache and the cluster shared memory (for use by other processing elements that might share the resource with the multi-core cluster). The designer can decide how these memory resources should be shared across the different processors. High-bandwidth and low-latency interconnects also ensure fast data transfer between the cores.
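
As a rough illustration of the partitioning idea, the Python sketch below models a shared SRAM that is split between L2 cache and cluster shared memory. The `SharedSram` class, the bank granularity, and the sizes are all assumptions made for the example; they do not describe any particular product.

```python
from dataclasses import dataclass


@dataclass
class SharedSram:
    """Hypothetical shared SRAM split between L2 cache and shared memory."""
    total_kib: int
    bank_kib: int          # assumed granularity at which the split can move
    l2_cache_kib: int = 0

    def __post_init__(self) -> None:
        if self.bank_kib <= 0 or self.total_kib % self.bank_kib:
            raise ValueError("total size must be a whole number of banks")
        self.repartition(self.l2_cache_kib)

    @property
    def shared_memory_kib(self) -> int:
        # Whatever is not used as L2 cache is available as shared memory.
        return self.total_kib - self.l2_cache_kib

    def repartition(self, l2_cache_kib: int) -> None:
        """Move the boundary between L2 cache and cluster shared memory."""
        if not 0 <= l2_cache_kib <= self.total_kib:
            raise ValueError("L2 size must fit inside the SRAM")
        if l2_cache_kib % self.bank_kib:
            raise ValueError("L2 size must be a whole number of banks")
        self.l2_cache_kib = l2_cache_kib


# Illustrative numbers only: 2 MiB of SRAM in 256-KiB banks.
sram = SharedSram(total_kib=2048, bank_kib=256, l2_cache_kib=1024)
sram.repartition(512)  # give more of the SRAM to other processing elements
print(sram.l2_cache_kib, sram.shared_memory_kib)
```

The sketch only tracks sizes. In real hardware, changing the split at run time would presumably also involve steps such as flushing or invalidating the affected cache contents, which are left out here.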

Another important consideration for the processor architecture involves the power and clock domains. Independent power and clock domains help simplify the physical design, while minimizing power and area. Cores in the cluster can reside in their own power domains, with each core running its own clock.

Inside the Processor Architect’s Toolbox

To facilitate an effective heterogeneous processing architecture, the processor architect has a variety of options in the toolbox.

  • Multi-core performance scaling for applications that can be parallelized can result in near-linear speedup; key to making this a reality are a low-latency processor cluster architecture and sufficient communication bandwidth
  • Hardware acceleration via a dedicated processor or custom hardware
  • Application-specific instruction-set processor (ASIP) plus firmware, which provides the flexibility of the processor plus software option along with the power and area efficiency of hardware accelerators
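
To get a feel for when multi-core scaling stays near-linear, one common back-of-the-envelope model is Amdahl’s law, which is not part of the discussion above but fits it well. The Python sketch below applies it, with an extra `overhead_per_core` term that is purely our own illustrative stand-in for communication cost between cores.

```python
def amdahl_speedup(parallel_fraction: float, cores: int,
                   overhead_per_core: float = 0.0) -> float:
    """Estimate speedup over a single core.

    parallel_fraction: share of the work that can be parallelized (0..1).
    overhead_per_core: assumed extra time, as a fraction of the single-core
        run time, added by each additional core (illustrative only).
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_time = 1.0 - parallel_fraction
    parallel_time = parallel_fraction / cores
    communication_time = overhead_per_core * (cores - 1)
    return 1.0 / (serial_time + parallel_time + communication_time)


for n in (1, 2, 4, 8, 16):
    ideal = amdahl_speedup(1.0, n)
    typical = amdahl_speedup(0.95, n)
    with_overhead = amdahl_speedup(0.95, n, overhead_per_core=0.01)
    print(f"{n:2d} cores: {ideal:5.2f}x  {typical:5.2f}x  {with_overhead:5.2f}x")
```

With a fully parallel workload and no overhead, eight cores give an 8x speedup. With 95% of the work parallelizable, the same eight cores give roughly 5.9x, and any per-core communication cost lowers that further, which is why cluster latency and bandwidth matter so much.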

Broad Portfolio of Processor IP Solutions

Processors and tools with the right capabilities can help mitigate some of the challenges of developing tailored, heterogeneous, multi-core architectures. When selecting processors, the evaluation process should cover performance, power, scalability, and flexibility specifications. Another important consideration is the ecosystem for creating these architectures: tools that are based on de facto or industry standards can reduce risks and provide consistency.

In our DesignWare® ARC® Processor IP portfolio, Synopsys provides power- and area-efficient 32-/64-bit CPU and DSP cores, vision processors, subsystems, and software development tools. ARC processors are also supported by an array of third-party tools, operating systems, and middleware from leading industry vendors enrolled in the ARC Access Program.

Example of a next-generation embedded processor architecture.

Our tools include development kits, compilers, debuggers, simulators, open-source software, GNU tools, and documentation that are designed to simplify the development and programming processes. For example, with our APIs, programmers can write, say, a neural network application without worrying about how the workload gets partitioned or processed at the hardware level. The APIs handle this via a layer that determines the most efficient processing element on which to run the software. Another example is the debugging environment, where the user is presented with a concurrent view of different heterogeneous cores for an efficient, synchronized debugging process.
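
Conceptually, a dispatch layer of this kind can be pictured as a lookup that picks the cheapest processing element able to run a given kind of work. The Python sketch below is a toy model of that idea only. It is not the actual API; the `COST` table, the element names, and the `pick_element` function are hypothetical, and the cost values are made up for illustration.

```python
# Hypothetical relative cost (lower is better) of running each kind of
# workload on each processing element. A missing entry means the element
# cannot run that kind of workload at all.
COST = {
    "cpu": {"control": 1.0, "fft": 8.0, "conv2d": 20.0},
    "dsp": {"fft": 1.0, "conv2d": 6.0},
    "nn_accelerator": {"conv2d": 1.0},
}


def pick_element(kind: str, available: dict = COST) -> str:
    """Return the name of the cheapest available element for a workload."""
    candidates = {
        name: costs[kind]
        for name, costs in available.items()
        if kind in costs
    }
    if not candidates:
        raise LookupError(f"no processing element can run {kind!r}")
    return min(candidates, key=candidates.get)


print(pick_element("conv2d"))   # the accelerator is cheapest when present
print(pick_element("control"))  # only the CPU handles control code here

# If the accelerator is not part of the SoC, the same call falls back.
without_npu = {k: v for k, v in COST.items() if k != "nn_accelerator"}
print(pick_element("conv2d", without_npu))
```

The application code calls `pick_element` (or something like it) the same way regardless of which elements are present, which is the property that lets the programmer ignore hardware-level partitioning.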

When off-the-shelf processor IP doesn’t quite meet a design’s unique workload requirements, or if there’s a need to future-proof processing competencies, a team might opt to design an application-specific instruction-set processor, or ASIP, which provides software programmability within its application domain. The Synopsys ASIP Designer tool automates the design and implementation of ASIPs, providing rapid exploration of architectural choices, generation of an efficient C/C++ compiler-based software development kit that automatically adapts to architectural changes, and automatic generation of power- and area-optimized synthesizable RTL.

Summary

The slowing of Moore’s law has opened new avenues of innovation from chip designers seeking to extract the optimal PPA for their target applications. Specialized processor architectures have emerged as an answer to the demands of changing processor workloads and sophisticated new algorithms. Most modern SoCs have processing requirements that can be PPA-optimized by a combination of configurable, off-the-shelf IP and dedicated, specialized hardware accelerators. Designing these architectures can be challenging, but these challenges are mitigated when working with a processor IP and tools vendor that brings a broad spectrum of solutions to the table.
