Embedded Processor IP for AI SoCs: 7 Benchmarking Tips 

Gordon Cooper

Oct 05, 2021 / 6 min read

From smart speakers and digital cameras to edge servers and hyperscale data centers, the types of applications that rely on deep-learning neural networks to deliver actionable insights run the gamut. Inside each of these systems are robust AI SoCs that bring them to life—SoCs that rely on powerful embedded processor IP to run the compute-intensive algorithms.

When you’re designing a chipset for an AI application, you’ll obviously want to integrate the best AI-enabled processor and neural network accelerator into your system. But how do you determine which one is truly optimal for your application’s unique requirements? Processor benchmarking is the tried-and-true method to uncover some answers. But even then, the type of AI algorithm you’ll run on the processor has a significant influence on its performance. How do you get an accurate comparison of available processors?

In this blog post, I’ll share some neural network accelerator performance benchmarking considerations, key tips for selecting the ideal embedded processor IP for your AI SoC, and insights into why programmable processors make the comparison process much easier.

Selecting Neural Networks for Accurate Benchmarking

AI algorithms are growing more complex and more specific to the application at hand. There are also lots of variables that affect how well a processor performs for a given application—enough so that it’s difficult to get an apples-to-apples comparison. A processor used to run a relatively simple algorithm may not measure up when applied to a more complex one, leaving you with power and performance benchmark data that doesn’t accurately predict results in silicon.

Benchmarking an AI-enabled processor to run a convolutional neural network (CNN) involves many considerations. At a simple level, if you’ve got a common neural network and the same data and coefficients, you can run that network through your architecture to generate a baseline result, typically an accuracy measurement. However, in a real-time, embedded system, you’ll also need to factor parameters such as power, area, latency, and bandwidth into your benchmarking for a more realistic picture. Understanding overall SoC performance also involves considering aspects such as the chip’s process node, its clock speed, and optimizations applied to the network (like compression and quantization). Since the purpose of benchmarking is to compare two or more architectures and to verify that a given architecture will meet your application’s requirements, it’s important to be clear about your system and its limitations.
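
One practical way to enforce that clarity is to record the measurement conditions alongside every result, so two architectures are only ever compared under like-for-like constraints. Below is a minimal Python sketch of that idea; the record fields and the `comparable` helper are hypothetical illustrations, not part of any standard benchmarking schema:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BenchmarkRun:
    """One benchmark result plus the conditions it was measured under."""
    network: str           # e.g. "ResNet-50"
    process_node_nm: int   # silicon process node assumed for the numbers
    clock_mhz: float       # clock frequency used for the run
    quantization: str      # e.g. "int8" or "fp16"
    dram_bw_gbps: float    # memory-bandwidth ceiling imposed during the run
    fps: float             # measured frames per second
    top1_accuracy: float   # accuracy after any compression/quantization

def comparable(a: BenchmarkRun, b: BenchmarkRun) -> bool:
    """Two results are comparable only if measured under the same constraints."""
    def key(r: BenchmarkRun):
        return (r.network, r.process_node_nm, r.quantization, r.dram_bw_gbps)
    return key(a) == key(b)
```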

There’s currently no industry-standard neural network when it comes to AI hardware benchmarking, but the MLPerf benchmark suite comes close. Developed by MLCommons, an open engineering consortium, the MLPerf benchmarks are de facto industry-standard metrics that measure machine-learning performance and now encompass data sets and best practices. On the inference side, the consortium’s benchmark suites cover data center, edge, mobile, and tiny categories.

One of the more commonly used neural networks in the MLPerf benchmarking suite is ResNet-50, a CNN that’s 50 layers deep and performs image classification. It can be used as a building block to create more advanced benchmarking neural networks. The neural networks provided by MLPerf are a good starting point for evaluating the architectural efficiency of a given processor. Of course, every processor vendor is incentivized to fully optimize its neural network accelerator for MLPerf, which means that MLPerf results alone won’t necessarily tell you how good a vendor’s tools are. Tool quality is essential, since the tools must be able to perform accurate neural network mapping to optimize for a specific processor. If you use MLPerf as a starting point for benchmarking, choose some non-standard neural networks as well, and give your vendors a short turnaround to optimize them to get a better sense of how well their processors will perform.
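
To make the fps numbers concrete, here is what a simple host-side throughput measurement of ResNet-50 looks like. This is a minimal sketch assuming PyTorch and torchvision are installed; benchmarking real embedded IP would instead run the vendor’s mapped network on their hardware or simulator, but the measurement structure (warm-up, timed loop, frames divided by elapsed time) is the same:

```python
import time
import torch
from torchvision.models import resnet50

model = resnet50().eval()          # random weights are fine for throughput
x = torch.randn(1, 3, 224, 224)    # one 224x224 RGB frame

with torch.no_grad():
    for _ in range(10):            # warm-up: stabilize caches and allocators
        model(x)
    n = 100
    start = time.perf_counter()
    for _ in range(n):             # timed inference loop
        model(x)
    elapsed = time.perf_counter() - start

print(f"ResNet-50: {n / elapsed:.1f} fps ({1000 * elapsed / n:.1f} ms/frame)")
```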

Defining Benchmarking Parameters for Embedded AI Processors

Now that we’ve discussed some of the considerations at play in a benchmarking exercise, it’s time to share seven tips for selecting embedded AI processor IP for your SoC:

  1. Use a mix of standard and custom neural networks for performance benchmarking of AI hardware. Benchmarking off-the-shelf neural networks measures the vendor’s ability to perform optimizations by hand, while non-standard and custom neural networks measure the ability of the vendor’s tools to map algorithms.
  2. Use cycles/frame or frames per second (fps) @ xxHz for performance measurements; see the sketch after this list. If fps is used, the frequency should also be stated (use fMAX/peak with caution, as the industry does not have a standard definition of what “peak” means). Also, note that trillions of operations per second (TOPS) is a first-order marketing number and should not be used for benchmarking. TOPS tells you how many compute operations an AI chip can handle in one second, but it doesn’t indicate anything about the types or quality of operations a chip can process, nor does it factor in power consumption.
  3. Pair compression (which will improve fps) with accuracy. Too much compression negatively impacts accuracy, so it’s important to get both measurements.
  4. Specify the bandwidth constraints. Unlimited bandwidth results in benchmarking that’s too optimistic, since memory bandwidth is a growing bottleneck for AI systems, especially as the sizes of neural network accelerators grow.
  5. Question power simulation data. Vendor power estimates can vary widely. When possible, choose emulation over simulation and/or static analysis for AI workloads.
  6. Align area with benchmarks. Ensure that the area provided is the same configuration used for benchmarking (memory sizes, configuration options, etc.).
  7. Align area/power with operating temperature, since leakage varies significantly for different use cases.
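
To illustrate tip 2, the sketch below converts cycles/frame to fps at a stated clock, and shows why a peak TOPS figure overstates delivered throughput once real-world utilization is factored in. All numbers are hypothetical, chosen only for illustration:

```python
def fps_from_cycles(cycles_per_frame: float, clock_hz: float) -> float:
    """Tip 2: fps is only meaningful together with the clock frequency."""
    return clock_hz / cycles_per_frame

clock_hz = 1.0e9    # 1 GHz clock
cycles = 12.5e6     # 12.5M cycles to process one frame
print(f"{fps_from_cycles(cycles, clock_hz):.0f} fps @ 1 GHz")  # 80 fps

# Why peak TOPS misleads: real networks rarely keep every MAC unit busy.
peak_tops = 4.0        # the marketing number on the datasheet
utilization = 0.35     # achieved MAC utilization on this workload
effective_tops = peak_tops * utilization
print(f"Effective throughput: {effective_tops:.2f} of {peak_tops} peak TOPS")
```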

Let’s take a closer look at power, since it is such a critical element in the power/performance balance for compute-intensive AI workloads. As dynamic and static power consumption are both affected by process technology scaling, there’s a continual need to accept trade-offs to balance your power and performance demands. Early and accurate power estimation of IP blocks is essential to align your processor selection with your application’s power budget. Since individual performance and power metrics aren’t comprehensive enough, it’s important to also consider the conditions under which the power estimation is conducted. For example, when evaluating a CNN application’s power consumption, the most accurate metric to use is energy, in joules per frame, for the representative neural networks. It is challenging, though, to compute the average power per frame. Many designers opt to measure the energy efficiency of a single convolution layer of a neural network, but even this approach is fraught with hurdles, because a single “representative” layer isn’t necessarily representative.
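
The joules-per-frame arithmetic itself is simple once you have a trustworthy average power figure; as noted above, obtaining that figure is the hard part. A minimal sketch with hypothetical numbers:

```python
def joules_per_frame(avg_power_w: float, fps: float) -> float:
    """Energy per inference: average power divided by frame rate."""
    return avg_power_w / fps

# Hypothetical example: a 2 W accelerator sustaining 50 fps.
print(f"{joules_per_frame(2.0, 50.0) * 1000:.0f} mJ/frame")  # 40 mJ/frame
```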

For maximum power measurement accuracy, you need a solution that can execute billions of cycles of a CNN on a full-layout netlist. Simulation would take far too long. Emulation, on the other hand, can help IP developers as well as SoC designers accurately compute the power of embedded processors over hundreds of millions of processed cycles in minutes or hours, rather than weeks or months.

Ensuring Confidence in Your AI Processor IP Selection

Even once you’ve completed benchmarking and selected your processor, the neural network landscape doesn’t stand still. As neural networks continue to evolve, AI-enabled processors need to keep up with the latest developments in what is currently a moving target. You want a hardware-accelerated solution that is as power- and area-optimized as possible while also being programmable enough to provide flexibility. A programmable processor’s code can be changed to support new features as they become available. Indeed, managing new neural network features in software provides a level of future-proofing.

Synopsys technical experts have collaborated closely with customers on hundreds of standard and non-standard neural network benchmarks using programmable Synopsys DesignWare ARC® EV Processors for embedded vision, providing performance in fps and accuracy results, as well as data on power, area, bandwidth, and latency. Using Synopsys’ extensive development tools, our customers can better balance their trade-offs, such as improving bandwidth at the expense of larger internal SRAM. Increasing area for a larger neural network and then running it at a lower frequency to save power is another trade-off option. Others might prioritize performance over accuracy. ARC EV Processors include tools to benchmark neural networks quickly.

As part of Synopsys’ broad portfolio of AI solutions, Synopsys offers specialized processing, memory performance, and real-time connectivity IP to accelerate your time-to-market. In addition to the ARC EV Processors, the Synopsys ASIP Designer tool supports custom processing with parallelism and specialized datapath elements for the design and implementation of application-specific instruction-set processors (ASIPs). Synopsys DesignWare Memory IP offers efficient architectures for different memory constraints including bandwidth, capacity, and cache coherency. And Synopsys IP provides reliable, real-time connectivity to CMOS image sensors, microphones, and motion sensors for AI applications including vision, natural language understanding, and context awareness.

Summary

From the smart speaker that recognizes your voice commands to the high-performance computing application that models climate change patterns, AI is becoming ubiquitous in our lives. The AI SoCs that make these applications possible require processor IP that’s ready to take on the demands of compute-intensive workloads. As benchmark neural networks continue to evolve, choosing embedded processors that provide predictable power, area, and performance is critical to building the high-performing AI SoC architectures your applications need.
