Senior ABI Research analyst Yih-Khai Wong recently noted that edge artificial intelligence (AI) applications and use cases are gaining popularity across many industries. Instead of transferring massive amounts of data to the cloud, edge AI significantly reduces latency by allowing cameras and sensors to rapidly—and directly—analyze petabytes of high-resolution images and videos. Edge AI systems typically require sophisticated inference capabilities enabled by complex convolutional neural networks (CNNs) and transformers. Although GPUs support CNNs and transformers, dedicated AI accelerators optimized for both models are a better option for edge applications that demand high-performance inference capabilities in the smallest area with the least power.
Read on to learn why companies use low-power, high-performance neural processing units (NPUs) such as the Synopsys ARC® NPX6 NPU IP to design dedicated edge AI processors for a wide range of applications, from surveillance anomaly detection and event-based cameras to augmented reality (AR) and virtual reality (VR). Optimized for both CNNs and transformers, the ARC NPX6 NPU IP recently received the “Best Edge AI Processor” award in the 2023 Edge AI and Vision Product of the Year Awards presented by the Edge AI and Vision Alliance.
For over a decade, CNNs were the most popular deep-learning model for vision processing applications. As they evolved, CNNs accurately supported an expanded set of use cases, including image classification, object detection, semantic segmentation (grouping or labeling every pixel in an image), and panoptic segmentation (identifying object locations as well as grouping or labeling every pixel in every object).
Although the computer vision application landscape was understandably dominated by CNNs, a new algorithm category, initially developed for natural language processing (NLP) tasks such as translation and answering queries, has made serious inroads beyond ChatGPT. Known as a transformer, the deep-learning model pioneered by Google beats CNNs in accuracy without any modifications other than replacing language tokens with image patches. Moreover, Google’s vision transformer (ViT), an optimized model based on the original transformer architecture, outperforms a comparable CNN with four times fewer computational resources.
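To make the idea of feeding image patches to a transformer concrete, here is a minimal sketch in plain Python. It only shows the splitting step: a toy single-channel image is cut into non-overlapping square patches, and each patch is flattened into a vector that plays the role a word token plays in NLP. The function name `image_to_patches`, the 4x4 toy image, and the patch size of 2 are illustrative assumptions, and the learned linear projection and position embeddings that a real vision transformer applies afterward are omitted.

```python
def image_to_patches(image, patch_size):
    """Split a 2D image (list of rows) into flattened, non-overlapping patches.

    Hypothetical helper for illustration; patches are returned in
    row-major order (left to right, then top to bottom).
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            # Flatten one patch_size x patch_size block into a single vector.
            patch = [image[top + dy][left + dx]
                     for dy in range(patch_size)
                     for dx in range(patch_size)]
            patches.append(patch)
    return patches


# Toy 4x4 single-channel "image" with pixel values 0..15 (assumed data).
toy_image = [[row * 4 + col for col in range(4)] for row in range(4)]

# Four patches of four pixels each: the sequence a transformer would consume.
patch_sequence = image_to_patches(toy_image, patch_size=2)
```

Each entry of `patch_sequence` corresponds to one position in the transformer's input sequence, so a larger image or a smaller patch size yields a longer sequence.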
This is because transformers use attention mechanisms and sparsity to improve training, focus on relevant data, and bolster inference capabilities. With these techniques, transformers can better learn and understand more complex patterns, whereas CNNs typically process each data frame without knowing what came before or after. Nevertheless, it is important to note that while transformers deliver higher accuracy, CNNs achieve significantly higher frames per second (FPS). That’s why transformers are frequently paired with CNNs: the combination bolsters both the speed and accuracy of vision processing applications.
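The attention mechanism mentioned above can be sketched in a few lines. The example below is a minimal, unoptimized version of the scaled dot-product attention used in the original transformer architecture, written in plain Python for readability. The function names and the three toy two-dimensional tokens are assumptions for illustration, and the learned query, key, and value projections, multiple heads, and any sparsity optimizations are left out.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d)) V."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim)
                  for k in keys]
        weights = softmax(scores)
        # Weighted average of the value vectors.
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Toy self-attention: the same three tokens act as queries, keys, and values.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, attention_weights = scaled_dot_product_attention(tokens, tokens, tokens)
```

Each row of `attention_weights` sums to 1 and indicates how strongly one token draws on every other token, which is how the model can weigh relevant data more heavily than the rest of the sequence.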
MobileViT, introduced by Apple in early 2022, is an example of this approach. Essentially, MobileViT merges transformer and CNN features to create a lightweight model for vision classification. The combination of transformer and convolution, when compared to the CNN-only MobileNet, delivers a three percent higher accuracy rate for the same size model (6M coefficients).
Unsurprisingly, there are now many purpose-built transformers that support image and video inference with increased accuracy. Meta’s TimeSformer, for example, achieves the best reported numbers on several challenging action recognition benchmarks, including the Kinetics-400 action recognition data set. Moreover, compared with modern 3D CNNs, TimeSformer is approximately three times faster to train and requires less than one-tenth the amount of compute resources for inference. The sheer scalability of TimeSformer enables the training of large models on extended video clips and opens the door for companies to design more intelligent edge AI systems.
Designed for surveillance video anomaly detection (SVAD), TransCNN is a hybrid transformer and CNN mechanism that leverages spatial and temporal information from surveillance videos to detect anomalous events. TransCNN detects anomalies with a high level of accuracy by combining an efficient backbone CNN model for spatial feature extraction with a transformer-based model that learns long-term temporal relationships between different complex surveillance events. TransCNN outperforms other state-of-the-art approaches, achieving high AUC values of 94.6%, 98.4%, and 89.6% on the ShanghaiTech, UCSD Ped2 and CUHK avenue datasets, respectively.
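For readers unfamiliar with the AUC metric quoted above, the following sketch shows one way to compute the area under the ROC curve from per-frame anomaly scores. It uses the pairwise-ranking formulation: AUC equals the fraction of (anomalous, normal) frame pairs in which the anomalous frame receives the higher score, with ties counted as half. The labels and scores are made-up toy values, not TransCNN outputs, and the function name `roc_auc` is an assumption for illustration.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via pairwise ranking (ties count as 0.5).

    labels: 1 for anomalous frames, 0 for normal frames.
    scores: higher means "more anomalous".
    """
    positives = [s for l, s in zip(labels, scores) if l == 1]
    negatives = [s for l, s in zip(labels, scores) if l == 0]
    if not positives or not negatives:
        raise ValueError("need at least one normal and one anomalous sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy frame-level ground truth and anomaly scores (assumed data).
frame_labels = [0, 0, 1, 1, 0, 1]
frame_scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.9]

# 8 of the 9 anomalous/normal pairs are ranked correctly, so AUC = 8/9.
auc = roc_auc(frame_labels, frame_scores)
```

An AUC of 1.0 would mean every anomalous frame outscores every normal frame, while 0.5 corresponds to random ranking.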
Vision transformer models are also helping system designers optimize AR and VR applications. For example, a research team at the University of Science and Technology of China recently combined a transformer-based encoder with a CNN-based decoder to improve conventional depth estimation. Applied to monocular depth estimation (MDE), the hybrid technique efficiently recovers the depth value of each pixel from a single RGB image. When evaluated on the NYU Depth V2 (indoor) and KITTI (outdoor) benchmarks, the approach achieves top performance scores on both quantitative and qualitative evaluations.
Another team of researchers from Stanford University, NVIDIA, and the University of Hong Kong developed the transformer-based COPILOT to predict and avoid collisions for people immersed in AR and VR environments. To train COPILOT, the research team developed a synthetic data generation framework that generates videos of virtual humans moving and colliding within diverse 3D scenarios. Extensive testing demonstrates that COPILOT successfully generalizes to unseen synthetic as well as real-world scenes, with outputs that can be harnessed to further improve downstream collision avoidance.
Although most current AI accelerators are optimized for CNNs, many of them are not ideal for transformers, as the latter demand considerable compute capabilities and memory resources to perform complex calculations and support sparsity. Optimized for both CNNs and transformers, the ARC NPX6 NPU IP is an accelerator that efficiently enables high-performance edge AI use cases within a minimal power envelope.
The ARC NPX6 NPU IP is paired with the Synopsys ARC® MetaWare MX Development Toolkit, which includes compilers and a debugger, a neural network software development kit (SDK), a virtual platforms SDK, runtimes and libraries, and advanced simulation models. To help designers accelerate time to market, the toolkit automatically compiles and optimizes models for CNNs, transformers, recommenders, recurrent neural networks (RNNs), and long short-term memory networks (LSTMs).
Many companies leverage the ARC NPX6 NPU IP to design advanced edge AI systems for a wide range of applications. At the Embedded Vision Summit last night, the ARC NPX6 NPU IP was awarded “Best Edge AI Processor” in the 2023 Edge AI and Vision Product of the Year Awards presented by the Edge AI and Vision Alliance. When announcing the award, Jeff Bier, founder of the Edge AI and Vision Alliance, told attendees that Synopsys is a consistent, agile innovator, as demonstrated by winning Edge AI and Vision Product of the Year Awards two years running: the 2022 Best Automotive Solution award for the Synopsys ARC EV7xFS Processor IP and the 2023 Best Edge AI Processor award.
“I congratulate the Synopsys team on earning this distinction for the ARC NPX6 NPU IP, which continues Synopsys’ strong tradition of innovation in processors for embedded AI and computer vision,” stated Jeff Bier.
Edge AI systems typically require advanced inference capabilities enabled by complex CNNs and transformers. Although GPUs support CNNs and transformers, purpose-built AI accelerators optimized for both models are a better choice for edge applications that demand high-performance inference capabilities in the smallest area with the least power. That’s why companies use the low-power, high-performance ARC NPX6 NPU IP to design dedicated edge AI processors for a wide range of applications including surveillance anomaly detection, event-based cameras, and AR/VR devices.