Exploring ML-Based Regression Failure Analysis

Rob van Bloomestein

Nov 15, 2022 / 3 min read

Recently we wrote about how AI-driven debug automation can accelerate the root-cause analysis of regression failures. In that blog we introduced the Synopsys Verdi Regression Debug Automation (RDA) technology that helped customers like MediaTek achieve a 4X improvement in identifying the root causes of failures in their designs. This blog takes a deeper look at the components of the RDA technology, explains how they work, and shows how users can take advantage of them to achieve similar results.

First, let’s recap what Synopsys Verdi RDA does at a high level. Each time a regression fails, teams must often examine hundreds if not thousands of failures and debug their causes. Synopsys Verdi RDA uses machine learning (ML) to automate the process of finding the root causes of failures in the design under test (DUT) and testbench. The technology automatically classifies and probes the raw regression failures, then triages them to determine whether each originates in the design or the testbench. Design and verification teams can then perform root-cause analysis to pinpoint the bug(s) triggering these failures.

Now let’s take a closer look at the components that automate and accelerate regression debug. The process starts by collecting data from the regression run. The collected data feeds into the Regression Binning application, which analyzes the log files and classifies the failures into buckets by error type, such as UVM-based messaging, user-defined rules, verification IP, and instruction-set errors. The results can then be visualized in Synopsys Verdi, where users can filter and search them and invoke interactive debug. The process has been shown to be 90% accurate and reduces overall triage time.
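As a rough illustration of the binning step, here is a minimal rule-based log classifier in Python. This is not Synopsys's implementation (Verdi RDA combines ML with user-defined rules); the rule names, patterns, and log messages below are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical classification rules; a real flow would use ML plus
# user-defined rules rather than a fixed regex list.
BIN_RULES = [
    ("uvm_error", re.compile(r"UVM_(ERROR|FATAL)")),
    ("assertion", re.compile(r"Assertion .* failed")),
    ("timeout",   re.compile(r"(TIMEOUT|watchdog expired)", re.IGNORECASE)),
]

def bin_failures(log_lines):
    """Group failure log lines into buckets by the first matching rule."""
    bins = defaultdict(list)
    for line in log_lines:
        for label, pattern in BIN_RULES:
            if pattern.search(line):
                bins[label].append(line)
                break
        else:
            # No rule matched: leave for manual triage.
            bins["unclassified"].append(line)
    return dict(bins)

logs = [
    "UVM_ERROR @ 1200ns: scoreboard mismatch on port A",
    "Assertion fifo_no_overflow failed at 900ns",
    "Simulation TIMEOUT after 10ms",
]
print(bin_failures(logs))
```

Each bucket can then be debugged once rather than failure by failure, which is where the triage-time reduction comes from.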

Verdi RDA Flow Diagram

Another component of RDA is the Debug Facilitator, which automatically collects debug data for each failure bin. It captures and creates debug checkpoints during simulation, establishing the full state of the environment at the time of failure. The checkpoints are used along with root-cause analysis results to perform interactive and reverse debug, enabling quick analysis of the identified root cause(s) in the testbench.
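To illustrate the checkpoint idea, here is a toy sketch in Python: state snapshots are taken at intervals during a run, and the latest snapshot before a failure can be restored for debug. A real simulator checkpoints its full process state, not a Python dictionary; the class name, interval, and state layout here are illustrative assumptions.

```python
import copy

class CheckpointedSim:
    """Toy checkpoint/restore model, not a real simulator interface."""

    def __init__(self, state, interval=100):
        self.state = state
        self.interval = interval
        self.checkpoints = {}  # time -> deep copy of the state at that time

    def step(self, time, update):
        """Advance one step by applying `update`, snapshotting periodically."""
        update(self.state)
        if time % self.interval == 0:
            self.checkpoints[time] = copy.deepcopy(self.state)

    def restore(self, fail_time):
        """Return the latest checkpoint at or before the failure time,
        so debug can resume just before the failure instead of rerunning."""
        times = [t for t in self.checkpoints if t <= fail_time]
        return self.checkpoints[max(times)] if times else None

def increment(state):
    state["count"] += 1

sim = CheckpointedSim({"count": 0}, interval=2)
for t in range(1, 6):
    sim.step(t, increment)
print(sim.restore(fail_time=5))  # → {'count': 4}, the snapshot taken at t=4
```

Restoring a nearby checkpoint is what makes reverse debug practical: the environment can be replayed from just before the failure rather than from time zero.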

Synopsys Verdi RDA’s Design Under Test Root Cause Analysis (DUTRCA) helps filter out non-design-related bugs, then uses time-based rollback mechanisms and TraceDiff technology to automatically narrow down DUT problems. It compares waveforms to identify differing signal values. Root-cause analysis is then performed automatically to determine whether the issues are related to the testbench or the DUT.
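The waveform-comparison idea can be sketched as follows, assuming signal traces from a passing and a failing run are available as per-cycle value lists. The function name and data layout are hypothetical, not the actual TraceDiff interface.

```python
def trace_diff(good_run, bad_run):
    """Compare per-signal value traces from two runs; for each signal that
    differs, return the first cycle index where the values diverge."""
    diffs = {}
    for sig, good_vals in good_run.items():
        bad_vals = bad_run.get(sig, [])
        for t, (g, b) in enumerate(zip(good_vals, bad_vals)):
            if g != b:
                diffs[sig] = t
                break
    return diffs

# Hypothetical traces: signal name -> value per cycle.
good = {"clk": [0, 1, 0, 1], "valid": [0, 0, 1, 1], "data": [0, 0, 5, 5]}
bad  = {"clk": [0, 1, 0, 1], "valid": [0, 0, 1, 1], "data": [0, 0, 7, 7]}
print(trace_diff(good, bad))  # → {'data': 2}: 'data' diverges at cycle 2
```

Pinpointing the earliest divergence narrows the debug window to a specific signal and time, instead of a whole failing waveform.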

Synopsys Verdi RDA also includes features that reduce the number of failures related to unknown (X) values. This is important because Xs are notoriously difficult to debug: they can propagate across many cycles and levels of logic. The X Root Cause Analysis (XRCA) technology within RDA analyzes single and multiple paths to identify the root causes of X-related issues. If multiple paths trace back to a single root cause, those paths are grouped together. The results are viewed and analyzed in Synopsys Verdi to better understand and quickly correct the issues.
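The grouping step can be sketched simply, assuming each X-propagation path has already been traced from an observed X back to its originating signal. The path representation and signal names are invented for illustration and are not XRCA's actual data model.

```python
from collections import defaultdict

def group_x_paths(x_paths):
    """Group X-propagation paths by their source signal: paths that
    originate from the same signal share one root cause."""
    groups = defaultdict(list)
    for path in x_paths:
        source = path[0]  # first element: signal where the X originates
        groups[source].append(path)
    return dict(groups)

# Hypothetical traced paths, source first, observed X sink last.
paths = [
    ["u_ctrl.rst_n", "u_ctrl.state", "u_dp.en"],
    ["u_ctrl.rst_n", "u_ctrl.mode"],
    ["u_mem.q", "u_dp.data_out"],
]
result = group_x_paths(paths)
# Two groups: one uninitialized reset feeding two paths, one memory output.
```

Collapsing many X sightings into a few source groups means one fix, such as initializing a reset, can clear a whole bucket of failures at once.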

Analyzing and debugging the numerous regression failures has traditionally been a largely manual process. That is manageable when there are only a handful of failures to debug; however, growing design complexity has given rise to an increasing number of regression failures. These often include tests that previously passed, only to fail on the next regression run, and any change to the verification environment can also produce a large number of failed tests. These failures take a great deal of manual effort, with a costly impact on resources and time-to-market. The Synopsys Verdi RDA technology saves significant time and effort for every failing test that is debugged while greatly reducing the number of such tests. The result maximizes regression utilization, focuses manual effort on true debug rather than automatable tasks, reduces turnaround time (TAT) in the debug process, and cuts the overall regression debug effort on a chip project in half.
