Compute performance and energy efficiency have long been a driving force behind technology development, but increasing concern around the environmental footprint of the tech sector has brought them into even sharper focus. Demand for compute is increasing, yet carbon emissions must be reduced. Put simply, if we are to tackle climate change, decarbonizing compute has never been more important.
At the end of 2020, Arm committed to reaching carbon net zero by 2030, taking a science-based approach to cutting absolute emissions from our business operations by at least 42%. As part of that commitment, we have pledged to source 100% renewable electricity across our estate by 2023. We are also moving our compute workloads from x86 to AArch64 processors, which achieve better throughput per watt, and shifting from on-premises to cloud compute, which helps us avoid building on-premises capacity just to cope with peak verification demands. These routes towards carbon net zero are highly tangible and easy to visualize.
What is less obvious, however, is the impact of the engineering workflows required to develop Arm’s intellectual property (IP). These sophisticated workflows consume billions of compute hours per year and, of course, require a significant amount of energy to power them. The challenge for Arm engineering is to increase workflow efficiency and reduce time and energy consumption, while achieving results of the same, or higher, quality.
Saving energy through Machine Learning and data science
To ensure that we deliver the highest-quality IP, engineers continuously innovate and extensively test Arm technology. The underlying workflows are complex and differ across our engineering groups. They all, however, help to create new designs or better software, detect bugs and prevent bug escapes, or improve the power consumption, performance, and area (PPA) of our products.
By applying Machine Learning (ML) and data science (DS) to streamline these workflows, we not only help to deliver these objectives but also spend compute hours more efficiently. Such productivity improvements help us deliver the growing number of products we release while maintaining or increasing quality, despite increased product complexity and shorter time-to-market. You could argue that, without ML and DS, the growing complexity of designs would make them intractable to verify.
However, productivity improvements also create an opportunity to reduce energy consumption and produce less carbon – and for ML and DS in engineering, that opportunity is huge.
Integrating ML and DS is a joint venture between our engineers and data scientists. To concentrate our efforts, we currently focus on three areas: verification, software and physical design. Here are just a few examples of what we are working on.
Our IP must be tested for compliance against test suites, such as AVS or DVS (architecture or design verification suites), which can contain millions of individual tests. These tests might be run daily, clocking up a considerable number of compute hours. By applying ML and DS, we can derive, and subsequently run, a subset of tests that is representative of the entire suite. Then, using the resulting pass or fail patterns, we can predict the outcome of every test in the suite without running them all.
Whilst at key milestones we use the full test suite, by minimizing the number of tests run in between these points, we can reduce compute hours and conserve energy without compromising on accuracy and quality.
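As a rough illustration only, and not a description of the models actually used, the idea of a representative subset can be sketched in a few lines of Python. The sketch assumes a history of earlier full-suite runs and simply groups tests whose past pass/fail patterns are identical, runs one test per group, and projects its result onto the rest of the group. The test names, the history and the function names are all hypothetical.

```python
from collections import defaultdict


def pick_representatives(history):
    """Group tests whose past pass/fail patterns are identical.

    history maps a test name to a tuple of past outcomes (True = pass),
    one entry per earlier full-suite run. Returns a dict mapping each
    chosen representative to every test that shares its pattern.
    """
    groups = defaultdict(list)
    for test, outcomes in sorted(history.items()):
        groups[tuple(outcomes)].append(test)
    # The alphabetically first test of each group stands in for the rest.
    return {members[0]: members for members in groups.values()}


def predict_suite(representatives, observed):
    """Project each representative's fresh result onto its whole group."""
    return {
        test: observed[rep]
        for rep, members in representatives.items()
        for test in members
    }


# Hypothetical history: four earlier full runs of a five-test suite.
history = {
    "t_add": (True, True, False, True),
    "t_sub": (True, True, False, True),
    "t_mul": (True, False, True, True),
    "t_div": (True, False, True, True),
    "t_ldr": (True, True, True, True),
}
representatives = pick_representatives(history)

# Only the three representatives are executed in this run (made-up results).
observed = {"t_add": False, "t_div": True, "t_ldr": True}
predicted = predict_suite(representatives, observed)
```

Here three of the five tests are run and the other two outcomes are inferred. A realistic model would also have to cope with patterns that are similar rather than identical and would report a confidence for each prediction, which is one reason the full suite is still run at key milestones.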
Coverage is a measure of how much of a design’s code or functionality has been exercised during testing. We run a large number of tests to achieve maximum coverage, since high coverage levels are one of the steps towards finding, and fixing, more bugs.
For each of the tests used in coverage workflows, we can configure tens to thousands of parameters that influence the test’s behavior, and we can set the number of random seeds used per test, which determines total test volume. Using ML and DS, we analyze the relationship between coverage, each test and each test’s parameters. We then use this insight to develop ML and DS models that automatically select the ‘best’ tests, determine an optimal number of random seeds per test and set the tests’ parameters.
We have found such ML-derived tests to be more efficient, allowing a reduction in the test volume required to achieve the same or higher coverage levels and, in turn, reducing compute hours and energy consumed.
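To show why fewer tests can reach the same coverage, here is a minimal sketch that uses a classic greedy set-cover heuristic as a stand-in for a learned model. It assumes we already know which coverage points each test-and-seed combination hits; the test names, coverage points and the greedy_select function are hypothetical.

```python
def greedy_select(coverage_by_test):
    """Pick tests one at a time, always taking the one that adds the most
    not-yet-covered points, until the full suite's coverage is matched."""
    target = set().union(*coverage_by_test.values())
    covered, chosen = set(), []
    while covered != target:
        # Iterating over sorted names makes tie-breaking deterministic.
        best = max(
            sorted(coverage_by_test),
            key=lambda test: len(coverage_by_test[test] - covered),
        )
        chosen.append(best)
        covered |= coverage_by_test[best]
    return chosen


# Hypothetical coverage points hit by each test/seed combination.
coverage_by_test = {
    "test_a/seed1": {"p1", "p2", "p3"},
    "test_a/seed2": {"p1", "p2"},
    "test_b/seed1": {"p3", "p4"},
    "test_c/seed1": {"p4", "p5", "p6"},
    "test_c/seed2": {"p5"},
}
selected = greedy_select(coverage_by_test)
print(selected)  # ['test_a/seed1', 'test_c/seed1']
```

In this toy example two of the five test-and-seed combinations reach the same six coverage points as the whole set. The greedy pass needs coverage data that has already been collected, whereas a predictive model aims to estimate which tests, seeds and parameters will be productive before they are run.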
Bug finding is a costly workflow, in terms of both time and compute, but it’s fundamental for the development of high-quality IP. We use multiple ML and DS approaches in this phase to analyze test pass/fail patterns and aim to find as many bugs as quickly, and with as few tests, as possible.
First, we use ML and DS as a kind of ‘gatekeeper’ between test generation and execution. The gate opens more readily for tests that are more likely to fail, effectively decreasing the number of inefficient tests executed. Elsewhere, we use ML and DS to investigate what happens during test execution, then use this information to influence future test execution and design new, more efficient tests.
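The gatekeeper idea can be sketched as follows, with the caveat that this is an assumption-laden toy rather than the approach used in practice. It estimates a failure likelihood per test template from past results using Laplace smoothing, then admits each generated test with that probability, keeping a small floor so that rarely failing templates are still sampled. The FailureGate class, the template names and the floor value are hypothetical.

```python
import random
from collections import Counter


class FailureGate:
    """Admit generated tests with a probability tied to past failure rates."""

    def __init__(self, floor=0.05, seed=0):
        self.runs = Counter()
        self.fails = Counter()
        self.floor = floor  # minimum admission probability (assumed value)
        self.rng = random.Random(seed)

    def record(self, template, failed):
        """Store the outcome of one executed test from this template."""
        self.runs[template] += 1
        self.fails[template] += int(failed)

    def failure_score(self, template):
        """Laplace-smoothed failure rate; unseen templates score 0.5."""
        return (self.fails[template] + 1) / (self.runs[template] + 2)

    def admit(self, template):
        """Open the gate more readily for templates that fail more often."""
        return self.rng.random() < max(self.floor, self.failure_score(template))


gate = FailureGate()
# Made-up history: one template fails often, the other never has.
for _ in range(8):
    gate.record("mem_ordering", failed=True)
for _ in range(2):
    gate.record("mem_ordering", failed=False)
for _ in range(98):
    gate.record("alu_basic", failed=False)

# Offer 500 candidates from each template and count how many pass the gate.
admitted = Counter(
    template
    for template in ["mem_ordering", "alu_basic"] * 500
    if gate.admit(template)
)
```

With this history the gate admits most mem_ordering candidates and only a small fraction of alu_basic ones, so compute is concentrated on the tests most likely to expose a bug. A production gatekeeper would presumably score individual tests from richer features than a template name.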
These gains can be used directly to save compute, to increase quality, or both. They also help to counterbalance the otherwise continuously increasing verification effort (and compute) required by ever more complex workflows and designs.
Margining is executed during memory compiler development to ensure that our memory IP is functional and robust on silicon. We perform margining on every memory compiler per foundry, and per node.
Foundry process technology improvements present us with more challenges, including covering more process, voltage and temperature (PVT) cases and memory compiler features. This significantly increases the already large number of simulations and analyses – and, consequently, the required compute hours.
While all simulations need to be run at key milestones, we can increase efficiency in between. Applying ML and DS to a subset of simulations allows for the development of models that predict the results of simulations without actually running them, reducing both compute hours and energy consumption.
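One simple way to picture such a predictive model is a surrogate that interpolates from the corners that were simulated to those that were not. The sketch below uses inverse-distance weighting over the nearest simulated PVT corners; it is a stand-in chosen for brevity, not the model used in practice, and the corner values, margin figures, scaling constants and the predict_margin function are all invented for illustration.

```python
import math


def predict_margin(simulated, corner, k=2):
    """Estimate a margin at an unsimulated (voltage, temperature) corner
    as the inverse-distance-weighted mean of the k nearest simulated ones."""

    def distance(a, b):
        # Scale volts and degrees Celsius to comparable ranges
        # (0.1 V and 50 degrees are assumed scaling constants).
        return math.hypot((a[0] - b[0]) / 0.1, (a[1] - b[1]) / 50.0)

    nearest = sorted(simulated, key=lambda c: distance(c, corner))[:k]
    if distance(nearest[0], corner) == 0:
        # The corner was actually simulated: return the real result.
        return simulated[nearest[0]]
    weights = [1.0 / distance(c, corner) for c in nearest]
    total = sum(w * simulated[c] for w, c in zip(weights, nearest))
    return total / sum(weights)


# Made-up margins from the subset of corners that was simulated.
simulated = {
    (0.70, -40): 12.0,
    (0.70, 125): 8.0,
    (0.90, -40): 30.0,
    (0.90, 125): 24.0,
}
estimate = predict_margin(simulated, (0.80, 25), k=4)
exact = predict_margin(simulated, (0.90, 125))
```

The estimate for the unsimulated corner always falls within the range of the simulated values, which is a limitation of this particular interpolation scheme: it cannot extrapolate. That is another reason to keep running every simulation at key milestones and to use predictions only in between.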
Greater efficiency; smaller environmental footprint
We are already seeing the benefits of applying ML and DS to tackle workflow inefficiencies and increase productivity. While efficiency gains turn out to be minimal in some areas, in others we see impressive improvements of over 90%. Our success has encouraged further investment, and we’re continuously exploring ways to use ML and DS across even more of our engineering processes.
Integrating ML and DS at scale ultimately contributes to the delivery of ever more sophisticated and powerful technology at lower cost, within shorter time frames and at higher quality – all while enabling a reduction in our environmental footprint. What is not to like?