Artificial intelligence (AI) processing hardware has emerged as a critical piece of today’s tech innovation. AI hardware architecture is highly symmetric, with large arrays of up to thousands of processing elements (tiles), which leads to designs of more than a billion gates and to huge power consumption. For example, the Tesla Autopilot software stack consumes 72W of power, while the neural network accelerator consumes 12W (Source: The Verge). A recent study from Stanford has shown that building and training a complex neural network can produce up to 78,000 pounds of carbon emissions (the equivalent of flying 60 passengers from San Francisco to New York). Designing AI hardware for efficient energy consumption has become critical, not only to reduce the cost of running server farms and to improve battery life, but also to preserve our planet.
The challenge of optimizing AI power necessitates a comprehensive approach, which includes 1) analyzing software and hardware together with the goal of optimizing both, 2) defining the best possible architecture and power management, 3) obtaining early total and glitch power at the RTL stage to identify the best micro-architectures, 4) making power a cost function during implementation, and 5) performing efficient power and signal integrity signoff.

1. System-level power analysis, or how to define the best architecture for AI hardware

System-level analysis is key to identifying the architecture that delivers maximum performance at lower power. Because tile-to-tile traffic is intense when the algorithm runs on the AI hardware, and a huge amount of switching activity happens synchronously, it is critical to analyze the execution of the software application on the hardware model and to define the software and hardware architecture that best spreads the switching activity. Techniques include clock spreading, distributing memory access over time, developing better dynamic voltage and frequency scaling (DVFS), improving power shutdown schemes, and optimizing power management strategies.

[Figure: Example of power vs. performance vs. energy trade-off analysis. Source: Synopsys]

2. Power profiling of software and hardware using emulation

Another way to analyze the power of a tile in the context of the full chip and software is to use emulation. Emulation lets the user run power analysis while the real workload (up to billions of cycles) runs on the chip, and identify windows of interest for di/dt, peak power, or average power analysis. Because of the large number of multiply-accumulate (MAC) operations per cycle, identifying these windows is critical for IR drop and peak power analysis.
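As a minimal sketch of how such windows might be picked out, assume the emulator's power profile has been exported as a plain list of watts per cycle. The function names, the sample trace, and the 0.8 V supply below are hypothetical, and the cycle-to-cycle current step is only a crude proxy for di/dt, not what a signoff tool computes.

```python
def peak_power_window(trace_w, window):
    """Return (start_cycle, average_watts) of the window with the highest average power."""
    if not 0 < window <= len(trace_w):
        raise ValueError("window must be between 1 and len(trace_w)")
    running = sum(trace_w[:window])          # power summed over the first window
    best_start, best_sum = 0, running
    for start in range(1, len(trace_w) - window + 1):
        # Slide the window by one cycle: add the new cycle, drop the oldest one.
        running += trace_w[start + window - 1] - trace_w[start - 1]
        if running > best_sum:
            best_start, best_sum = start, running
    return best_start, best_sum / window


def steepest_current_ramp(trace_w, vdd):
    """Return (cycle, delta_amps) for the largest cycle-to-cycle current increase.

    Assumes a fixed supply voltage `vdd`, so current is simply power / vdd.
    """
    if len(trace_w) < 2:
        raise ValueError("need at least two cycles")
    currents = [p / vdd for p in trace_w]
    cycle = max(range(1, len(currents)),
                key=lambda i: currents[i] - currents[i - 1])
    return cycle, currents[cycle] - currents[cycle - 1]


if __name__ == "__main__":
    # Hypothetical per-cycle power (watts) for one tile: idle, a MAC burst, idle again.
    trace = [1.0, 1.2, 1.1, 4.0, 4.5, 4.2, 1.3, 1.0]
    print(peak_power_window(trace, window=3))
    print(steepest_current_ramp(trace, vdd=0.8))
```

On a real workload the trace would span millions or billions of cycles, so the windows found this way would typically be handed to a more detailed peak power or IR drop analysis rather than interpreted directly.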
Emulation quickly obtains a power profile of the workload and provides feedback to the software and hardware engineers. For example, it can reveal power that leaks away during tile-to-tile operation and that a software change could turn off, such as one that drives hierarchical clock gating.

3. Early power analysis and optimization at RTL

Because the architecture of AI hardware is symmetric and replicated, it is very important to identify the best possible micro-architecture, clock gating, memory gating, or data gating for the tile at the RTL stage. Reducing power for a highly replicated tile leads to large energy savings at chip level. This is enabled by physically aware RTL power analysis, which can provide early but accurate power estimates (typically within 10% of signoff). RTL power analysis in turn enables fast what-if analysis to identify the best micro-architecture and provides guidance on how to improve clock gating efficiency and memory access rate. Additional data gating at this stage can lead to up to 25% power savings for an AI processing tile.

4. Glitch power – A significant concern for AI-style designs

Because of the huge number of operations performed when the AI algorithm runs on hardware, glitch power has become a critical part of power consumption. Glitch power can represent up to 40% of the total power. Typically, glitch power is computed very late in the flow, when gate-level simulation (GLS) with timing delays is available. This is too late to change the micro-architecture, to take glitch power into consideration as part of power costing during implementation, or to perform specific ECOs to reduce glitch power.

[Figure: Percentage of glitch power vs. total power for different designs. Source: Synopsys]

More novel approaches are available to anticipate glitch power accurately from RTL or zero-delay simulation.
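As a rough illustration of what a timing-aware simulation sees and a zero-delay one does not, the sketch below splits the transitions of a single net into functional toggles (the net ends the clock cycle at a different value than it started) and glitch toggles (every extra transition inside that cycle). The function name, the event format, and the sample waveform are hypothetical, and every toggle is weighted equally; a real estimator would also weight toggles by capacitance and account for partial-swing glitches.

```python
from itertools import groupby


def split_toggles(events, clock_period, initial=0):
    """Split one net's transitions into (functional, glitch) toggle counts.

    events       -- list of (time, value) pairs sorted by time, value in {0, 1}
    clock_period -- cycle length in the same time unit as the event times
    initial      -- value of the net before the first event
    """
    functional = glitch = 0
    value = initial
    # Group the events by the clock cycle in which they occur.
    for _, cycle_events in groupby(events, key=lambda e: int(e[0] // clock_period)):
        start = value
        toggles = 0
        for _, new_value in cycle_events:
            if new_value != value:       # ignore events that do not change the net
                toggles += 1
                value = new_value
        useful = 1 if value != start else 0   # at most one toggle per cycle is needed
        functional += useful
        glitch += toggles - useful
    return functional, glitch


if __name__ == "__main__":
    # Hypothetical timed waveform (ns) for one net with a 1 ns clock.
    waveform = [(0.2, 1), (0.4, 0), (0.6, 1), (1.3, 0), (1.5, 1), (2.1, 0)]
    functional, glitch = split_toggles(waveform, clock_period=1.0)
    print(functional, glitch, glitch / (functional + glitch))
```

A zero-delay simulation only records the settled value per cycle, so it sees the functional toggles alone, which is why the glitch share has traditionally had to wait for simulation with timing delays.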
Anticipating glitch power this early gives estimates within 5% of signoff, which drives better design decisions during RTL development, better power costing during implementation and ECO, and a drastic reduction in glitch power.

[Figure: Early glitch estimator combinational power results within 5% of GLS. Source: Synopsys]

5. Final chip-level power signoff

The last step is to sign off on power and IR drop. The main challenge is the size of the design and the number of cycles to analyze. This problem can be resolved by massively parallelizing the analysis workloads while leveraging whatever on-premises and cloud resources are available. Chip-level signoff analysis can be sped up further by reusing tile-based power analysis. For IR drop analysis, vectorless techniques can be used to generate vectors that achieve the maximum instantaneous peak power or maximum IR drop.

Conclusion

Powering modern and future AI hardware must start with understanding the software. A comprehensive design solution for AI power establishes an intrinsic connection with the micro-architecture early in the design process and provides the framework to follow through to design completion and final signoff, minimizing the risk of late-stage surprises.

Solaiman Rahim is group director for R&D in Synopsys’ Design Group.