DRAM Choices Becoming Central Design Considerations


Low Power-High Performance

Memory footprint, speed, and density scaling are compounded by low-power constraints.

May 12th, 2022 – By: Ann Steffora Mutschler

Chipmakers are paying much closer attention to various DRAM options as they grapple with what goes on-chip or into a package, elevating attached memory to a critical design element that can affect system performance, power, and cost.

These are increasingly important issues to sort through, with a number of tradeoffs, but the general consensus is that to reach the higher levels of performance required to process more data per watt, power efficiency must be improved. Increasingly, the flavor of DRAM used, and how it is accessed, play a huge role in hitting that performance target within a given power budget.

“The low power space is about how to maintain the signal integrity and get to the performance levels needed, all the while improving the power efficiency,” said Steven Woo, fellow and distinguished inventor at Rambus. “In the mobile space, there’s no end in sight to the performance needed in the next generation. There are a lot of concepts that previously were used in just one type of memory, such as multiple channels, which now are being implemented across different markets.”

The main questions center on which concepts get borrowed by the other types of memories, and how the low-power market will start influencing what goes on in main memory. Low-power engineering teams have been focusing primarily on trying to save power and stay within a certain power envelope. “Given that DRAMs are all based on the same basic cell technology, it’s about how to optimize everything else that’s around it, how to optimize for a low power environment, and how to optimize for a high performance environment,” Woo said.

Rami Sethi, vice president and general manager at Renesas Electronics, agreed. “There’s a general acknowledgement that the DDR bus, in terms of its ability to support multi-slot configurations, is going to become increasingly challenged over time. As you run that wide, parallel, pseudo-differential, mostly single-ended style of bus into the 6, 7, 8 gigatransfer-per-second speeds, first you’re going to lose that second slot. You’re not going to be able to support a multi-slot configuration at those speeds at some point.
When that happens, you effectively cut your memory capacity in half. DRAM scaling and density increases will make up for some of that, and you can continue to add more and more channels to the memory controller. But ultimately that’s going to run out of steam, and that approach just won’t get you the incremental capacity that you need.”

Stuart Clubb, technical product management director for digital design implementation at Siemens EDA, points to a similar trend. “Some years ago, NVIDIA published an energy cost comparison showing that going out to main memory (specifically DRAM) versus local CPU registers carried about a 200X energy cost for the same computational effort. Other papers have detailed the nearly 20X difference in power cost between Level 1 cache fetches and main memory. It stands to reason that anything you can do to reduce iterative main memory access is going to be of benefit.”

Further, as power and energy become even more important product metrics, the use of application-specific accelerators to supplement general-purpose compute resources is increasing. “Be it processing-in-memory, computational storage, bus-based accelerators on the SoC or the PCIe server side, or in-line pre-processing, the need for efficient hardware that reduces main memory-associated energy costs is growing,” Clubb said. “Specialized accelerators need to be built for specific tasks with low energy consumption. While traditional RTL power estimation and optimization tools can help in the general RTL design effort, this accelerator space is where we have seen an uptick in the use of high-level synthesis (HLS). Exploring the design space, experimenting with varying architectures, and ultimately building those accelerators with competitive low-power RTL is where HLS is adding value. Custom accelerators with localized lower-power memory solutions have the advantage that when not in use, you can turn them off completely, which you won’t be doing with a CPU/GPU/NPU-type solution.
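The ratios Clubb cites, roughly 200X for DRAM versus a register access and about 20X between main memory and an L1 hit, lend themselves to a back-of-envelope model of where a workload's access energy goes. The sketch below uses relative energy units and an assumed access mix; the numbers are illustrative only, not measured data.

```python
# Back-of-envelope access-energy model using the ratios cited above:
# DRAM ~200x a register access, roughly 20x an L1 hit. Units are
# relative (register access = 1); absolute picojoule values vary
# widely by process and are not taken from the article.

REGISTER = 1     # baseline unit of energy (assumed)
L1 = 10          # ~10x a register, so DRAM below is ~20x L1 (assumed)
DRAM = 200       # ~200x a register, per the comparison cited

def access_energy(register_hits, l1_hits, dram_accesses):
    """Total access energy (relative units) for a given access mix."""
    return register_hits * REGISTER + l1_hits * L1 + dram_accesses * DRAM

# Redirecting 9,000 of 10,000 DRAM accesses to L1 nearly halves the total:
base  = access_energy(1_000_000, 100_000, 10_000)   # 4,000,000 units
tuned = access_energy(1_000_000, 109_000,  1_000)   # 2,290,000 units
```

Even a small shift of traffic out of DRAM dominates the savings, which is the arithmetic behind accelerators with localized memory that can be powered off when idle.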
No matter how much you try to optimize your main memory architecture, the energy cost of data movement is probably something you really want to avoid as much as possible.”

Muddy water

Inside data centers, this requires some tradeoffs. There are costs associated with moving data, but there also are costs associated with powering and cooling servers. Allowing multiple servers to access memory as needed can improve overall memory utilization. This is one of the main drivers for Compute Express Link (CXL), a CPU-to-memory interconnect specification.

“CXL basically allows you a serial connection,” said Renesas’ Sethi. “You’re reducing the number of pins that you need. You can put modules that are based on CXL further away from the CPU, and there’s better extensibility than you get with a direct DDR attach. Those modules can look more like storage devices, SSDs, or PCIe add-in cards, so you can get more density in the form factor. CXL also provides a lot of the protocol hooks for things like cache coherency and load/store memory access, so it starts to allow that DRAM to look more like direct-attached DRAM, or at least like non-uniform memory access (NUMA)-style DRAM.”

That, in turn, requires more consideration about how to architect memory, and which type of memory to use. “When people are choosing between memories, such as DDR and LPDDR, realistically there are some things that are going to be DDR for a really long time. If you’re building a big server, you’re probably going to build a memory out of DDR. If you’re building a mobile phone, you’re probably going to build a memory out of LPDDR. There are very clear-cut things on both sides,” noted Marc Greenberg, group director, product marketing, DDR, HBM, Flash/Storage, and MIPI IP at Cadence. “However, the middle is less clear.
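One way to picture that choice between clear-cut ends and a muddy middle is as a first-pass screen over candidate memory types before any detailed modeling. The sketch below is a toy filter; the bandwidth, capacity, and relative-cost figures are rough illustrative assumptions, not vendor or JEDEC specifications.

```python
import math

# Toy first-pass memory screen. Per-device/stack figures are
# illustrative assumptions only:
# (peak GB/s, capacity in GB, relative cost).
CANDIDATES = {
    "DDR5":   (50,  32, 1.0),
    "LPDDR5": (51,  16, 1.2),
    "GDDR6":  (64,   2, 1.5),
    "HBM3":   (800, 24, 8.0),
}

def shortlist(required_gbps, required_gb, max_devices=8):
    """Options meeting both bandwidth and capacity within the device
    budget, cheapest first, as (name, device_count, relative_cost)."""
    options = []
    for name, (bw, cap, cost) in CANDIDATES.items():
        n = max(math.ceil(required_gbps / bw), math.ceil(required_gb / cap))
        if n <= max_devices:
            options.append((name, n, n * cost))
    return sorted(options, key=lambda o: o[2])

# A hypothetical 400 GB/s, 64 GB accelerator: DDR5 and LPDDR5 squeak in
# at 8 devices, GDDR6 is excluded on capacity, and HBM3 meets the need
# with 3 stacks at a much higher relative cost.
print(shortlist(400, 64))
```

The point of the toy is the shape of the answer, not the numbers: several memory types often survive the screen, which is why teams go on to the system-level modeling described below.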
For the past five years it has become increasingly muddy, where, for example, LPDDR is being used in places that traditionally may have been exclusively the purview of the DDR memories.”

Each has its strengths and weaknesses, but there is enough overlap to confuse things. “One of the strengths of DDR memory is the ability to add capacity in a removable way, such that if you want to add more gigabytes of storage, or have many gigabytes of storage, DDR is the way to do it in most cases. What the LPDDR memories offer is a certain range of memory densities, of capacities, that match well to the mobile devices they go into. But those capacities in some cases match certain types of computing functions, as well. One of the areas where we’ve seen LPDDR memory start to make its way into the server room is in various kinds of machine learning/artificial intelligence accelerators.”

There are a number of attributes that every memory type has, from how much capacity can be put on the interface to how easy it is to add capacity and how much bandwidth it supports. There also are power and enterprise reliability standards to consider. Greenberg noted that in some instances the LPDDRs match the requirements for certain types of systems better than the DDR memories do, even in traditionally DDR types of applications.

Randy White, memory solutions program manager at Keysight Technologies, believes there are essentially two choices, DDR or LPDDR, with HBM and GDDR used in more specialized designs. “Unless you are specializing in a niche application, the choice really comes down to two. DDR is probably 60% to 70% of the volume of memory out there between data centers and desktops. LPDDR is 30% or more, and that’s growing because it tracks the number of new products that are introduced, from phones to other mobile devices. At the same time, LPDDR tends to be ahead of mainstream DDR by one year or more.
That’s been the same for years and years. So why is LPDDR always pushing the envelope, and why does it always get its specs released earlier? It’s because the phone is the target device, and that’s where so much money is tied up. You don’t want the memory to be the bottleneck.”

Choosing the right memory

White says this decision comes down to two things: capacity and application. “Do you need more than 64 gigabytes? A phone or a mobile application would use anywhere from 16 to 32 gigabytes for system memory. This is different than storage for all of the videos and files. You pay a lot of money for that, and your phone provider sells you options for that. But the system memory is fixed. You don’t know about it. It just works. For phones, you don’t really need that much. A server that’s running thousands of virtual machines, doing financial transactions, engineering, database queries, or Netflix streaming, needs terabytes of memory, an order of magnitude or more. That’s the number one selection criterion. How much do you want?”

The second consideration is where it’s going. “What’s the form factor that you’re going into? Servers need a lot of memory, so they have many DIMM slots. But how are you going to get 64 gigabytes across multiple memory chips into a phone? The phone fits in the palm of your hand. Don’t forget about the display, the battery, and the processor, so the space constraints are different,” White said.

An additional consideration involves the evolution of mobile design. “You need more compute power,” he said. “You need more memory, but space is shrinking. How do you deal with that? This is a really fascinating trend. There’s much more integration between the processor and the memory itself. If you look at an old phone from 5 or 10 years ago, the processor was on one part of the board, then the signal was routed out across the board — maybe an inch or two — to the discrete memory component, and the signals went back and forth.
Now we see the trend of die stacking and package-on-package. Now you can stack up to 16 die, and that’s how you get 32 gigabytes or more, because these memory chips are no more than 2 or 4 gigabytes each. The integration is getting so high, for space but also for speed, that you get better signal integrity because you’re not transmitting so far down the circuit board.”

At the same time, this doesn’t mean system architects are finding it easier to make the tradeoffs. In fact, it is not uncommon for engineering teams to change their minds multiple times between DDR, LPDDR, GDDR6, and even HBM, Cadence’s Greenberg said.

“People will go back and forth between those decisions, try it out, weigh up all their options, see how it looks, and then sometimes change memory types after they’ve evaluated for a while,” he said. “They’re typically doing system-level modeling. They’ll have a model of their view and their neural network, along with a model of the memory interface and the memory itself. They’ll run traffic across it, see how it looks, and get a performance assessment. Then they look at how much each one is going to cost, because, for example, HBM memory stands out as having extremely high bandwidth at very reasonable energy per bit. But there are also a lot of costs associated with using an HBM memory. So an engineering team may start out with HBM, run simulations, and when all their simulations look good they’ll budget, and realize how much they would be paying for a chip that has HBM attached to it. Then they’ll start looking at other technologies. HBM does offer excellent performance, for a price. Do you want to pay that price or not? There are some applications that need HBM, and those devices will end up at a price point where they can justify that memory use. But there are a lot of other devices that don’t need quite as much performance, and they can come down to GDDR6, LPDDR5, or DDR5 in some cases.”

Fig. 1: Simulation of a GDDR6 16G data eye with channel effects.
Source: Cadence

Additionally, when the focus is on low power, it often is assumed that LPDDR can’t be beat. That isn’t correct. “The real low-power memory is HBM,” said Graham Allan, senior manager for product marketing at Synopsys. “HBM is the ultimate point-to-point connection because it’s in the same package, on some form of interposer. There’s a very short route between the physical interface on the SoC and the physical interface on the DRAM. It is always point-to-point, short-route, and unterminated, so I’m not burning any termination power. If you look at the power efficiency, which is the energy that it takes to transfer one bit of information — often expressed in picojoules per bit or gigabytes per watt — the power efficiency of HBM is the best of any DRAM. So HBM is really the ultimate low-power DRAM.”

Another avenue for power reduction, which memories historically have not taken advantage of, is to separate the power supply for the core of the DRAM from the I/O. “DRAM always likes to have one power supply,” Allan said. “Everything in the whole DRAM chip is running on the same power supply, and LPDDR4 operated off a 1.2-volt supply. The data eyes were 350 to 400 millivolts tall using that supply. Somebody very smart said, ‘Why are we doing that? Why don’t we use a lower voltage supply and get the same height of data eye? Sure, there’s going to be a little bit of compromise on the rise and fall times, because we’re not driving these transistors with the same drive strength. But it’s point-to-point, so it should be okay.’ And that’s what became LPDDR4x. The major difference between LPDDR4 and LPDDR4x was taking the power supply for the DRAM and chopping it in two — one power supply for the I/O, one power supply for the DRAM core.”

Understandably, everyone who could go to LPDDR4x would have done so, because the DRAM vendors basically have one die that can support either operating voltage.
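The saving from chopping the supply follows from the usual scaling of switching power with voltage squared (P ~ C·V²·f): dropping only the I/O rail cuts I/O power quadratically while the core rail, and the core's power, stay put. The rail values and the 30% I/O share of subsystem power in the sketch below are illustrative assumptions, not JEDEC figures.

```python
# Switching power scales roughly as P ~ C * V^2 * f, so lowering only
# the I/O rail cuts I/O power quadratically while core power is
# unchanged. IO_FRACTION is an assumed share of subsystem power spent
# in the I/O; real splits vary by device and workload.

IO_FRACTION = 0.30  # assumed, for illustration

def io_power_ratio(v_old, v_new):
    """Relative I/O switching power after a rail change (same C and f)."""
    return (v_new / v_old) ** 2

def subsystem_saving(v_old, v_new, io_fraction=IO_FRACTION):
    """Fraction of total subsystem power saved when only the I/O rail drops."""
    return io_fraction * (1.0 - io_power_ratio(v_old, v_new))

# An illustrative LPDDR4-to-LPDDR4x-style split (assumed 1.1 V -> 0.6 V
# I/O rail): ~70% less I/O switching power, but only about a fifth at
# the subsystem level, since the DRAM core rail is untouched.
print(f"{subsystem_saving(1.1, 0.6):.0%}")

# A 1.2 V -> 0.4 V I/O rail drop (a third the voltage) cuts I/O
# switching power by roughly 9x.
print(f"{io_power_ratio(1.2, 0.4):.3f}")
```

This is why halving only the I/O voltage yields a modest overall subsystem saving even though the I/O power itself falls sharply.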
“For the host, unfortunately, if you were designed for LPDDR4, you’re going to be driving LPDDR4 signals out to the DRAM,” Allan said. “And if you’re an LPDDR4x DRAM, you’re going to say, ‘Those signals are a little bit too strong for me, and the voltage is too high. I can’t guarantee my long-term reliability isn’t impacted by what you’re giving me.’ So technically, you’re violating the overshoot/undershoot specs of the DRAM. You had to go through this process where there was a transition. Our customers were asking for help with the LPDDR4-to-LPDDR4x transition. At the end of the day, it’s not a huge power savings. It’s maybe in the range of 15% for the overall subsystem. That’s because the core of the DRAM takes a lot of power, and you’re not changing that supply, so you’re not changing how the core works. You’re only changing the voltage for the I/O that transfers data across the bus. You’re doing that in the PHY on the SoC, and you’re doing that in the PHY on the DRAM. Interestingly enough, as we’ve gone from HBM2 and HBM2e to HBM3, we’ve gone from a common 1.2-volt supply for HBM2e to a 0.4-volt operating supply for the I/Os on HBM3. We’ve reduced it to a third. That’s a big power savings, especially when there are 1,024 of these signals going up and down.”

Reliability concerns

Set against all of the considerations above is an increasing concern for system reliability, according to Rambus’ Woo. “Reliability is becoming a more prominent first-class design parameter. In smaller process geometries, things get more complicated, things interfere, and device reliability is a little bit harder. We’re seeing things like the refresh time/refresh interval dropping because the capacitive cells are getting smaller. Those are all reflections of how reliability is becoming more important. The question is, with integrity being more challenging, how does the architecture change for these DRAM devices? Is more on-die ECC, or something like that, being used?
That’s all happening because it’s now a more prominent problem.”

So what comes next, and how does the industry move forward? “When there are issues at a component level that are really challenging to solve, either because technologically it’s hard or because it’s really expensive, we tend to see those things pulled up to the system level, with people trying to find ways at the system level to solve them,” said Woo.