The ever-increasing demands of artificial intelligence (AI) mean memory concepts that have been around for decades are getting renewed attention, and Samsung Electronics Co., Ltd.’s recent announcement of a high bandwidth memory (HBM) integrated with AI processing power is a good example.
First presented at the International Solid-State Circuits Conference (ISSCC), held virtually earlier this year, Samsung’s new processing-in-memory (PIM) architecture places AI computing capabilities inside an HBM. This is a different approach from the commonly used von Neumann architecture, in which separate processor and memory units carry out millions of intricate data processing tasks. Because the von Neumann architecture relies on sequential processing that requires data to constantly move back and forth, it creates a system-slowing bottleneck, especially when handling the ever-increasing volumes of data found in large-scale processing in data centers, high performance computing (HPC) systems, and AI-enabled mobile applications.
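The data-movement bottleneck described above can be illustrated with a toy model. The byte counts and the reduction operation here are illustrative assumptions, not measurements of any real system:

```python
# Toy model of the von Neumann bottleneck: to sum N words, a separate
# processor must pull every operand across the memory bus, while a
# processing-in-memory design computes beside the data and moves only
# the final result.
WORD_BYTES = 4        # illustrative word size
N = 1_000_000         # illustrative data set size

def bytes_moved_von_neumann(n):
    # Every operand travels from memory to the processor.
    return n * WORD_BYTES

def bytes_moved_pim(n):
    # The reduction happens inside the memory; only the sum crosses the bus.
    return WORD_BYTES

ratio = bytes_moved_von_neumann(N) / bytes_moved_pim(N)
print(f"bus traffic reduction for this toy sum: {ratio:,.0f}x")
```

The point is not the exact ratio but that bus traffic in the conventional model grows with the data set, while in the PIM model it does not.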
An HBM-PIM takes a different route, said Nam Sung Kim, senior vice president of Samsung’s memory business unit, by bringing processing power directly to where the data is stored: it places a DRAM-optimized AI engine inside each memory bank, a storage sub-unit, enabling parallel processing and minimizing data movement. He said the new architecture can deliver more than twice the system performance while reducing energy consumption by more than 70% when applied to Samsung’s existing HBM2 Aquabolt solution.
Kim said several trends are driving renewed attention toward memory with integrated processing power. One critical challenge is that the power efficiency of memory has decreased as bandwidth has increased. In the meantime, he said, the diverse machine learning accelerators being developed for data centers demand higher bandwidth and larger memory capacity with every generation. Overall, the rise of machine learning has created demand for a new architecture, because simply adding HBM stacks for more bandwidth and capacity is not sustainable from a power consumption standpoint.
However, any new solution needs to address key requirements as well as cost concerns, said Kim. Processor design companies don’t have the time and resources to change the memory subsystems of their processors for unproven technologies. DRAM makers, meanwhile, are reluctant to change their core design for PIM because it has been optimized over the span of decades, and changing it would be expensive, he said. Similarly, customers don’t want to change their application code just to accommodate PIM; they want homogeneous systems. “The processing memory should be able to efficiently serve as a normal memory.”
Kim said Samsung decided to start from the DRAM maker’s point of view. “How do we deliver high on-chip compute bandwidth without changing the core design?” The answer was to place a single instruction, multiple data (SIMD) floating point unit (FPU) at the bank I/O boundary and exploit bank-level parallelism by accessing multiple FPUs in a lockstep manner, he said. “Instead of accessing one bank at a time, we can concurrently activate multiple banks and let them do the computation.”
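The lockstep bank-level parallelism Kim describes can be sketched conceptually. The bank count, word counts, and the multiply operation below are illustrative assumptions, not Samsung’s specification; the loop models what the hardware does in parallel:

```python
# Conceptual sketch, not Samsung's actual design: model a few DRAM banks,
# each with its own FPU at the bank I/O boundary. In lockstep mode a single
# SIMD instruction is broadcast to every bank, and each bank applies it to
# its own locally stored data, so operands never cross the memory bus.
NUM_BANKS = 4          # illustrative; real HBM stacks have many more banks
WORDS_PER_BANK = 8     # illustrative word count

# Each bank holds its own slice of the data (e.g. activations and weights).
banks = [[float(b * WORDS_PER_BANK + i) for i in range(WORDS_PER_BANK)]
         for b in range(NUM_BANKS)]
weights = [[0.5] * WORDS_PER_BANK for _ in range(NUM_BANKS)]

def lockstep_multiply(banks, weights):
    """Broadcast one multiply instruction; every bank executes it on its own
    data (modeled here sequentially, but physically concurrent)."""
    return [[x * w for x, w in zip(bank, wt)]
            for bank, wt in zip(banks, weights)]

results = lockstep_multiply(banks, weights)
```

Because every bank runs the same instruction at the same time, the aggregate compute bandwidth scales with the number of banks activated, without touching the DRAM core design.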
To address customer concerns, Samsung built the processing and memory architecture to existing industry standards. “We can make this processing memory technology a drop-in replacement for a commodity DRAM,” said Kim. In addition, a software stack, including device drivers, was developed so existing application source code can run without any changes.
Even so, he said, Samsung expects more collaboration with industry partners, including the standardization of PIM through JEDEC, building a better supporting software ecosystem, and expanding PIM to other DRAM types for other computing platforms, such as low power DDR (LPDDR) for mobile and edge devices.
Jim Handy, principal analyst with Objective Analysis, said Samsung’s HBM-PIM is still a science project “straight out of the research labs,” and it’s unclear whether a business can be made of it. Samsung is also not the first company to attempt something similar: Micron announced its Automata processor back in 2013, touting it as a fundamentally new processor architecture that sped up the search and analysis of complex, unstructured data streams. Automata’s computing fabric was made up of tens of thousands to millions of interconnected processing elements. Its design was based on an adaptation of memory array architecture, exploiting the inherent bit-parallelism of traditional SDRAM.
Micron has since stopped development of Automata, although some academic research centers continue to work on the technology, such as the Center for Automata Processing at the University of Virginia, which Micron co-founded, as well as Micron spinoff Natural Intelligence Semiconductor. Other companies are looking at similar ways of putting a processing element inside a DRAM, said Handy, but what sets Samsung apart is its use of an HBM stack with logic on the bottom. HBM is expensive, and that creates challenges for both the manufacturer and the user.
“From the user’s standpoint, they can get a whole lot more performance as long as they’re willing to completely restructure their software.” That’s a barrier, he said, because customers would rather stick with off-the-shelf software that has been tested and proven than restructure or write new code to gain the memory’s advantages. “That limits the size of the market.” In addition, such devices are generally pitched to customers as memories, so buyers expect them to cost about the same as memory. But like anything with niche appeal that is built in small volumes, they are more expensive to produce, said Handy. “It’s really hard for a DRAM manufacturer to make these things for prices anywhere near the costs of a DRAM.”
Even if Samsung’s HBM-PIM technology is never fully commercialized, there will probably be a time when certain elements of the approach are used in systems, Handy said, though it’s not clear what those might be. One example may be enhancing security: DRAM pages are erased whenever the operating system does a page swap, and while that’s usually done by the processor, it could easily be done inside the DRAM. “That’s the kind of thing that opens the door to something far more elaborate.”
Gary Hilson is a general contributing editor with a focus on memory and flash technologies for EE Times.