At the TinyML Summit, early-stage analog AI accelerator startup Areanna publicly revealed its architecture for the first time, disclosing some of the features of its 40 TOPS/W SRAM array-based design. The unusual design integrates analog-to-digital and digital-to-analog conversion within the memory array. Since ADCs and DACs typically take up the vast majority of the silicon area and power budget in compute-in-memory designs, integrating this functionality within the memory array could be a game changer for analog compute technology.
Areanna is led by former Tektronix analog design engineer Behdad Youssefi alongside another ex-Tek colleague, Patrick Satarzadeh. They remain the company’s only two full-time employees, alongside a couple of part-time engineers and several advisors. The company has a test chip up and running, with one computing tile based on its architecture.
Compute and quantize
Areanna calls its architecture compute and quantize in memory (CQIM). The concept is based on analog compute-in-memory techniques, the same basic concept employed by several other AI chip startups (Mythic, Gyrfalcon and others). However, Areanna uses an SRAM array rather than non-volatile memory, mixed with a good dose of secret sauce.
Areanna’s IP is in the design of its SRAM array, which incorporates ADC and DAC functionality inside the array. Other compute-in-memory designs use a DAC on each row/input and an ADC on each column/output. These ADCs and DACs take up a significant portion of the chip’s power budget and silicon area (by Areanna’s figures, up to 85% of power consumption and 98% of silicon area). In his TinyML presentation, Youssefi described analog compute methods as “replacing the Von Neumann architecture’s memory bottleneck with a data conversion bottleneck.”
In Areanna’s CQIM architecture, the A-D and D-A conversion is performed by the same circuit structures as computation – Areanna calls these multiplying bit-cells (MBCs).
While Areanna’s premise is based on analog computing, the circuitry is almost entirely digital, and is fabricated in digital process technologies. Computation is carried out by reading weight parameters from SRAM bit cells, multiplying them by input activations, converting the products to charge with unit capacitors, and accumulating that charge on vertical accumulation lines. Having the same MBC structures do both A-D and D-A conversion saves a large amount of silicon area, and the absence of an ADC sampling circuit saves power.
“There is an SRAM bit cell, then there’s a multiplier, some logic, and the output of the logic block is a digital signal,” Youssefi said, in a separate interview with EE Times. “A [metal] capacitor turns this signal into charge, which is shared on the vertical accumulation line. There is very little analog circuitry in order to carry out this so-called analog computation.”
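The cell-level behavior Youssefi describes can be illustrated with a toy model. The sketch below is an idealized assumption, not Areanna’s circuit: it treats the multiplier as a 1-bit AND of a stored weight bit and an input bit, gives every cell an identical unit capacitor, and models charge sharing on the accumulation line as a simple average. The names `column_voltage` and `VDD` are hypothetical.

```python
# Idealized, hypothetical model of one accumulation line; not Areanna's circuit.
VDD = 1.0  # assumed supply voltage in volts


def column_voltage(weight_bits, input_bits, vdd=VDD):
    """Voltage on an accumulation line after ideal charge sharing.

    Each cell ANDs its stored weight bit with its input bit (a 1-bit
    multiply) and drives an identical unit capacitor to vdd or 0.
    Sharing charge across N equal capacitors averages the cell outputs.
    """
    if len(weight_bits) != len(input_bits):
        raise ValueError("one input bit per cell is required")
    products = [w & x for w, x in zip(weight_bits, input_bits)]
    return vdd * sum(products) / len(products)


weights = [1, 0, 1, 1, 0, 1, 0, 1]
inputs = [1, 1, 0, 1, 0, 1, 1, 1]
v = column_voltage(weights, inputs)
print(v)  # 4 of 8 cells output a 1, so the line sits at half of vdd
```

In this simplified picture the only analog quantity is the shared voltage on the line, which is consistent with the claim that very little analog circuitry is involved.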
An important feature of this design is that only one quantization (one A-D conversion) is required per dot product computation, regardless of the resolution of computation.
“The way we generate and accumulate MAC results and quantize it back to digital allows us to do only one quantization,” Youssefi said. “This is because of the way we do scaling in the analog domain, prior to quantization. In other compute-in-memory architectures, that scaling happens in the digital domain, so when you’ve already done your A-D conversion, you do the scaling. We do it in the analog domain with high integrity.”
Other compute-in-memory architectures, said Youssefi, might resolve between one and four bits of each computation per vertical accumulation line. A typical architecture might take a two-bit digital input and produce a four-bit digital output (lower precision DACs and ADCs are usually used in order to save silicon area). So multiplying eight-bit weights and input activations might require the computation to be broken down into multiple pieces. Areanna’s design offers fully programmable resolution without compromising the hardware utilization rate.
“We do not compromise hardware utilization rate by going from eight bits to four bits to one bit, it’s still one hundred percent hardware utilization regardless of the resolution,” he said. “[For other compute-in-memory schemes] if you want to offer variable resolution, then you have to significantly lower your hardware utilization rate.”
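To see why slicing multiplies the number of conversions, consider a rough sketch of the conventional approach Youssefi contrasts with CQIM. It assumes 8-bit unsigned operands split into 2-bit slices on both the weight and activation side, and it assumes each low-precision partial sum would need its own A-D conversion before a digital shift-and-add. The slice widths and the helper names `slices` and `sliced_dot` are illustrative choices, not figures from any specific chip.

```python
def slices(value, total_bits, slice_bits):
    """Split an unsigned integer into little-endian slices of slice_bits."""
    mask = (1 << slice_bits) - 1
    return [(value >> shift) & mask for shift in range(0, total_bits, slice_bits)]


def sliced_dot(weights, acts, bits=8, w_slice=2, a_slice=2):
    """Dot product rebuilt from low-precision partial sums.

    Returns (result, number_of_partial_sums). In a sliced design each
    partial sum would need its own A-D conversion before the digital
    shift-and-add; that mapping is an assumption of this sketch.
    """
    w_parts = [slices(w, bits, w_slice) for w in weights]
    a_parts = [slices(a, bits, a_slice) for a in acts]
    total, partial_sums = 0, 0
    for i in range(len(w_parts[0])):        # weight slice index
        for j in range(len(a_parts[0])):    # activation slice index
            partial = sum(w[i] * a[j] for w, a in zip(w_parts, a_parts))
            total += partial << (i * w_slice + j * a_slice)  # digital scaling
            partial_sums += 1
    return total, partial_sums


weights = [200, 17, 255, 3]
acts = [9, 128, 64, 250]
result, conversions = sliced_dot(weights, acts)
exact = sum(w * a for w, a in zip(weights, acts))
print(result == exact, conversions)  # True 16
```

Under these assumptions, one 8-bit dot product costs 16 partial sums, and so 16 conversions. Areanna’s claim is that applying the power-of-two scaling in the analog domain, before quantization, reduces this to a single conversion.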
Data flow optimization
The advantages of using SRAM compared to non-volatile memory include SRAM’s low read/write energy; this enables weights to be brought in from off-chip without a high energy penalty. SRAM’s low write energy also enables flexibility in data flow optimizations, Youssefi explained.
Various data flow optimizations are in use in the industry today; they differ by which data types are kept stationary and which move around the chip. For example, for a large neural network layer that has a lot of weights, it might be efficient to keep the weights stationary. For a network processing high-resolution images, input activations are the most data-intensive data type, so it may make more sense to keep them stationary. Areanna’s SRAM-based fabric allows double stationary data flow optimization; that is, two data types can be made stationary without additional hardware.
“Because our computation is done in parallel, in the analog domain, we don’t really need to move data around,” Youssefi said. “The weights, or whatever the user chooses, can be made stationary and the partial sum [output] is always stationary by virtue of the architecture. So there’s no movement on those two data types.”
The user can choose to make both input activations and partial sums stationary, or both weights and partial sums stationary, depending on what is most efficient for the application (or for, say, specific layers in a neural network).
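A per-layer choice of this kind might look like the sketch below. The selection rule (pin whichever operand is larger) is a heuristic assumed for illustration, and the layer shapes are hypothetical. The only part taken from the article is that partial sums are always stationary and that the user picks either weights or input activations as the second stationary type.

```python
def choose_stationary(weight_count, activation_count):
    """Pick the second stationary data type for a layer.

    Heuristic assumed for illustration: pin the larger operand so the
    smaller one is the only thing that moves.
    """
    pinned = "weights" if weight_count >= activation_count else "input activations"
    return (pinned, "partial sums")


# Hypothetical layers: (name, weight count, input activation count)
layers = [
    ("fully connected 4096x4096", 4096 * 4096, 4096),
    ("3x3 conv, 3->16 channels, 1920x1080 image", 3 * 3 * 3 * 16, 1920 * 1080 * 3),
]
for name, n_w, n_a in layers:
    print(name, "->", choose_stationary(n_w, n_a))
```

The weight-heavy fully connected layer ends up weight stationary, while the high-resolution convolution layer ends up input stationary, matching the two cases described above.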
Another problem with many current compute-in-memory architectures is that their scalability is limited, according to Youssefi.
“Logic technology, which is optimized for power performance, is used to construct these data converters,” Youssefi said in his TinyML presentation. “Then there’s a memory technology, which is optimized for density, and it’s used to fabricate the memory array. And when you put these two technologies together on the same die, you end up with the worst of both worlds.”
Since it is built almost entirely on digital blocks, Areanna’s design can be fabricated in standard CMOS processes and can track with Moore’s Law to smaller process nodes. There is also no need to worry about the analog non-idealities that plague other compute-in-memory designs: the metal capacitors Areanna uses are matched with very high precision, and everything else is digital.
Areanna, founded in 2019, has seed funding from the US National Science Foundation in the form of a Small Business Innovation Research (SBIR) grant, totaling $225,000. The company has two patents on its architecture. In 2020, the startup taped out and fabricated a working test chip that is capable of partial matrix multiplication, proving the functionality of the architecture. The chip’s baseline power efficiency is 40 TOPS/W and its computational density is 2 TOPS/mm² of silicon area (both figures for 8-bit compute). Its memory bandwidth is 2 TB/s per core.
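For a sense of scale, the reported figures can be converted into energy per operation and peak power density. The short calculation below is a back-of-the-envelope sketch: it assumes both figures hold simultaneously at peak throughput, and it leaves open what counts as one operation, since the article does not define it.

```python
TOPS_PER_WATT = 40.0   # reported efficiency, 8-bit
TOPS_PER_MM2 = 2.0     # reported compute density, 8-bit

# 1 TOPS/W is 1e12 operations per joule, so energy per op is the reciprocal.
energy_per_op_fj = 1e15 / (TOPS_PER_WATT * 1e12)        # femtojoules per operation

# Watts per mm^2 at peak = (TOPS per mm^2) / (TOPS per watt); scaled to mW.
power_density_mw = 1e3 * TOPS_PER_MM2 / TOPS_PER_WATT   # mW per mm^2

print(energy_per_op_fj)   # 25.0
print(power_density_mw)   # 50.0
```

Under those assumptions, the test chip spends about 25 fJ per 8-bit operation and dissipates about 50 mW per square millimeter of compute array at full throughput.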
The next step, Youssefi said, is for Areanna to build bigger test chips with multiple computing tiles. A second, more advanced test chip will follow in 2022.