At ISSCC, IBM Research presented a test chip which represents the hardware manifestation of its years of work on low-precision AI training and inference algorithms. The 7nm chip supports 16-bit and 8-bit training, as well as 4-bit and 2-bit inference (32-bit or 16-bit training and 8-bit inference are the industry standard today).
Reducing precision can slash the amount of compute and power required for AI computation, but IBM has a few other architectural tricks up its sleeve which also help efficiency. The challenge is to reduce precision without negatively affecting the computation’s result, something IBM has been working on for a number of years at the algorithm level.
IBM’s AI Hardware Center was set up in 2019 to support the company’s target of increasing AI compute performance 2.5x per year, with an ambitious overall goal of 1000x performance efficiency (FLOPS/W) improvement by 2029. Ambitious performance and power targets are necessary since the size of AI models, and the amount of compute required to train them, is growing fast. Natural Language Processing (NLP) models in particular are now trillion-parameter behemoths, and the carbon footprint that goes along with training these beasts has not gone unnoticed.
This latest test chip from IBM Research shows the progress IBM has made so far. For 8-bit training, the 4-core chip is capable of 25.6 TFLOPS, while the inference performance is 102.4 TOPS for 4-bit integer computation (these figures are for a clock frequency of 1.6GHz and a supply voltage of 0.75V). Reducing clock frequency to 1GHz and supply voltage to 0.55V boosts power efficiency to 3.5 TFLOPS/W (FP8) or 16.5 TOPS/W (INT4).
Low precision training
This performance builds on years of algorithmic work on low-precision training and inference techniques. The chip is the first to support IBM’s special 8-bit hybrid floating point format (hybrid FP8), which was first presented at NeurIPS 2019. The format was developed specifically to enable 8-bit training, halving the compute required compared with 16-bit training without negatively affecting results (read more about number formats for AI processing here).
“What we’ve learned in our various studies over the years is that low precision training is very challenging, but you can do 8-bit training if you have the right number formats,” Kailash Gopalakrishnan, IBM Fellow and senior manager for accelerator architectures and machine learning at IBM Research told EE Times. “The understanding of the right numerical formats and putting them on the right tensors in deep learning was a critical part of it.”
Hybrid FP8 is actually a combination of two different formats. One format is used for weights and activations in the forward pass of deep learning, and another is used in the backward pass. Inference uses the forward pass only, whereas training requires both forward and backward phases.
“What we learned is that you need more fidelity, more precision, in terms of the representation of weights and activations in the forward pass of deep learning,” Gopalakrishnan said. “On the other side of things [the backward phase], the gradients have a high dynamic range, and that’s where we recognize the need to have a [bigger] exponent… this is the trade-off between how some tensors in deep learning need more accuracy, higher fidelity representation, while other tensors need a wider dynamic range. This is the genesis of the hybrid FP8 format that we presented in late 2019, which has now translated into hardware.”
IBM’s work determined that the best way to split the 8 bits between the exponent and mantissa is 1-4-3 (one sign bit, a four-bit exponent and a three-bit mantissa) for the forward phase, with an alternative 1-5-2 version (a five-bit exponent and a two-bit mantissa) for the backward phase, which gives a dynamic range of 2^32. Hybrid FP8-capable hardware is designed to support both these formats.
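As a rough illustration of the trade-off between the two splits, the sketch below decodes the same 8-bit pattern under each. It assumes an IEEE-754-style layout (bias of 2^(e-1)-1, subnormals at exponent zero); IBM's exact bias and special-value conventions may differ.

```python
# Sketch (not IBM's implementation) of decoding an 8-bit floating point
# value under the two hybrid-FP8 bit splits. Assumes an IEEE-754-style
# layout; IBM's actual bias/special-value handling may differ.

def decode_fp8(byte, e_bits, m_bits):
    """Decode an 8-bit pattern laid out as sign / e_bits / m_bits."""
    assert 1 + e_bits + m_bits == 8
    sign = -1.0 if (byte >> 7) & 1 else 1.0
    exp_field = (byte >> m_bits) & ((1 << e_bits) - 1)
    mantissa = byte & ((1 << m_bits) - 1)
    bias = (1 << (e_bits - 1)) - 1
    if exp_field == 0:  # subnormal: no implicit leading 1
        return sign * (mantissa / (1 << m_bits)) * 2.0 ** (1 - bias)
    return sign * (1 + mantissa / (1 << m_bits)) * 2.0 ** (exp_field - bias)

# The same byte, interpreted both ways:
# forward 1-4-3 -> more mantissa bits, finer steps;
# backward 1-5-2 -> wider exponent, far larger dynamic range.
print(decode_fp8(0x7F, 4, 3))  # largest normal 1-4-3 magnitude
print(decode_fp8(0x7F, 5, 2))  # largest normal 1-5-2 magnitude
```

The bit pattern 0x7F decodes to 480 as a 1-4-3 value but to 114,688 as a 1-5-2 value, which is the accuracy-versus-range trade-off Gopalakrishnan describes.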
An innovation the researchers call “hierarchical accumulation” allows the precision of accumulation to be reduced alongside that of the weights and activations. Typical FP16 training schemes accumulate in 32-bit arithmetic to preserve precision, but IBM’s 8-bit training can accumulate in FP16. Keeping accumulation in FP32 would have limited the advantages gained from moving to FP8 in the first place.
“What happens in floating point arithmetic is if you add a large set of numbers together, let’s say it’s a 10,000 length vector and you’re adding all of it together, the accuracy of the floating point representation itself starts to limit the precision of your sum,” Gopalakrishnan explained. “We concluded the best way to do that is not to do addition in a sequential way, but we tend to break up the long accumulation into groups, what we call chunks. And then we add the chunks to each other, and that minimizes the probability of having these kinds of errors.”
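The effect Gopalakrishnan describes is easy to reproduce. In the sketch below, NumPy's float16 stands in for the chip's FP16 accumulators; the vector length and chunk size are illustrative, not IBM's actual parameters.

```python
# Demonstration of chunk-based ("hierarchical") accumulation as described
# above. np.float16 stands in for FP16 accumulators; the 10,000-element
# vector and chunk size of 100 are illustrative choices.
import numpy as np

values = np.full(10000, 0.01, dtype=np.float16)  # true sum is ~100

# Naive sequential accumulation: once the running sum grows large, each
# small addend falls below half an ulp and is rounded away, so the sum
# stalls far short of the true value.
seq = np.float16(0.0)
for v in values:
    seq = np.float16(seq + v)

# Chunked accumulation: sum short chunks first, then sum the partial
# sums, so addends stay comparable in magnitude to their accumulator.
chunk = 100
partials = [np.sum(values[i:i + chunk], dtype=np.float16)
            for i in range(0, len(values), chunk)]
chunked = np.float16(sum(partials, np.float16(0.0)))

print(float(seq), float(chunked))  # sequential stalls well below 100
```

The sequential sum gets stuck once the accumulator is large enough that each 0.01 addend rounds to nothing, while the chunked sum lands close to 100.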
Low precision inference
Most AI inference uses the 8-bit integer format (INT8) today. IBM’s work has shown that 4-bit integers are the current limit for how low inference precision can go without losing significant prediction accuracy. After quantization (converting the model to lower-precision numbers), quantization-aware training is performed: effectively a re-training scheme that mitigates the errors quantization introduces. This re-training can minimize accuracy loss; IBM can quantize to 4-bit integer arithmetic “easily” with only half a percent loss in accuracy, which Gopalakrishnan said is “very acceptable” for most applications.
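To make the numbers concrete, the sketch below shows symmetric per-tensor 4-bit quantization, a common scheme; the article does not specify IBM's exact quantizer, so treat this as a generic illustration.

```python
# Generic symmetric 4-bit weight quantization (illustrative; the article
# does not describe IBM's exact quantizer). Signed INT4 covers [-8, 7].
import numpy as np

def quantize_int4(w):
    """Map float weights to 4-bit integers plus a per-tensor scale."""
    scale = np.max(np.abs(w)) / 7.0            # largest weight maps to +/-7
    q = np.clip(np.round(w / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal(1024).astype(np.float32) * 0.1
q, scale = quantize_int4(w)
w_hat = dequantize(q, scale)

# Round-to-nearest bounds the per-weight error at half a quantization
# step; quantization-aware re-training then adapts the model to it.
print(float(np.max(np.abs(w - w_hat))), float(scale / 2))
```

The residual error per weight never exceeds half a quantization step, and it is this bounded error that quantization-aware re-training compensates for.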
Aside from the focus on low precision arithmetic, there are other hardware innovations which contribute to the chip’s efficiency.
One is on-chip ring communication, a network-on-chip optimized for deep learning that allows each of the cores to multi-cast data to the others. Multi-cast communication is critical to deep learning, since the cores need to share weights and communicate results to other cores. It also allows data loaded from off-chip memory to be broadcast to multiple cores. This reduces the number of times the memory needs to be read, and the amount of data sent overall, minimizing the memory bandwidth required.
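The bandwidth saving from multi-casting can be sketched with back-of-the-envelope numbers (the core count matches the test chip; tile counts and sizes below are invented for illustration).

```python
# Back-of-the-envelope sketch of why multi-cast saves memory bandwidth:
# a weight tile read once from off-chip memory and broadcast on the ring
# replaces one read per core. Tile count and size are illustrative.
cores = 4                    # as on IBM's test chip
tiles = 1000                 # shared weight tiles needed by every core
bytes_per_tile = 8192

unicast_bytes = cores * tiles * bytes_per_tile   # each core fetches its own copy
broadcast_bytes = tiles * bytes_per_tile         # read once, multi-cast to all cores

print(unicast_bytes // broadcast_bytes)  # -> 4: traffic drops by the core count
```

The saving scales with the number of cores sharing the data, which is why Gopalakrishnan's envisioned 32- or 64-core data-center parts would benefit even more.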
“We realized we could run the cores faster than the rings, because the rings involve a lot of long wires,” said Ankur Agrawal, research staff member in machine learning and accelerator architectures at IBM Research. “We decoupled the frequency of operation of the ring from the frequency of operation of the cores… that allows us to independently optimize the performance of the ring with respect to the cores.”
Another of IBM’s innovations was to introduce a frequency scaling scheme to maximize efficiency.
“Deep learning workloads are a bit special, because even during the compilation phase, you know what phases of computation you’re going to encounter in this very large workload,” said Agrawal. “We can do some pre-configuration to figure out what the power profile is going to look like in different parts of the computation.”
Deep learning’s power profile typically has big peaks (for compute-heavy operations like convolution), and troughs (perhaps for activation functions).
IBM’s scheme sets the chip’s initial operating voltage and frequency quite aggressively, such that even in the lowest-power phases, the chip is almost at the limit of its power envelope. Then, when a more power-hungry phase of computation arrives, the operating frequency is reduced so the chip stays within that envelope.
“The net result is a chip that operates at nearly the peak power throughout the computation, even through the different phases,” Agrawal explained. “Overall, by not having these phases of low power consumption, you are able to do everything faster. You’ve translated any dips in power consumption into performance gains by keeping your power consumption almost at the peak power consumption for all the phases of operation.”
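A toy model makes the scheme concrete. Dynamic power scales roughly with activity x V^2 x f, so each pre-profiled phase can be clocked as fast as a fixed envelope allows; the activity factors and normalized units below are invented for illustration, not IBM's numbers.

```python
# Toy model (not IBM's controller) of the pre-profiled frequency-scaling
# scheme described above. Dynamic power ~ activity * V^2 * f; each phase
# runs at the highest clock the fixed power envelope permits.

V = 0.55      # supply voltage, held at the node's practical minimum
F_MAX = 1.0   # clock frequency, normalized to its nominal maximum

# Phases known at compile time, with invented relative switching activity.
phases = {"activation": 1.0, "matmul": 1.3, "conv": 1.6}

# Set the envelope so the lightest phase at full clock just reaches it.
power_cap = min(phases.values()) * V * V * F_MAX

settings = {}
for name, activity in phases.items():
    f = min(F_MAX, power_cap / (activity * V * V))  # fastest legal clock
    settings[name] = (f, activity * V * V * f)

for name, (f, power) in settings.items():
    print(f"{name:10s} clock = {f:.2f} x nominal, power = {power:.3f}")
```

Every phase ends up drawing the same near-peak power: compute-heavy phases run at a reduced clock while light phases run flat out, which is exactly the behavior Agrawal describes.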
Voltage scaling is not used because it’s harder to do on the fly; the time taken to stabilize at the new voltage is too long for deep learning computation. IBM therefore generally chooses to run the chip at the lowest possible supply voltage for that process node.
IBM’s test chip has four cores, in part to allow testing of all the different features. Gopalakrishnan described the core size as a deliberate optimum: an architecture of thousands of tiny cores is complex to connect together, while dividing the problem between a few large cores can also be difficult. This intermediate core size was designed to meet the needs of IBM and its partners in the AI Hardware Center, hitting a sweet spot.
The architecture can be scaled up or down by changing the number of cores. Eventually, Gopalakrishnan imagines 1-2 core chips would be suitable for edge devices while 32-64 core chips could work in the data center. The fact that it supports multiple formats (FP16, hybrid FP8, INT4 and INT2) also makes it versatile enough for most applications, he said.
“Different [application] domains would have different requirements for energy efficiency and precision and so on and so forth,” he said. “Our Swiss army knife of precisions, each of them individually optimized, allows us to target these cores in various domains without necessarily giving up any energy efficiency in that process.”
Along with the hardware, IBM Research has also developed a tool stack (“Deep Tools”) whose compiler enables high utilization of the chip (60-90%).
EE Times’ previous interview with IBM Research revealed that low-precision AI training and inference chips based on this architecture should hit the market in around two years.