Nvidia maintains its lead over Intel in the MLPerf 3.1 benchmark and announces a new supercomputer.
The results of the latest version of the MLPerf training benchmark, released today, show that Nvidia’s H100 GPU continues to lead the way in terms of performance and versatility. However, Intel’s Gaudi 2 AI chip shows a significant performance leap compared to the last round, overtaking the A100 and coming much closer to the H100 when training large language models, for example. Analysts expect Gaudi 3 as early as 2024, when Intel’s AI accelerator could finally catch up with Nvidia’s, at least in some areas.
Nvidia, for its part, uses the benchmark to show that it can build enormously powerful systems that scale efficiently: for the first time, it presented results from the new Eos AI supercomputer, which is equipped with 10,752 H100 GPUs and Nvidia’s Quantum-2 InfiniBand network.
Eos was able to train a GPT-3 model with 175 billion parameters on 1 billion tokens in just 3.9 minutes. That is almost three times as fast as the previous record of 10.9 minutes, set by Nvidia less than six months ago using roughly 3,500 H100 GPUs. Above all, the test shows that Nvidia’s technology scales almost loss-free: tripling the number of GPUs resulted in a 2.8x speedup, which corresponds to a scaling efficiency of 93%. This is a significant increase in efficiency over last year and is due in part to software optimizations.
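The scaling figures can be reproduced with a short calculation. The sketch below is illustrative only: the function name is made up for this example, and the earlier GPU count of 3,584 is an assumption derived from the statement that the number of GPUs tripled (10,752 divided by 3).

```python
def scaling_efficiency(gpus_before, minutes_before, gpus_after, minutes_after):
    """Return (speedup, gpu_ratio, efficiency) for the same training job
    run on two system sizes. Efficiency is speedup divided by GPU ratio."""
    speedup = minutes_before / minutes_after
    gpu_ratio = gpus_after / gpus_before
    return speedup, gpu_ratio, speedup / gpu_ratio


# Times and the Eos GPU count are the article's figures.
# ASSUMPTION: 3,584 GPUs for the earlier record, i.e. one third of 10,752.
speedup, gpu_ratio, efficiency = scaling_efficiency(
    gpus_before=3_584, minutes_before=10.9,
    gpus_after=10_752, minutes_after=3.9,
)

print(f"GPU ratio:  {gpu_ratio:.1f}x")
print(f"Speedup:    {speedup:.1f}x")
print(f"Efficiency: {efficiency:.0%}")
```

Under these assumptions the speedup comes out at about 2.8x for three times the GPUs, which matches the 93% efficiency quoted above.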
In addition to Nvidia, Microsoft also submitted results, using Azure ND H100 v5 instances for a system with 10,752 H100 GPUs, and took just under 4 minutes to complete GPT-3 training.
Nvidia and Microsoft could train GPT-3.5 in 8 days
For a complete training run of a modern GPT-3 model with 175 billion parameters on 3.7 trillion tokens, the compute-optimal amount of data according to the Chinchilla scaling results, Nvidia’s Eos would need only eight days, according to the company’s projections. The result would be a model more similar to GPT-3.5, the original model behind ChatGPT.
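The numbers behind this projection can be sanity-checked with a little arithmetic. The sketch below uses only figures quoted in this article (175 billion parameters, 3.7 trillion tokens, eight days, 10,752 GPUs). The function names are illustrative, and the throughput values are merely the averages the projection implies, not measured results.

```python
SECONDS_PER_DAY = 86_400


def tokens_per_parameter(tokens, parameters):
    """Ratio of training tokens to model parameters."""
    return tokens / parameters


def implied_throughput(tokens, days, gpus):
    """Average tokens per second (system-wide and per GPU) needed to
    process `tokens` in `days` on `gpus` accelerators."""
    total = tokens / (days * SECONDS_PER_DAY)
    return total, total / gpus


# Figures from the article; throughput is derived, not measured.
ratio = tokens_per_parameter(tokens=3.7e12, parameters=175e9)
total_tps, per_gpu_tps = implied_throughput(tokens=3.7e12, days=8, gpus=10_752)

print(f"Tokens per parameter: {ratio:.1f}")
print(f"Implied throughput:   {total_tps:,.0f} tokens/s")
print(f"Per GPU:              {per_gpu_tps:,.0f} tokens/s")
```

The ratio works out to about 21 tokens per parameter, in line with the roughly 20 tokens per parameter commonly associated with the Chinchilla results. The eight-day projection implies an average of about 5.4 million tokens per second across the system, or roughly 500 tokens per second per GPU.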
While it’s unclear how much data OpenAI used to train GPT-3.5, we do know that OpenAI trained GPT-3 with only 300 to 500 billion tokens, and GPT-4 is rumored to have been trained with nearly 13 trillion tokens. The original GPT-3.5 is probably somewhere in between, though the company seems to be going for a smaller model with GPT-3.5-turbo.
For the first time, Stable Diffusion training was included in the MLPerf benchmark. With 1,024 Nvidia H100 GPUs it took 2.5 minutes, and with 64 H100s 10 minutes. Sixteen times as many GPUs thus yield only a fourfold speedup, so training the diffusion model does not scale as efficiently as training large language models. Intel’s Gaudi 2 took just under 20 minutes with 64 accelerators.
Organizations supporting the MLPerf benchmarks include Amazon, Arm, Baidu, Google, Harvard, HPE, Intel, Lenovo, Meta, Microsoft, Nvidia, Stanford University, and the University of Toronto. The tests are designed to be transparent and objective so that users can rely on the results to make informed purchasing decisions.