GPT-4 “crushes” other LLMs according to new benchmark suite


Benchmarks are a key driver of progress in AI. But they also have many shortcomings. The new GPT-Fathom benchmark suite aims to reduce some of these pitfalls.

Benchmarks allow AI developers to measure the performance of their models on a variety of tasks. For language models, these tasks include answering knowledge questions or solving logic problems. Depending on its performance, the model receives a score that can then be compared with the results of other models.

These benchmarking results form the basis for further research decisions and, ultimately, investments. They also provide information about the strengths and weaknesses of individual methods.

Although many LLM benchmarks and rankings exist, they often lack consistent parameters and specifications, such as prompting methods, or do not adequately account for prompt sensitivity. This lack of consistency makes it difficult to compare or reproduce results across studies.

GPT-Fathom aims to bring structure to LLM benchmarking

Enter GPT-Fathom, an open-source evaluation kit for LLMs that addresses the above challenges. It was developed by researchers at ByteDance and the University of Illinois at Urbana-Champaign based on the existing OpenAI LLM benchmarking framework Evals.

In the paper, the researchers also outline the evolution of OpenAI’s GPT models from GPT-3 to GPT-4. | Image: Zhang et al.

GPT-Fathom aims to address key issues in LLM evaluation: inconsistent settings, such as the number of examples (“shots”) in the prompt; incomplete collections of models and benchmarks; and insufficient consideration of how sensitive models are to different prompting methods.

The team used GPT-Fathom to compare more than ten leading LLMs on more than 20 carefully curated benchmarks across seven capability categories, such as knowledge, logic, and programming, under consistent settings.
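To see why a fixed number of “shots” matters, here is a minimal sketch (not GPT-Fathom’s actual API; the function name and prompt format are illustrative assumptions) of how the shot count changes the prompt a model is evaluated on. Two evaluations that use different values of k are scoring the model on different inputs, which is exactly the inconsistency GPT-Fathom tries to eliminate.

```python
# Illustrative sketch only: shows how the number of in-context
# examples ("shots") changes the evaluation prompt. GPT-Fathom's
# real prompt templates and API differ; this is a hypothetical helper.

def build_prompt(examples, question, k):
    """Prepend exactly k worked examples to the test question."""
    lines = []
    for q, a in examples[:k]:
        lines.append(f"Q: {q}\nA: {a}")
    lines.append(f"Q: {question}\nA:")
    return "\n\n".join(lines)

examples = [("2 + 2?", "4"), ("Capital of France?", "Paris")]

# Zero-shot: the model sees only the bare question.
zero_shot = build_prompt(examples, "3 + 5?", k=0)
# Two-shot: the same question, preceded by two worked examples.
two_shot = build_prompt(examples, "3 + 5?", k=2)

print(zero_shot)
print("---")
print(two_shot)
```

Holding k constant across all models (and reporting it alongside the score) is one of the “consistent settings” the benchmark suite enforces.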

GPT-4 is clearly ahead

If you regularly use different LLMs, the main result will not come as a surprise: GPT-4, the model behind the paid version of ChatGPT, “crushes” the competition in most benchmarks, the research team writes. GPT-4 also emerged as the winning model in a recently published benchmark on hallucinations.

