Shortly after Meta released its Code Llama code model, the open-source community began fine-tuning it and immediately reported a new top score, surpassing the result OpenAI published for GPT-4.
Phind, an AI co-programming startup, has announced that it has achieved a new high score on the HumanEval benchmark, an important evaluation test for AI programming tasks, with a fine-tuned 34B variant of Meta’s just-released Code Llama.
In the first run, the fine-tuned standard and Python models scored 67.6 and 69.5 percent, respectively. OpenAI’s GPT-4 scored 67 percent on the same benchmark when it was released in March. According to Meta, the standard Code Llama model with 34 billion parameters scored 48.8 percent, while the Python variant scored 53.7 percent.
|Model|HumanEval score|
|---|---|
|Phind 34B standard model|67.6%|
|Phind 34B Python model|69.5%|
|GPT-4 (OpenAI model)|67%|
|Meta Code Llama 34B|48.8%|
|Meta Code Llama 34B Python|53.7%|
|Meta Unnatural Code Llama (not released)|62%|
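
To put the table in perspective, the following minimal sketch computes how much each Phind model gains over the Meta base model it was fine-tuned from. The scores are the ones reported above. The `gain` helper and the dictionary keys are illustrative names, not part of any Phind or Meta tooling.

```python
# HumanEval scores in percent, as reported in the article.
scores = {
    "Phind 34B standard": 67.6,
    "Phind 34B Python": 69.5,
    "GPT-4 (March)": 67.0,
    "Code Llama 34B": 48.8,
    "Code Llama 34B Python": 53.7,
    "Unnatural Code Llama": 62.0,
}


def gain(tuned: str, base: str) -> tuple[float, float]:
    """Return (absolute gain in percentage points, relative gain in percent)."""
    delta = scores[tuned] - scores[base]
    return round(delta, 1), round(100 * delta / scores[base], 1)


std_gain = gain("Phind 34B standard", "Code Llama 34B")
py_gain = gain("Phind 34B Python", "Code Llama 34B Python")

print(f"Standard model: +{std_gain[0]} points ({std_gain[1]}% relative)")
print(f"Python model:   +{py_gain[0]} points ({py_gain[1]}% relative)")
```

Under these figures, fine-tuning adds 18.8 percentage points to the standard model and 15.8 to the Python model.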
The two Phind models were fine-tuned natively on a custom dataset of about 80,000 high-quality programming tasks and solutions. Phind points out that Meta itself had already fine-tuned Code Llama: the resulting model, Unnatural Code Llama, reaches a 62 percent success rate on HumanEval. However, Meta used only 15,000 examples for that fine-tuning.
The Phind models were trained in three hours on 32 A100 GPUs with 80 GB of memory each, using a sequence length of 4,096 tokens. The researchers used DeepSpeed ZeRO 3 and Flash Attention 2 for faster and more efficient training.
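
As a rough illustration of the training setup described above, the sketch below works out the compute budget and writes down a minimal DeepSpeed configuration with ZeRO stage 3 enabled. Only the GPU count, training time, sequence length, and the use of ZeRO 3 come from the article. The bf16 setting and the batch sizes are assumptions for illustration, not Phind's actual configuration.

```python
import json

# Figures taken from the article.
NUM_GPUS = 32
WALL_CLOCK_HOURS = 3
SEQ_LEN = 4096

# Total compute budget in GPU-hours.
gpu_hours = NUM_GPUS * WALL_CLOCK_HOURS

# Hypothetical per-GPU batch settings (not reported by Phind).
micro_batch_per_gpu = 1
grad_accum_steps = 1

ds_config = {
    # ZeRO stage 3 shards parameters, gradients and optimizer state across GPUs.
    "zero_optimization": {"stage": 3},
    # Assumption: bf16 mixed precision, a common choice on A100 GPUs.
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": micro_batch_per_gpu,
    "gradient_accumulation_steps": grad_accum_steps,
    # DeepSpeed expects: global batch = micro batch * accumulation steps * GPUs.
    "train_batch_size": micro_batch_per_gpu * grad_accum_steps * NUM_GPUS,
}

config_json = json.dumps(ds_config, indent=2)

print(f"Compute budget: {gpu_hours} GPU-hours")
print(f"Tokens per optimizer step: {ds_config['train_batch_size'] * SEQ_LEN}")
print(config_json)
```

At 32 GPUs for three hours, the run comes to 96 GPU-hours, which is small for a 34-billion-parameter model.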
Open-source community accelerates Meta’s AI development
The Llama license allows both scientific and commercial use, but commercial use is restricted: products with more than 700 million monthly active users require a separate license from Meta. In addition, data generated with Llama 2 may not be used to train other AI models.
Numerous community fine-tunes of Meta’s Llama 2 language model now also outperform Meta’s original release in benchmarks. This is likely Meta’s goal: to improve its models faster with the help of the open-source community.