European Research Foundation wants to promote open source LLM diversity



Summary

With the publication of ten intermediary 7B model checkpoints for five European languages, the academic research collective Occiglot takes a step towards preserving European language diversity and digital sovereignty.

US tech companies dominate the market for large language models (LLMs). The academic, non-profit research collective Occiglot wants to counteract this and strengthen Europe’s academic and economic competitiveness and AI sovereignty.

“Occiglot strongly believes that dedicated language modeling solutions are key to maintaining Europe’s academic and economic competitiveness and AI sovereignty,” the announcement reads.

Occiglot has initially released ten intermediary 7B model checkpoints focused on the five major European languages: English, German, French, Spanish, and Italian.

The models are based on the existing Mistral 7B model and were trained further on 700 billion additional multilingual tokens in continued pre-training, followed by about 1 billion tokens of instruction tuning. Details can be found in the technical report.

In addition, a multilingual model covering all five languages has been developed. All models are available on Hugging Face under the Apache 2.0 license.

Occiglot’s roadmap for the coming months is to develop a unified language modeling process that supports all 24 official languages of the European Union as well as several unofficial and regional languages. To this end, a corpus of about 1 trillion tokens of non-English pre-training data has already been collected.

The German AI innovation lab hessian.AI has pledged its support: it intends to provide a “significant amount of compute” on its AI supercomputer fortytwo.

Occiglot is looking for partners

Occiglot calls for collaboration and exchange within the academic and non-academic machine learning, AI and natural language processing community.

Interested parties can get in touch via Discord.

Occiglot’s initiators, supported by the German Research Center for Artificial Intelligence (DFKI), the hessian.AI Innovation Lab and the hessian.AISC Service Center, see their initiative as a key to preserving Europe’s language and cultural diversity.
