The development of multilingual foundation LLMs with strong generalisation and reasoning capabilities requires diverse, high-quality pre-training data across languages. While English-language resources are abundant, most European languages lack sufficient open pre-training data in both quantity and quality. Current collection efforts cannot fully address this scarcity, limiting representation of many languages in multilingual models. Even well-resourced languages face gaps in diversity and quality of available datasets, hampering the development of effective cross-lingual models. Without addressing these dataset composition deficiencies, we risk producing underperforming models that lack the capabilities needed for effective downstream applications.
In an initial 6-month EuroHPC AI Factory allocation (AIF-2025LS01-028), we have already operationalised this vision at scale: we generated trillions of multilingual tokens for LLM pre-training and validated their efficacy by training smaller LLMs on different data mixtures and evaluating them on multilingual benchmarks. As a result, we now have several times more data for many lower-resource European languages. Training on this synthetic data yields large performance gains on multilingual benchmarks and can reduce the compute required to reach baseline performance by 75%. In parallel, we developed a multilingual data annotator model that estimates document-level data quality and other salient properties (e.g. domain, safety, and language-specific characteristics) for large pre-training corpora. These results demonstrate both the feasibility and the impact of our approach and provide a strong empirical foundation for a continuation of this project.
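To make the annotator's role concrete, the following is a minimal sketch of document-level annotation through the Hugging Face pipeline API; the model name and the two sample documents are hypothetical placeholders, not the actual annotator or data from this project.

```python
# Minimal sketch: scoring documents with a fine-tuned quality classifier
# via the Hugging Face `transformers` pipeline API. The model name below
# is a hypothetical placeholder for the project's annotator model.
from transformers import pipeline

annotator = pipeline(
    "text-classification",
    model="example-org/multilingual-quality-annotator",  # hypothetical
)

documents = [
    "Ein gut strukturierter Artikel über erneuerbare Energien in Europa.",
    "click here !!! free offer buy now !!!",
]

# Each document receives a label (e.g. a quality bucket) and a score,
# which can be stored alongside the corpus for later filtering.
for doc, result in zip(documents, annotator(documents, truncation=True)):
    print(result["label"], round(result["score"], 3), doc[:40])
```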
This focused project directly supports the broader EuroLLM and OpenEuroLLM initiatives by addressing a critical bottleneck – the availability of high-quality pre-training data – distinct from the large-scale model training requested in our parallel Extreme Scale proposals. In this next phase, we will use the new allocation to run our data annotator model at scale over large multilingual corpora, refine our filtering and selection strategies based on its signals, and drive larger-scale synthetic data generation informed by these quality estimates. This approach uses generative models to enhance existing content, targeting improvements in language representation, domain coverage, and content diversity across EU languages and beyond.
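As an illustration of the filtering step, below is a minimal sketch of threshold-based selection over annotated records; the field names (quality_score, safety), thresholds, and file paths are assumptions for illustration, not the project's actual annotation schema.

```python
# Sketch: filtering a JSONL corpus by annotator signals. Field names,
# thresholds, and paths are illustrative assumptions, not the real schema.
import json

def keep(record: dict, min_quality: float = 0.7) -> bool:
    """Retain documents that pass the quality and safety thresholds."""
    return (
        record.get("quality_score", 0.0) >= min_quality
        and record.get("safety") == "safe"
    )

with open("annotated_corpus.jsonl") as src, open("filtered_corpus.jsonl", "w") as dst:
    for line in src:
        if keep(json.loads(line)):
            dst.write(line)
```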
Building on the methodology established by Nemotron-CC [1] for English, and extending it with innovative components that address some of its weaknesses, we propose a 4-phase approach for which continued computing access will be crucial: 1) large-scale quality estimation of available multilingual pre-training data, using state-of-the-art quality estimation models including our newly developed annotator; 2) experimentation with multilingual synthetic data creation; 3) evaluation of the efficacy of different methods for various languages, including end-to-end ablation studies that train smaller LLMs; 4) large-scale production of synthetic data for 40 languages: the 24 official EU languages, 9 candidate-member languages, 3 languages co-official in member states, and others of strategic and economic interest (e.g. Norwegian and Icelandic).
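For phase 2, the core generation pattern resembles Nemotron-CC-style rephrasing: a strong instruction-tuned model rewrites an existing document in the target language at higher quality. The sketch below illustrates this under assumed names; the model identifier and prompt wording are hypothetical.

```python
# Sketch: Nemotron-CC-style synthetic rephrasing with a generative model.
# The model name and prompt template are hypothetical illustrations.
from transformers import pipeline

generator = pipeline(
    "text-generation",
    model="example-org/multilingual-instruct-model",  # hypothetical
)

PROMPT = (
    "Rewrite the following {language} text as a clear, well-structured "
    "passage in {language}, preserving all factual content:\n\n{text}"
)

def rephrase(text: str, language: str) -> str:
    """Generate a higher-quality synthetic rewrite of an existing document."""
    prompt = PROMPT.format(language=language, text=text)
    out = generator(prompt, max_new_tokens=512, do_sample=True,
                    return_full_text=False)
    return out[0]["generated_text"]
```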
We will produce high-quality synthetic multilingual datasets using strong existing generative models, prompted to produce texts in the languages, text types, quantities, and quality needed to pre-train open multilingual LLMs for 40 languages. We will assess these datasets through ablation studies, training models of various sizes and evaluating their performance on multilingual benchmarks to provide quantitative evidence of effectiveness. By making the datasets openly available, we aim to improve access to quality pre-training resources for all European languages and to provide a tested, reusable pipeline for future multilingual data creation efforts.
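A minimal sketch of the ablation protocol follows, assuming checkpoints trained on two data mixtures and evaluation with the EleutherAI lm-evaluation-harness; the checkpoint paths and task selection are illustrative assumptions, not the project's actual configuration.

```python
# Sketch: comparing data mixtures by evaluating small LLM checkpoints with
# the EleutherAI lm-evaluation-harness. Paths and tasks are illustrative.
from lm_eval import simple_evaluate

MIXTURE_CHECKPOINTS = {
    "web_only": "checkpoints/web_only",                      # hypothetical
    "web_plus_synthetic": "checkpoints/web_plus_synthetic",  # hypothetical
}

for mixture, path in MIXTURE_CHECKPOINTS.items():
    results = simple_evaluate(
        model="hf",
        model_args=f"pretrained={path}",
        tasks=["xnli_de", "hellaswag"],  # illustrative benchmark choice
        batch_size=8,
    )
    for task, metrics in results["results"].items():
        print(mixture, task, metrics)
```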
Our team brings together the two major LLM initiatives for open and transparent AI in Europe, EuroLLM and OpenEuroLLM, led by strong companies and research groups from different corners of Europe, and is composed of experienced engineers and scientists with expertise in foundation model training, large-scale training datasets, and high-performance computing infrastructure.
Maximilian Idahl, ellamind, Germany