At the Open Innovation Dialogue Hub in Strasbourg, Shaoxiong Ji, PI of MaLA-LM, presented FineOPUS, a massively multilingual translation corpus and pipeline. The event brought together researchers, policymakers, and industry partners to discuss open innovation and multilingual language technology for underrepresented languages.
The LINGUA Community Day was held on June 15, 2026, at the Council of Europe in Strasbourg. Grantees of the Microsoft LINGUA program presented projects that aim to extend state-of-the-art AI models to a broader set of languages.
Modern AI performance remains tightly linked to data quality and scale, yet web-scraped data is often noisy, mislabeled, or semantically misaligned. The result is a "digital language divide" in which translation systems fail for thousands of low-resource languages. To address this bottleneck, FineOPUS systematically curates the open-source OPUS repository, refining existing parallel resources into a clean, foundational multilingual corpus. Conceived as a "digital refinery" for the world's largest open parallel repository, the project adapts the "FineWeb" philosophy to the multilingual parallel domain, shifting the focus from quantity to quality so that researchers can build more inclusive AI technologies.
The recent release, MaLA-LM/FineOPUS-Filtered-Stage4, contains 1.65 trillion tokens (tokenized with DeepSeek V4) across 9,532 language pairs, curated from 83.9 billion raw parallel lines down to 36.1 billion high-quality lines.
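To put the release figures in proportion, the short calculation below derives the share of lines that survived curation. It uses only the numbers quoted above. The tokens-per-line figure assumes that the 1.65 trillion tokens are counted over the 36.1 billion retained lines, which the announcement does not state explicitly.

```python
# Figures quoted in the announcement for FineOPUS-Filtered-Stage4.
RAW_LINES = 83_900_000_000        # raw parallel lines before curation
KEPT_LINES = 36_100_000_000       # high-quality lines after curation
TOKENS = 1_650_000_000_000        # reported token count

removed_lines = RAW_LINES - KEPT_LINES
retention = KEPT_LINES / RAW_LINES

# Assumption: the token count refers to the retained lines only.
tokens_per_kept_line = TOKENS / KEPT_LINES

print(f"removed lines: {removed_lines / 1e9:.1f} billion")
print(f"retention: {retention:.1%}")
print(f"tokens per retained line (assumed basis): {tokens_per_kept_line:.1f}")
```

Roughly 43% of the raw lines are retained, so the filtering stages discard more than half of the raw material.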
On the evening of June 15, researchers participated in an MEP Dinner at the European Parliament, hosted by MEP Loránt Vincze. During the session, Shaoxiong Ji pitched the FineOPUS dataset and highlighted the importance of inclusive, open-source language models.
FineOPUS: Curation and Quality
This collaborative effort of the MaLA-LM team includes researchers from ELLIS Institute Finland, TurkuNLP, and Helsinki-NLP. The project operates as a "digital refinery" for the world's largest open parallel repository, focusing on curating high-fidelity parallel datasets for underrepresented languages.
While the open-source OPUS repository remains a widely used source of parallel data for machine translation, web-scraped data often contains noise, semantic misalignments, and language contamination. FineOPUS's systematic curation improves the reliability of parallel data. By applying rigorous data filtering, cleaning, and augmentation techniques, the project refines existing parallel resources into a clean, foundational multilingual corpus.
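As an illustration of what filtering and cleaning parallel data can involve, here is a minimal sketch of three common heuristics: whitespace and Unicode normalization, rule-based rejection of suspicious pairs, and exact deduplication. It is not the FineOPUS pipeline, which has not been released yet. The function names (`normalize`, `keep_pair`, `filter_pairs`) and the thresholds (`max_ratio`, `max_chars`) are hypothetical choices made for this example.

```python
import hashlib
import unicodedata
from typing import Iterable, Iterator, Tuple

Pair = Tuple[str, str]


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def keep_pair(src: str, tgt: str, max_ratio: float = 3.0, max_chars: int = 1000) -> bool:
    """Return True if a sentence pair passes simple rule-based checks.

    The thresholds are illustrative assumptions, not FineOPUS settings.
    """
    if not src or not tgt:
        return False  # one side is empty
    if src == tgt:
        return False  # source copied to target, likely untranslated
    if len(src) > max_chars or len(tgt) > max_chars:
        return False  # overly long line, often a misaligned document
    ratio = max(len(src), len(tgt)) / min(len(src), len(tgt))
    return ratio <= max_ratio  # reject extreme length mismatches


def filter_pairs(pairs: Iterable[Pair]) -> Iterator[Pair]:
    """Normalize, filter, and exactly deduplicate sentence pairs."""
    seen = set()
    for raw_src, raw_tgt in pairs:
        src, tgt = normalize(raw_src), normalize(raw_tgt)
        if not keep_pair(src, tgt):
            continue
        key = hashlib.sha1(f"{src}\t{tgt}".encode("utf-8")).hexdigest()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        yield src, tgt


sample = [
    ("Hello world", "Bonjour le monde"),
    ("Hello  world", "Bonjour le monde"),   # duplicate once whitespace is collapsed
    ("Hello", "Hello"),                     # untranslated copy
    ("Hi", "Ceci est une phrase beaucoup trop longue"),  # length mismatch
    ("", "vide"),                           # empty source
]
cleaned = list(filter_pairs(sample))
print(cleaned)
```

Character-length ratios are a crude signal, especially across scripts with very different densities, so real pipelines typically combine such rules with language identification and semantic similarity scoring to catch the language contamination and misalignment described above.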
The project is still under active development. Upon completion, the team is committed to open science and will release three key assets to the community:
- The FineOPUS Dataset: A high-quality, curated parallel corpus ready for training translation and language models.
- The FineOPUS Pipeline: Documented, reproducible open-source data processing code.
- The Technical Report: A detailed report explaining the data processing decisions and model evaluations.
Through these releases, the project aims to support developers and researchers in building more capable, diverse, and robust language technologies.
Sponsorship & Support
To expand the coverage and accelerate the development of the FineOPUS dataset and open curation pipeline, the MaLA-LM team welcomes sponsorships, computational resources, and research support. Organizations and industry partners interested in supporting this open science initiative are invited to collaborate with the team to help bridge the resource gaps for underrepresented languages.
Learn more about the project at the official FineOPUS website.