DataComp for Vision-Language Models (DCVLM)

82500 Awarded Resources (in node hours)

Leonardo BOOSTER System Partition

December 2025 - June 2026 Allocation Period

Recent advances in vision-language models (VLMs) have demonstrated that their performance scales predictably with compute, architecture, and, most critically, the quantity and quality of training data. However, while architectural and optimization choices have been extensively studied, systematic understanding of how data composition and filtering influence model scalability and generalization remains limited. In particular, the data pipelines underlying state-of-the-art VLMs are often described in limited detail, which hinders open progress and reproducibility. This project aims to establish an open framework and benchmark for studying VLM dataset quality. We will curate a large, diverse multimodal corpus (~6T tokens) drawn from open-source datasets and prior work, standardize the training infrastructure and evaluation suite, and promote exploration of dataset design choices under controlled compute budgets. The experiments will span model scales from 1B to 4B parameters, enabling robust analysis of how data filtering strategies, such as quality-based selection, modality balancing, and the inclusion of different data types, affect VLM pre-training and downstream transfer. Through extensive ablations, the project seeks to answer four key questions:

How effective are the common multimodal data filters?
How does their utility evolve with model scale?
What are the optimal strategies for mixing heterogeneous data sources?
Do data curation benefits persist through later training stages such as instruction tuning?

The project will yield empirically grounded guidelines for scalable VLM data curation and will release the training code, evaluation suite, filtering tools, and annotated data pools. These releases will in turn foster open, reproducible research in data-centric multimodal AI.

Principal Investigator, Company and Country