This project builds on a new vision foundation model that rivals, and often surpasses, the performance of leading proprietary models such as DINOv2, CLIP, and SigLIPv2. The model is trained with a fully transparent pipeline inspired by Web-SSL, using only publicly available datasets such as ImageNet-21K and a subset of ReLAION-2B.
In this EuroHPC project, the researchers will significantly scale up and extend this vision foundation model along three critical axes: (1) high-resolution finetuning to further enhance performance on dense prediction and localization tasks, (2) distillation of large-scale models into smaller, efficient variants to enable deployment in resource-constrained environments, and (3) training a 7B parameter vision foundation model to serve as a highly capable, open-source backbone for multimodal and downstream tasks.
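The proposal does not specify the distillation objective for axis (2); as an illustration only, the sketch below assumes a simple feature-matching setup in which a frozen large teacher backbone supervises a smaller student through a cosine-similarity loss. The class name FeatureDistiller, the linear projection head, and the assumption that both backbones return pooled embeddings are hypothetical choices, not the project's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch: a frozen large-scale teacher supervises a smaller
# student via feature matching. The actual distillation method used in the
# project may differ.
class FeatureDistiller(nn.Module):
    def __init__(self, teacher: nn.Module, student: nn.Module,
                 teacher_dim: int, student_dim: int):
        super().__init__()
        self.teacher = teacher.eval()              # frozen large-scale model
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student                     # small, efficient variant
        # linear head mapping student features into the teacher's feature space
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # assumes both backbones return pooled embeddings of shape (B, dim)
        with torch.no_grad():
            t_feat = self.teacher(images)
        s_feat = self.proj(self.student(images))
        # negative cosine similarity as the distillation loss
        return 1 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()
```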
Alongside model development, this project addresses fundamental limitations in self-supervised learning (SSL) clustering methods. Contemporary approaches rely heavily on clustering algorithms like Sinkhorn-Knopp, which ignore the semantic ambiguity present in image representations. To overcome this, the project introduces a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations, enabling progressive feature refinement into increasingly granular clusters without increasing model size. This improves representation quality while keeping the parameter and memory footprint essentially unchanged.
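The exact projector design is not detailed here; the sketch below shows one plausible reading, in which nested (Matryoshka-style) prefixes of a shared feature vector are scored against progressively larger prototype banks, producing coarse-to-fine cluster assignments without a separate full-width projector per granularity. The class name MatryoshkaClusteringHead and the specific prefix_dims and num_clusters values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch, assuming nested feature prefixes are each matched against
# their own prototype bank; coarser levels use shorter prefixes and fewer
# prototypes, finer levels use the full feature and more prototypes.
class MatryoshkaClusteringHead(nn.Module):
    def __init__(self, feat_dim: int = 768,
                 prefix_dims=(192, 384, 768),         # nested feature prefixes
                 num_clusters=(1024, 4096, 16384)):   # coarse -> fine prototypes
        super().__init__()
        assert len(prefix_dims) == len(num_clusters)
        assert prefix_dims[-1] == feat_dim
        self.prefix_dims = prefix_dims
        # one prototype bank per granularity, acting on a prefix of the feature
        self.prototypes = nn.ParameterList([
            nn.Parameter(torch.randn(c, d) * d ** -0.5)
            for d, c in zip(prefix_dims, num_clusters)
        ])

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim) backbone embeddings
        logits = []
        for d, protos in zip(self.prefix_dims, self.prototypes):
            z = F.normalize(feats[:, :d], dim=-1)     # nested prefix of the feature
            w = F.normalize(protos, dim=-1)
            logits.append(z @ w.t())                  # (B, num_clusters_at_level)
        return logits  # one set of cluster logits per granularity


# Example: a ViT-Base-sized embedding scored at three cluster granularities.
head = MatryoshkaClusteringHead()
out = head(torch.randn(8, 768))
print([tuple(o.shape) for o in out])  # [(8, 1024), (8, 4096), (8, 16384)]
```

Because the finer heads reuse the same feature vector rather than adding new projection layers, the extra granularities cost only the prototype parameters, which is consistent with the parameter-efficiency claim above.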
This work sets a new bar for open, reproducible, and high-performance vision foundation models, and aligns with EuroHPC’s mission to support large-scale, cutting-edge AI research that benefits the broader scientific and industrial community.
Shashanka Venkataramanan, Valeo.ai, France.