This project builds on a new vision foundation model that rivals, and often surpasses, the performance of leading proprietary models such as DINOv2, CLIP, and SigLIPv2. The model is trained with a fully transparent pipeline inspired by Web-SSL, using only publicly available datasets such as ImageNet-21K and a subset of ReLAION-2B.
In this EuroHPC project, the researchers will significantly scale up and extend this vision foundation model along three critical axes: (1) high-resolution finetuning to further enhance performance on dense prediction and localization tasks, (2) distillation of large-scale models into smaller, efficient variants to enable deployment in resource-constrained environments, and (3) training a 7B parameter vision foundation model to serve as a highly capable, open-source backbone for multimodal and downstream tasks.
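The proposal does not specify the distillation objective for axis (2); as an illustration only, the sketch below assumes a simple feature-matching setup in which a frozen large teacher backbone supervises a smaller student through a cosine-similarity loss. The class name FeatureDistiller, the linear projection head, and the assumption that both backbones return pooled embeddings are hypothetical choices, not the project's actual recipe.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative sketch: a frozen large-scale teacher supervises a smaller
# student via feature matching. The actual distillation method used in the
# project may differ.
class FeatureDistiller(nn.Module):
    def __init__(self, teacher: nn.Module, student: nn.Module,
                 teacher_dim: int, student_dim: int):
        super().__init__()
        self.teacher = teacher.eval()              # frozen large-scale model
        for p in self.teacher.parameters():
            p.requires_grad_(False)
        self.student = student                     # small, efficient variant
        # linear head mapping student features into the teacher's feature space
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        # assumes both backbones return pooled embeddings of shape (B, dim)
        with torch.no_grad():
            t_feat = self.teacher(images)
        s_feat = self.proj(self.student(images))
        # negative cosine similarity as the distillation loss
        return 1 - F.cosine_similarity(s_feat, t_feat, dim=-1).mean()
```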
Alongside model development, this project addresses fundamental limitations in self-supervised learning (SSL) clustering methods. Contemporary approaches rely heavily on clustering algorithms like Sinkhorn-Knopp, which ignore the semantic ambiguity present in image representations. To overcome this, the project introduces a parameter-efficient, multi-head clustering projector based on nested Matryoshka representations, enabling progressive feature refinement into increasingly granular clusters without increasing model size. This improves representation quality while keeping the parameter and memory footprint essentially unchanged.
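The exact projector design is not detailed here; the sketch below shows one plausible reading, in which nested (Matryoshka-style) prefixes of a shared feature vector are scored against progressively larger prototype banks, producing coarse-to-fine cluster assignments without a separate full-width projector per granularity. The class name MatryoshkaClusteringHead and the specific prefix_dims and num_clusters values are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch, assuming nested feature prefixes are each matched against
# their own prototype bank; coarser levels use shorter prefixes and fewer
# prototypes, finer levels use the full feature and more prototypes.
class MatryoshkaClusteringHead(nn.Module):
    def __init__(self, feat_dim: int = 768,
                 prefix_dims=(192, 384, 768),         # nested feature prefixes
                 num_clusters=(1024, 4096, 16384)):   # coarse -> fine prototypes
        super().__init__()
        assert len(prefix_dims) == len(num_clusters)
        assert prefix_dims[-1] == feat_dim
        self.prefix_dims = prefix_dims
        # one prototype bank per granularity, acting on a prefix of the feature
        self.prototypes = nn.ParameterList([
            nn.Parameter(torch.randn(c, d) * d ** -0.5)
            for d, c in zip(prefix_dims, num_clusters)
        ])

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim) backbone embeddings
        logits = []
        for d, protos in zip(self.prefix_dims, self.prototypes):
            z = F.normalize(feats[:, :d], dim=-1)     # nested prefix of the feature
            w = F.normalize(protos, dim=-1)
            logits.append(z @ w.t())                  # (B, num_clusters_at_level)
        return logits  # one set of cluster logits per granularity


# Example: a ViT-Base-sized embedding scored at three cluster granularities.
head = MatryoshkaClusteringHead()
out = head(torch.randn(8, 768))
print([tuple(o.shape) for o in out])  # [(8, 1024), (8, 4096), (8, 16384)]
```

Because the finer heads reuse the same feature vector rather than adding new projection layers, the extra granularities cost only the prototype parameters, which is consistent with the parameter-efficiency claim above.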
This work sets a new bar for open, reproducible, and high-performance vision foundation models, and aligns with EuroHPC’s mission to support large-scale, cutting-edge AI research that benefits the broader scientific and industrial community.
Shashanka Venkataramanan, Valeo.ai, France.