The AURORA project aims to develop the first European family of perceptually grounded multimodal foundation models with robust spatial, geometric, and object-centric understanding.
Current state-of-the-art multimodal large language models (MLLMs) achieve strong high-level vision-language reasoning but rely predominantly on caption-based supervision, which limits perceptual grounding and leads to frequent hallucinations. AURORA advances beyond this paradigm by integrating structural visual learning signals, such as masked prediction, latent-feature reconstruction, and region-level alignment, directly into the core of MLLMs.
These objectives expose the model to the spatial and semantic regularities present in images, enabling it to infer missing structure, reason about spatial relations, and align linguistic understanding with perceptual evidence.
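As an illustration of one such objective, latent-feature reconstruction over masked image patches can be sketched as follows. This is a minimal sketch under stated assumptions, not AURORA's training code: the function names, the mean-squared-error loss, and the uniform random masking scheme are all hypothetical choices made for the example.

```python
import random


def random_patch_mask(num_patches, mask_ratio, seed=0):
    """Pick which patches to hide from the model (hypothetical helper).

    Returns a list of bools, True where the patch is masked.
    """
    rng = random.Random(seed)
    num_masked = int(round(num_patches * mask_ratio))
    hidden = set(rng.sample(range(num_patches), num_masked))
    return [i in hidden for i in range(num_patches)]


def masked_reconstruction_loss(target_feats, predicted_feats, mask):
    """Mean squared error computed over masked patches only.

    target_feats, predicted_feats: one feature vector (list of floats) per patch.
    mask: list of bools, True where the patch was hidden from the model.
    Assumption: visible patches do not contribute to the loss.
    """
    total, count = 0.0, 0
    for tgt, pred, hidden in zip(target_feats, predicted_feats, mask):
        if not hidden:
            continue
        total += sum((t - p) ** 2 for t, p in zip(tgt, pred))
        count += len(tgt)
    return total / count if count else 0.0


# Toy example: four patches with two-dimensional latent features.
target = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
predicted = [[1.0, 0.0], [0.0, 0.0], [1.0, 1.0], [0.5, 0.0]]
mask = [False, True, False, True]
loss = masked_reconstruction_loss(target, predicted, mask)
```

Restricting the loss to masked positions means the model is scored only on structure it had to infer from the surrounding context, which is the intuition behind the objectives described above.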
The project will construct large-scale multimodal datasets with dense scene annotations, develop alignment modules and training strategies for both multimodal alignment and end-to-end visual instruction tuning stages, and scale the approach across multiple LLM sizes and vision encoders.
The resulting models are expected to deliver significant improvements in depth perception, geometric reasoning, spatial coherence, and object-centric analysis, enabling more reliable deployment in perception-intensive applications such as robotics, industrial inspection, and assistive technologies. The collaboration with the Robotics and Autonomous Systems group at AMD Silo AI strengthens the project’s ability to optimize large-scale training workflows and integrate state-of-the-art distributed learning practices.
Leveraging EuroHPC resources is essential for scaling: training involves models with billions of parameters and multimodal datasets with millions of images and region-level annotations, which requires distributed GPU clusters and high-throughput data pipelines.
AURORA will deliver publicly available models, datasets, and tools, reinforcing Europe’s leadership in grounded multimodal AI and contributing to initiatives such as ELLIOT and MINERVA that aim to build sovereign, high-performance foundation models for Europe.
Marcella Cornia, University of Modena and Reggio Emilia, Italy