The European High Performance Computing Joint Undertaking (EuroHPC JU)

Demystifying Visual Understanding in Multimodal Large Language Models

87964 Awarded Resources (in node hours)
Leonardo BOOSTER System Partition
August 2026 - February 2027 Allocation Period

AI Technology: Vision (image recognition, image generation, text recognition OCR, etc.); Generative Language Modeling; Deep Learning.

Multimodal Large Language Models (MLLMs) represent a major step forward in Artificial Intelligence by integrating visual perception with natural language understanding and generation. 

Despite rapid progress, competing MLLMs differ in their attention schemes, loss functions, resolution strategies, visual token management, and encoder designs. These individual advances are often evaluated under different compute budgets, datasets, and benchmarks. The resulting fragmentation makes results hard to compare and leaves significant uncertainty about what truly drives progress in visual understanding.

This project aims to establish the first controlled framework to fairly evaluate design choices in MLLMs. 

The team will systematically explore seven critical design axes through carefully designed controlled experiments and one final integrated run, all while fixing compute, data, and evaluation protocols. 

The project will deliver three key outcomes:

  1. Reproducible evidence on which design decisions matter and which provide marginal or negligible benefit.
  2. A principled ranking of design axes by their contribution to visual reasoning.
  3. A strong vision-centric MLLM baseline, supported by open-source code, to serve as a foundation for future multimodal research.

In line with the EU Project ELLIOT, this work will strive to guide both academic and industrial research in the development of open, fully reproducible Multimodal Foundation Models.

The expected impact is a measurable acceleration in the development of multimodal systems with stronger visual reasoning, achieved through transparent evaluation practices and community-oriented resources.

Principal Investigator, Institution and Country

Matteo Farina, University of Trento, Italy