The European High Performance Computing Joint Undertaking (EuroHPC JU)

DEVLA - Design of Efficient Vision Language Model Architectures

Awarded Resources: 50,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: June 2025 - May 2026

AI Technology: Generative Language Modeling & Vision (image recognition, image generation, text recognition (OCR), etc.)

Vision language models (VLMs) are a multimodal extension of text-based language models that incorporate visual information into the generation of the output text.
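
A minimal sketch of this idea is shown below, assuming a generic vision encoder, a linear projector, and a decoder-only language model; all module names and shapes are illustrative placeholders rather than the architecture developed in this project.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    """Illustrative sketch of a generic VLM pipeline (not the DEVLA architecture)."""

    def __init__(self, vision_encoder: nn.Module, language_model: nn.Module,
                 vision_dim: int, text_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder               # e.g. a ViT returning patch features
        self.projector = nn.Linear(vision_dim, text_dim)   # maps patch features to token space
        self.language_model = language_model               # decoder-only LM over embeddings

    def forward(self, image: torch.Tensor, text_embeddings: torch.Tensor) -> torch.Tensor:
        # Encode the image into a sequence of patch features: (batch, num_patches, vision_dim)
        patch_features = self.vision_encoder(image)
        # Project the patch features into the language model's embedding space
        visual_tokens = self.projector(patch_features)
        # Prepend the visual tokens to the text embeddings and generate conditioned on both
        inputs = torch.cat([visual_tokens, text_embeddings], dim=1)
        return self.language_model(inputs)
```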

Recent developments in VLMs have led to unprecedented performance in a wide range of application scenarios. Efficient models that are suitable for local deployment, e.g., on modern laptops or desktop PCs, are attracting increasing attention.

Such VLMs enable applications in which computational constraints, low latency, or missing data connections play a crucial role. In addition, local processing of user data is important for meeting strict privacy and data security requirements.

In this project, a novel, efficient VLM architecture will be introduced that combines selective state space models and attention mechanisms, including an attention mask that accounts for the image-specific ordering of the patch tokens in sliding window attention.
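
The project summary does not specify the exact masking scheme; the sketch below only illustrates the general idea of a sliding-window attention mask that respects the 2D raster ordering of flattened image patch tokens, so that each patch attends to its spatial neighbours rather than to a purely 1D window over the flattened sequence. Grid size and window radius are hypothetical parameters.

```python
import torch

def patch_sliding_window_mask(grid_h: int, grid_w: int, window: int) -> torch.Tensor:
    """Boolean attention mask for patch tokens flattened in raster order.

    Entry (i, j) is True if patch i may attend to patch j, i.e. if the two
    patches lie within a (2*window+1) x (2*window+1) spatial neighbourhood.
    Illustrative only; the mask design in the project may differ.
    """
    idx = torch.arange(grid_h * grid_w)
    rows, cols = idx // grid_w, idx % grid_w          # 2D coordinates of each patch
    row_dist = (rows[:, None] - rows[None, :]).abs()  # pairwise row distances
    col_dist = (cols[:, None] - cols[None, :]).abs()  # pairwise column distances
    return (row_dist <= window) & (col_dist <= window)

# Example: a 4x4 patch grid with a one-patch neighbourhood on each side
mask = patch_sliding_window_mask(4, 4, window=1)
print(mask.shape)  # torch.Size([16, 16])
```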

Comprehensive performance evaluations will be provided for different combinations of the basic building blocks of VLMs, covering both commonly used state-of-the-art approaches and the new architecture components.
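
As an illustration, such an evaluation can be organised as a sweep over the Cartesian product of candidate components; the component names below are placeholders, not the actual configurations studied in the project.

```python
from itertools import product

# Hypothetical candidate building blocks for an ablation grid
vision_encoders = ["vit_small", "siglip_base"]
connectors = ["linear_projector", "mlp_projector"]
sequence_mixers = ["full_attention", "sliding_window_attention", "selective_ssm_hybrid"]

configurations = [
    {"vision_encoder": v, "connector": c, "sequence_mixer": m}
    for v, c, m in product(vision_encoders, connectors, sequence_mixers)
]

for cfg in configurations:
    # Placeholder for training and evaluating one VLM variant
    print(cfg)
```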

The corresponding results will contribute to making the VLM development process and architectural design choices significantly more transparent to the community.