The European High Performance Computing Joint Undertaking (EuroHPC JU)

Conditional Video Generation Based on Diffusion Models

Awarded Resources (in node hours): 22,000
System Partition: Leonardo BOOSTER
Allocation Period: October 2025 - April 2026

Recent advancements in foundation models have significantly transformed the landscape of computer vision, enabling unprecedented generalisation across diverse tasks and modalities. 

This project investigates the intersection of large-scale vision models and video generation, aiming to unify spatial and temporal understanding within a single generative framework. Leveraging pre-trained vision-language backbones and diffusion-based video generation pipelines, we explore how semantic consistency and fine-grained motion can be jointly modelled. The project further proposes a scalable architecture that adapts foundation models to video synthesis tasks via lightweight temporal adapters and cross-frame attention mechanisms.
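To make the two adaptation mechanisms concrete, the sketch below illustrates one common way such components are built: a cross-frame attention layer in which each spatial token attends to its counterparts across all frames, and a bottleneck temporal adapter inserted with a residual connection so the pre-trained backbone's behaviour is preserved. All names, shapes, and design details here are illustrative assumptions, not the project's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_frame_attention(feats, Wq, Wk, Wv):
    """Cross-frame self-attention (illustrative).

    feats: (T, N, D) array — T frames, N spatial tokens, D channels.
    Each spatial token attends over the time axis, linking the same
    location across frames to encourage temporal consistency.
    """
    T, N, D = feats.shape
    q = (feats @ Wq).transpose(1, 0, 2)  # (N, T, D)
    k = (feats @ Wk).transpose(1, 0, 2)
    v = (feats @ Wv).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)   # (N, T, T)
    out = softmax(scores, axis=-1) @ v               # (N, T, D)
    return feats + out.transpose(1, 0, 2)            # residual, back to (T, N, D)

def temporal_adapter(feats, W_down, W_up):
    """Lightweight bottleneck adapter (illustrative): down-project,
    nonlinearity, up-project, residual. Only W_down/W_up are trained,
    so the frozen backbone's spatial features are left intact."""
    h = np.maximum(feats @ W_down, 0.0)  # ReLU bottleneck
    return feats + h @ W_up

# Toy usage: 8 frames, 16 tokens, 32 channels, bottleneck width 8.
rng = np.random.default_rng(0)
T, N, D, R = 8, 16, 32, 8
feats = rng.standard_normal((T, N, D))
Wq, Wk, Wv = (rng.standard_normal((D, D)) * 0.02 for _ in range(3))
W_down = rng.standard_normal((D, R)) * 0.02
W_up = rng.standard_normal((R, D)) * 0.02

x = cross_frame_attention(feats, Wq, Wk, Wv)
x = temporal_adapter(x, W_down, W_up)
print(x.shape)  # (8, 16, 32) — shape is preserved end to end
```

Because both modules are residual and small relative to the backbone, they can in principle be trained on video data while the spatial weights stay frozen, which is what makes this style of adaptation "lightweight".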

Extensive experiments on benchmark datasets demonstrate that this approach achieves state-of-the-art performance in both zero-shot video generation and fine-tuned downstream tasks, offering new insights into the integration of static vision models with dynamic content synthesis.