Recent advancements in foundation models have significantly transformed the landscape of computer vision, enabling unprecedented generalisation across diverse tasks and modalities.
This project investigates the intersection of large-scale vision models and video generation, aiming to unify spatial and temporal understanding within a single generative framework. Leveraging pre-trained vision-language backbones and diffusion-based video generation pipelines, we explore how semantic consistency and fine-grained motion can be jointly modelled. The project further proposes a scalable architecture that adapts foundation models to video synthesis tasks via lightweight temporal adapters and cross-frame attention mechanisms, as illustrated in the sketch below.
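To make the adapter idea concrete, the following minimal sketch shows one plausible form such a module could take: a residual cross-frame attention block with a lightweight bottleneck MLP, applied to per-frame features produced by a frozen backbone. The class name `TemporalAdapter`, the tensor layout, and all hyperparameters are illustrative assumptions, not the project's actual implementation.

```python
# Sketch only (assumed design, not the project's implementation): a lightweight
# temporal adapter that adds cross-frame attention on top of frozen per-frame
# features from a pre-trained vision backbone.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Residual cross-frame self-attention over the time axis, plus a small MLP."""

    def __init__(self, dim: int, num_heads: int = 8, bottleneck: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Bottleneck MLP keeps the adapter's parameter count small.
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- per-frame features from a frozen backbone.
        b, t, n, d = x.shape
        # Attend across frames for each spatial token independently.
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)  # (batch*tokens, frames, dim)
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h)
        seq = seq + attn_out                   # residual cross-frame attention
        seq = seq + self.mlp(self.norm(seq))   # residual bottleneck MLP
        return seq.reshape(b, n, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    # Toy usage: 2 clips, 8 frames, 196 spatial tokens, 768-dim features.
    feats = torch.randn(2, 8, 196, 768)
    adapter = TemporalAdapter(dim=768)
    print(adapter(feats).shape)  # torch.Size([2, 8, 196, 768])
```

Because only the adapter parameters are trained while the backbone stays frozen, this style of module is what keeps the adaptation lightweight relative to fine-tuning the full foundation model.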
Extensive experiments on benchmark datasets demonstrate that this approach achieves state-of-the-art performance in both zero-shot video generation and fine-tuned downstream tasks, offering new insights into the integration of static vision models with dynamic content synthesis.
Michael Blaschko, KU Leuven, Belgium