Recent advancements in foundation models have significantly transformed the landscape of computer vision, enabling unprecedented generalisation across diverse tasks and modalities.
This project investigates the intersection of large-scale vision models and video generation, aiming to unify spatial and temporal understanding within a single generative framework. Leveraging pre-trained vision-language backbones and diffusion-based video generation pipelines, we explore how semantic consistency and fine-grained motion can be jointly modelled. The project further proposes a scalable architecture that adapts foundation models to video synthesis tasks via lightweight temporal adapters and cross-frame attention mechanisms, as illustrated in the sketch below.
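To make the adapter idea concrete, the following minimal sketch shows one plausible form such a module could take: a residual cross-frame attention block with a lightweight bottleneck MLP, applied to per-frame features produced by a frozen backbone. The class name `TemporalAdapter`, the tensor layout, and all hyperparameters are illustrative assumptions, not the project's actual implementation.

```python
# Sketch only (assumed design, not the project's implementation): a lightweight
# temporal adapter that adds cross-frame attention on top of frozen per-frame
# features from a pre-trained vision backbone.
import torch
import torch.nn as nn


class TemporalAdapter(nn.Module):
    """Residual cross-frame self-attention over the time axis, plus a small MLP."""

    def __init__(self, dim: int, num_heads: int = 8, bottleneck: int = 128):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Bottleneck MLP keeps the adapter's parameter count small.
        self.mlp = nn.Sequential(
            nn.Linear(dim, bottleneck),
            nn.GELU(),
            nn.Linear(bottleneck, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) -- per-frame features from a frozen backbone.
        b, t, n, d = x.shape
        # Attend across frames for each spatial token independently.
        seq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)  # (batch*tokens, frames, dim)
        h = self.norm(seq)
        attn_out, _ = self.attn(h, h, h)
        seq = seq + attn_out                   # residual cross-frame attention
        seq = seq + self.mlp(self.norm(seq))   # residual bottleneck MLP
        return seq.reshape(b, n, t, d).permute(0, 2, 1, 3)


if __name__ == "__main__":
    # Toy usage: 2 clips, 8 frames, 196 spatial tokens, 768-dim features.
    feats = torch.randn(2, 8, 196, 768)
    adapter = TemporalAdapter(dim=768)
    print(adapter(feats).shape)  # torch.Size([2, 8, 196, 768])
```

Because only the adapter parameters are trained while the backbone stays frozen, this style of module is what keeps the adaptation lightweight relative to fine-tuning the full foundation model.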
Extensive experiments on benchmark datasets demonstrate that this approach achieves state-of-the-art performance in both zero-shot video generation and fine-tuned downstream tasks, offering new insights into the integration of static vision models with dynamic content synthesis.
Michael Blaschko, KU Leuven, Belgium