AI Technology: Vision (image recognition, image generation, text recognition OCR, etc.), Deep Learning, Generative Language Modeling.
The team proposes a large-scale computational effort to address critical scaling limitations identified in earlier initial research.
Video-Panda, the first encoder-free video-language conversational model, demonstrated competitive performance with dramatically reduced computational requirements, using only 45M parameters for visual processing compared to the 300M-1.4B used by traditional encoder-based approaches.
However, computational constraints during the research phase prevented the team from fully exploring the model's capabilities with larger language model backbones, expanded datasets, and extended video lengths.
This proposal focuses specifically on scaling Video-Panda to overcome these limitations by:
- integrating recent language model backbones with larger context windows (Phi-3, Qwen2.5, LLaMA-2);
- expanding training from 800K to the full 6M VideoChat-Flash dataset;
- implementing advanced token optimization techniques; and
- enhancing performance on long-form video benchmarks spanning 3-120 minutes.
By systematically scaling the established model architecture and incorporating complementary methodologies such as Mamba-based temporal compression, the study aims to deliver a unified, parameter-efficient VLM that effectively processes long videos while maintaining the inherent advantages of the encoder-free approach.
This scaling effort will produce comprehensive analyses of model configuration behaviors, new benchmarks for long video understanding, and open-source implementations to democratize access to advanced video understanding technologies across diverse scientific domains.
Juergen Gall, University of Bonn, Germany