AI Technology: Vision (image recognition, image generation, text recognition OCR, etc.), Deep Learning, Generative Language Modeling.
The team proposes a large-scale computational effort to address critical scaling limitations identified in earlier initial research.
Video-Panda, the first encoder-free video-language conversational model, demonstrated competitive performance with dramatically reduced computational requirements, using only 45M parameters for visual processing compared to the 300M-1.4B used by traditional encoder-based approaches.
However, computational constraints during the research phase prevented the team from fully exploring the model's capabilities with larger language model backbones, expanded datasets, and extended video lengths.
This proposal focuses specifically on scaling Video-Panda to overcome these limitations by:
- integrating recent language model backbones with larger context windows (Phi-3, Qwen2.5, LLaMA-2);
- expanding training from 800K to the full 6M VideoChat-Flash dataset;
- implementing advanced token optimization techniques; and
- enhancing performance on long-form video benchmarks spanning 3-120 minutes.
By systematically scaling the established model architecture and incorporating complementary methodologies such as Mamba-based temporal compression, the study aims to deliver a unified, parameter-efficient VLM that effectively processes long videos while maintaining the inherent advantages of the encoder-free approach.
This scaling effort will produce comprehensive analyses of model configuration behaviors, new benchmarks for long video understanding, and open-source implementations to democratize access to advanced video understanding technologies across diverse scientific domains.
Juergen Gall, University of Bonn, Germany