This project aims to design a unified video–language grounding system for robust understanding and tracking of object state changes over time.
Building on our previous work on grounded video captioning and grounding, the team aims to develop a flexible multi-task framework that jointly supports grounded video captioning, temporal grounding, and state-change recognition and tracking with segmentation.
Concretely, the model will temporally localise when state changes occur and track them spatio-temporally for each instance of the transformed object, including all resulting pieces. It will also reason about which state change an object is undergoing and detect the exact phase of that state-change process.
In this setting, the model must not only detect when objects are being transformed (e.g. a vegetable being cut) but also find and segment all resulting object instances after the change (e.g. all pieces of the cut vegetable) and describe these state changes in natural language. This moves beyond the conventional assumption that an object's appearance remains stable.
A central component is the construction of new large-scale pseudo-labelled training data that unifies grounded captioning, temporal grounding, and state-change tracking with segmentation. We will exploit existing large-scale datasets of instructional videos and annotate them automatically using a combination of visual foundation models (for detection, segmentation, and temporal proposals) and large language models (for parsing text, generating task formulations, and describing object state changes), yielding an expanded multi-task corpus for pre-training.
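As a purely illustrative sketch, a single pseudo-labelled record in such a unified corpus might combine a caption, temporal phases, and per-instance tracks as shown below. Every name here (`StateChangeRecord`, `InstanceTrack`, the phase labels, the example values) is a hypothetical assumption made for illustration and not the project's actual schema; bounding boxes stand in for segmentation masks to keep the example small.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2); stand-in for a mask


@dataclass
class InstanceTrack:
    """One object instance (e.g. one carrot piece), tracked over frames."""
    instance_id: int
    boxes: Dict[int, Box] = field(default_factory=dict)  # frame index -> box


@dataclass
class StateChangeRecord:
    """Hypothetical unified record: caption, temporal phases, instance tracks."""
    video_id: str
    caption: str
    object_name: str
    state_change: str
    phases: Dict[str, Tuple[float, float]]  # phase name -> (start_s, end_s)
    tracks: List[InstanceTrack] = field(default_factory=list)

    def span(self) -> Tuple[float, float]:
        """Temporal extent covering all annotated phases."""
        starts = [start for start, _ in self.phases.values()]
        ends = [end for _, end in self.phases.values()]
        return min(starts), max(ends)

    def instances_at(self, frame: int) -> List[int]:
        """IDs of all instances annotated at a given frame."""
        return [t.instance_id for t in self.tracks if frame in t.boxes]


# Invented example values: a carrot that is whole at frame 0 and cut in two by frame 10.
record = StateChangeRecord(
    video_id="demo",
    caption="A person cuts a carrot into pieces.",
    object_name="carrot",
    state_change="cutting",
    phases={"initial": (0.0, 2.0), "transitioning": (2.0, 6.5), "end": (6.5, 9.0)},
    tracks=[
        InstanceTrack(0, {0: (10.0, 10.0, 90.0, 30.0), 10: (10.0, 10.0, 45.0, 30.0)}),
        InstanceTrack(1, {10: (50.0, 10.0, 90.0, 30.0)}),
    ],
)
```

In this sketch, the number of tracked instances grows once the object is cut, which is the case that a single-track, stable-appearance representation would not capture.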
The resulting datasets and models could support long-horizon robotic tasks in which understanding how objects evolve over time is crucial, for example enabling a robot to track all pieces of a chopped carrot. More broadly, they could provide richly grounded trajectories linking language, perception, time, and object state changes, which are key training signals for emerging Vision–Language–Action models in assistive and autonomous systems.
Principal Investigator, Institution & Country
Evangelos Kazakos, Czech Technical University in Prague, Czechia