The European High Performance Computing Joint Undertaking (EuroHPC JU)

From 2D to 3D: Empowering Fine-Grained Video Representation through Point Tracking in 3D space

Awarded Resources: 50,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: June 2025 - May 2026

AI Technology: Vision (image recognition, image generation, text recognition/OCR, etc.)

Advancements in computer vision have significantly enhanced video analysis, yet accurately understanding precise 3D motion remains a challenge.

For robots, humans, and other embodied agents to interact effectively with the physical 3D world, a deep understanding of a scene’s structure and dynamics is essential. 

Current methods such as optical flow and its 3D extension, scene flow, are limited: they capture only instantaneous motion between adjacent frames and struggle to maintain long-term, occlusion-aware pixel associations.
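To illustrate this contrast (this sketch is not part of the project's own codebase), the snippet below compares, with assumed array shapes, the data produced by per-frame flow estimation against long-term, occlusion-aware 3D point tracks.

```python
# Illustrative sketch only: how flow fields and long-term point tracks differ
# as data structures. Shapes and sizes are assumptions chosen for clarity.
import numpy as np

T, H, W = 24, 256, 256   # frames, image height, image width
N = 128                  # number of tracked query points

# Optical flow: one 2D displacement per pixel, per adjacent frame pair only.
optical_flow = np.zeros((T - 1, H, W, 2), dtype=np.float32)

# Scene flow: the 3D analogue, still an instantaneous per-pixel displacement.
scene_flow = np.zeros((T - 1, H, W, 3), dtype=np.float32)

# Long-term 3D point tracks: each query point carries a full trajectory across
# all frames plus a per-frame visibility flag, so occlusions are represented
# explicitly instead of silently breaking the correspondence.
tracks_3d = np.zeros((N, T, 3), dtype=np.float32)   # (x, y, z) per frame
visible = np.ones((N, T), dtype=bool)               # False while occluded
```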

This project proposes a new framework centered on point tracking in 3D space to improve spatio-temporal correspondence understanding in video sequences.

The project's approach extends the Tracking Any Point (TAP) task into three dimensions, creating an occlusion-aware method for richer video representation.

The team proposes an efficient sparse approach built on TAPVid-3D, a benchmark designed to evaluate general motion understanding in models that perform point tracking in 3D space.
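For context, the sketch below shows one generic way point-tracking benchmarks score predictions: position accuracy within distance thresholds plus occlusion-flag accuracy. The function name and threshold values are illustrative assumptions; the actual TAPVid-3D metrics differ in their exact definitions and should be taken from the benchmark itself.

```python
# Hedged sketch of benchmark-style scoring for 3D point tracks; not the
# official TAPVid-3D evaluation code.
import numpy as np

def track_accuracy(pred_xyz, gt_xyz, pred_vis, gt_vis,
                   thresholds=(0.05, 0.1, 0.2, 0.4)):
    """pred_xyz, gt_xyz: (N, T, 3) trajectories; pred_vis, gt_vis: (N, T) bool."""
    err = np.linalg.norm(pred_xyz - gt_xyz, axis=-1)   # (N, T) 3D position error
    valid = gt_vis                                      # score only points visible in ground truth
    pos_acc = [(err[valid] < t).mean() for t in thresholds]  # fraction within each threshold
    occ_acc = (pred_vis == gt_vis).mean()               # occlusion-flag accuracy
    return float(np.mean(pos_acc)), float(occ_acc)
```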

The team also aims to set a new baseline for 3D Tracking Any Point and to inspire broader adoption across applications such as video editing, controllable video generation, and robotic manipulation.