Inspired by recent advances in 2D video generation, this project investigates the learning of 3D world simulators directly from multi-view RGB videos. The team employs a state representation composed of a 3D particle set, which not only enables the learning of complex dynamics but also allows language and semantic information to be distilled into the particles. The team's previous work, 3DGSim, demonstrated the viability of this approach on synthetic datasets, successfully capturing a range of dynamics from rigid and elastic bodies to cloth. This project aims to extend 3DGSim by incorporating action conditioning and scaling the framework to real-world data, thereby paving the way for more generalizable and interactive 3D simulators. Realizing this goal requires a significant GPU grant to pioneer the next generation of foundation models: 3D world simulators learned directly from multi-view video. The computational demand is substantial, driven both by converting vast amounts of video data into explicit 3D particle representations and by training on those large particle sets.
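To make the core idea concrete, the following is a minimal sketch, not the 3DGSim implementation, of what an action-conditioned, particle-set world model could look like: the scene state is a set of 3D particles carrying learned feature vectors, and a shared dynamics network predicts the next particle set given the current set and an action vector. All names (ParticleDynamics, the dimensions, the pooling scheme) are hypothetical assumptions for illustration.

```python
# Minimal illustrative sketch of an action-conditioned particle world model.
# All class and parameter names are hypothetical; 3DGSim's actual architecture
# and training setup are not reproduced here.
import torch
import torch.nn as nn


class ParticleDynamics(nn.Module):
    """Predicts per-particle position displacements and feature updates,
    conditioned on a global action vector. Permutation-equivariant over
    particles via shared per-particle MLPs plus a pooled scene context."""

    def __init__(self, feat_dim: int = 32, action_dim: int = 8, hidden: int = 128):
        super().__init__()
        in_dim = 3 + feat_dim          # particle position + semantic features
        self.encoder = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden)
        )
        # Decoder sees the particle encoding, a pooled scene summary, and the action.
        self.decoder = nn.Sequential(
            nn.Linear(hidden + hidden + action_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3 + feat_dim),
        )

    def forward(self, pos, feat, action):
        # pos: (B, N, 3), feat: (B, N, F), action: (B, A)
        x = self.encoder(torch.cat([pos, feat], dim=-1))        # (B, N, H)
        ctx = x.mean(dim=1, keepdim=True).expand_as(x)          # pooled scene context
        act = action.unsqueeze(1).expand(-1, x.shape[1], -1)    # broadcast action
        delta = self.decoder(torch.cat([x, ctx, act], dim=-1))  # (B, N, 3 + F)
        return pos + delta[..., :3], feat + delta[..., 3:]      # residual update


# Usage: roll out one step on a random particle set.
model = ParticleDynamics()
pos = torch.randn(2, 1024, 3)       # batch of 2 scenes, 1024 particles each
feat = torch.randn(2, 1024, 32)     # per-particle semantic features
action = torch.randn(2, 8)          # e.g. an end-effector command
next_pos, next_feat = model(pos, feat, action)
print(next_pos.shape, next_feat.shape)
```

The mean-pooled context here is only a stand-in for whatever particle interaction mechanism (e.g. attention or graph message passing) the actual model uses; it serves to show how a set-valued 3D state and an action can be combined in a single permutation-equivariant update.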
Georg Martius, Eberhard Karls Universität Tübingen, Germany