World models enable AI systems to learn internal representations for understanding, prediction, and planning. EWM advances the Joint-Embedding Predictive Architecture (JEPA) framework to create scalable, multimodal world models trained on diverse data including internet-scale video, images, text, and robotic trajectories.
The project delivers three core contributions: enhanced JEPA encoders achieving state-of-the-art performance on vision tasks; Vision-Language Models (VLMs) that integrate JEPA representations with European sovereign large language models (LLMs) for multilingual multimodal reasoning; and Vision-Language-Action models (VLAs) that enable language-conditioned robot control.
All models, code, and weights will be released under permissive open-source licenses. By developing sovereign AI capabilities on European infrastructure, EWM establishes competitive alternatives to proprietary systems, ensuring that European researchers, industries, and citizens benefit from cutting-edge foundation models with transparent governance.
Principal Investigator, Institution and Country
Sebastian Houben, Bonn-Rhein-Sieg University of Applied Sciences, Germany