This project investigates inference-time scaling in multimodal generative models, focusing on unified diffusion-transformer architectures. While scaling inference-time compute has led to major advances in reasoning for large language models, its potential in multimodal generation remains largely unexplored. At the same time, noise selection in diffusion models remains heuristic, despite evidence that structured approaches may improve efficiency and output quality. We introduce a verifier-guided framework for discrete noise space search, enabling systematic exploration of inference-time scaling, structured noise selection, verifier robustness, and trade-offs across scheduling strategies. The project will deliver new insights into scaling effects in multimodal settings, a robust methodology for guided inference, and open-source models and datasets to support future research.
Cees G.M. Snoek, University of Amsterdam, Netherlands