ARComix

Awarded Resources: 40,000 node hours
System Partition: MareNostrum5 ACC
Allocation Period: October 2025 - April 2026

The recent introduction of Visual AutoRegressive modeling (VAR) has provided a transformative approach to autoregressive learning on images, employing a coarse-to-fine "next-scale prediction" strategy that diverges from the traditional raster-scan "next-token prediction." This methodology has demonstrated superior learning speed and generalization in autoregressive transformers, enabling VAR to surpass diffusion-based methods in image generation performance.
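To make the "next-scale prediction" idea concrete, the following is a minimal toy sketch in plain PyTorch, not the project's codebase: instead of emitting one token at a time in raster order, the model emits an entire token map per scale, conditioning each scale on all coarser ones. All class names, sizes, and the single-query decoding shortcut are illustrative assumptions.

```python
# Toy sketch of VAR-style "next-scale prediction" (hypothetical, illustrative only).
# A shared transformer predicts the token map at each successively finer scale,
# conditioned on all coarser scales generated so far.
import torch
import torch.nn as nn

class ToyNextScalePredictor(nn.Module):
    def __init__(self, vocab_size=4096, dim=256, scales=(1, 2, 4, 8)):
        super().__init__()
        self.scales = scales                      # token-map side lengths, coarse to fine
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(dim, vocab_size)    # logits over the codebook

    @torch.no_grad()
    def generate(self, cond):                     # cond: (B, 1, dim) conditioning vector
        tokens_per_scale = []
        context = cond                            # start from the condition alone
        for s in self.scales:
            h = self.backbone(context)            # attend over all coarser scales
            # Predict the s*s token map for this scale in one parallel step,
            # reusing the last position's features as a crude query (toy choice).
            logits = self.head(h[:, -1:, :]).expand(-1, s * s, -1)
            next_tokens = logits.argmax(-1)       # greedy here; real VAR samples
            tokens_per_scale.append(next_tokens.view(-1, s, s))
            context = torch.cat([context, self.embed(next_tokens)], dim=1)
        return tokens_per_scale                   # coarse-to-fine token maps

model = ToyNextScalePredictor()
maps = model.generate(torch.zeros(1, 1, 256))
print([m.shape for m in maps])                    # (1,1,1), (1,2,2), (1,4,4), (1,8,8)
```

The key contrast with raster-scan decoding is that each autoregressive step emits a whole resolution level at once, so the number of steps grows with the number of scales rather than with the number of individual tokens.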

In previous work, the project research team created and annotated large-scale datasets using foundation models, and successfully trained VAR models that combine textual and visual features to generate coherent single-panel images with characters and dialogue. These experiments used a compact VAR model of roughly 318M parameters, establishing feasibility and validating the approach.

Building on these results, the project now aims to extend the framework by: 

(i) training models conditioned only on textual features, removing the reliance on image features; 

(ii) generating both text-inclusive and text-free images (e.g., panels with or without dialogue balloons); and 

(iii) exploring alternative input modalities, including graphs, text, and high-level visual features (see the sketch after this list).
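As an illustration of how these three directions could share one interface, the following toy sketch encodes text, a dialogue-balloon on/off flag, and optional visual or graph features into a single conditioning vector. All module names and dimensions are assumptions, not the project's actual design.

```python
# Hypothetical sketch of the three conditioning variants (illustrative only).
import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Maps different input modalities to a shared conditioning vector."""
    def __init__(self, text_dim=512, visual_dim=768, graph_dim=128, dim=256):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, dim)      # (i) text-only conditioning
        self.visual_proj = nn.Linear(visual_dim, dim)  # optional high-level visual features
        self.graph_proj = nn.Linear(graph_dim, dim)    # (iii) graph-structured inputs
        self.with_text_flag = nn.Embedding(2, dim)     # (ii) dialogue balloons on/off

    def forward(self, text_feat, with_dialogue, visual_feat=None, graph_feat=None):
        cond = self.text_proj(text_feat)
        cond = cond + self.with_text_flag(with_dialogue)  # text-free vs text-inclusive panels
        if visual_feat is not None:
            cond = cond + self.visual_proj(visual_feat)
        if graph_feat is not None:
            cond = cond + self.graph_proj(graph_feat)
        return cond  # (B, dim), fed to the generator as its condition

enc = ConditionEncoder()
cond = enc(torch.randn(2, 512), torch.tensor([0, 1]))  # one text-free, one with dialogue
print(cond.shape)  # torch.Size([2, 256])
```

Collapsing all modalities into one conditioning vector keeps the generator unchanged across variants; only the encoder inputs differ between experiments.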

A particularly ambitious extension, contingent on the computational resources available, will be scaling to the larger 3B-parameter VAR model and applying it to multi-panel generation (3–4 panels, i.e. comic strips). This scenario requires significantly more compute, as each training run involves training the full 3B model from scratch. If feasible, this would open the door to richer narrative modeling and coherent multi-panel visual storytelling.
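A rough way to see why the 3B scenario dominates the compute budget is the widely used C ≈ 6·N·D approximation for dense-transformer training compute. Every number in the sketch below (training tokens, per-GPU throughput, GPUs per node) is an assumption for illustration, not a figure from the proposal:

```python
# Back-of-envelope training-compute estimate via C ≈ 6·N·D (all inputs assumed).
params = 3e9            # 3B-parameter VAR model
tokens = 100e9          # assumed training tokens (dataset-dependent)
flops = 6 * params * tokens
gpu_flops = 300e12      # assumed sustained FLOP/s per GPU (mixed precision)
gpus = 4                # assumed GPUs per node
node_hours = flops / (gpu_flops * gpus) / 3600
print(f"~{flops:.2e} FLOPs, ~{node_hours:,.0f} node hours per run")
# -> ~1.80e+21 FLOPs, ~417 node hours per run at these assumed rates
```

Under these assumptions a single from-scratch run already consumes hundreds of node hours before accounting for efficiency losses, ablations, or repeated runs, which is why this extension is explicitly tied to the available allocation.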

Through these extensions, this project seeks to further demonstrate the adaptability of VAR beyond single-panel generation, broadening its potential applications in personalized content creation, structured media adaptation, and multimodal learning.