The European High Performance Computing Joint Undertaking (EuroHPC JU)

Emergent Inpainting Capabilities in Large-Scale Protein Language Models: A Novel Supervised Approach

Awarded Resources (in node hours): 300,000
System Partition: MareNostrum5 ACC
Allocation Period: June 2025 - 6 months

The design of novel proteins with tailored functionalities is a critical enabler for advances in biotechnology, pharmaceuticals, and agro-industry. While existing Protein Language Models (PLMs) excel at unsupervised sequence generation, they cannot accurately “inpaint” masked regions of a protein structure, a capability vital for targeted redesign and de novo design of functional domains.

The project will develop StructureGPT-XL, a 980-million-parameter transformer trained on 214 million structure-sequence pairs. By integrating structural masking tokens with a supervised training regime inspired by Masked Language Models, it aims to elicit high-precision inpainting abilities, raising multiclass accuracy at masked positions from a baseline of 50–55% to over 92%.
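To make the target metric concrete, here is a minimal sketch of span masking and accuracy measured only at masked positions. It is an illustration under stated assumptions, not the project's method: the `<mask>` token, the helper names `mask_span` and `masked_accuracy`, the example sequence, and the hard-coded "prediction" are all hypothetical, and the sketch masks sequence residues only, leaving out the structural inputs the project describes.

```python
MASK = "<mask>"  # hypothetical mask token; the project's real vocabulary is not specified


def mask_span(sequence, start, length):
    """Replace residues in [start, start + length) with MASK.

    Returns the masked token list and the list of masked indices.
    """
    tokens = list(sequence)
    end = min(start + length, len(tokens))
    masked = list(range(start, end))
    for i in masked:
        tokens[i] = MASK
    return tokens, masked


def masked_accuracy(predicted, target, masked):
    """Fraction of masked positions where the predicted residue matches the target."""
    if not masked:
        return 0.0
    correct = sum(1 for i in masked if predicted[i] == target[i])
    return correct / len(masked)


# Illustrative sequence (made up for this example).
target = "MKTAYIAKQR"
tokens, masked = mask_span(target, start=3, length=4)

# Stand-in for a model's output: three of the four masked residues are right.
prediction = list(target)
prediction[3:7] = ["A", "Y", "L", "A"]

accuracy = masked_accuracy(prediction, target, masked)
print(tokens)
print(accuracy)
```

Because each masked position is a choice among the amino-acid classes, this per-position match rate is a multiclass accuracy; in the toy run above, 3 of 4 masked residues match, giving 0.75.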

Leveraging EuroHPC AI Factory Large Scale resources, the project researchers will parallelise data preprocessing, model training, and hyperparameter exploration across distributed GPU clusters, reducing time-to-solution and enabling extensive ablation studies. The supervised inpainting framework is unique within PLM research, combining computer-vision autoencoder principles with protein sequence modeling. 

Expected outcomes include a supervised PLM with state-of-the-art inpainting performance, a benchmark dataset of masked structure-sequence pairs, and training recipes. This work will accelerate protein engineering workflows, enabling rapid prototyping of enzyme variants, therapeutic antibodies, and novel protein scaffolds with direct industrial applications.