The design of novel proteins with tailored functionalities is a critical enabler for advances in biotechnology, pharmaceuticals, and agro-industry. While existing Protein Language Models (PLMs) excel at unsupervised sequence generation, they lack the ability to accurately “inpaint” masked regions of a protein structure—a capability vital for targeted redesign and de novo design of functional domains.
This project proposes “Emergent Inpainting Capabilities in Large-Scale Protein Language Models: A Novel Supervised Approach”, which will develop a 980 million-parameter transformer (StructureGPT-XL) trained on 214 million structure-sequence pairs. By integrating structural masking tokens and a supervised training regime inspired by Masked Language Models, the project aims to elicit high-precision inpainting abilities, raising multiclass accuracy at masked positions from a baseline of 50–55% to over 92%.
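To make the training objective concrete, the sketch below shows one way such a supervised inpainting loss could be computed: a fraction of residue positions is replaced by a mask token, the transformer predicts amino-acid identities at those positions conditioned on the surrounding sequence and structure tokens, and cross-entropy is evaluated only at the masked sites. This is a minimal PyTorch illustration under assumed names (MASK_ID, mask_fraction, the model call signature); it is not the project's actual implementation.

```python
# Minimal sketch of a supervised masked-inpainting objective (assumed names,
# not the project's actual code). A transformer receives sequence tokens plus
# structure tokens and is trained to recover amino acids at masked positions.
import torch
import torch.nn.functional as F

VOCAB_SIZE = 33          # 20 amino acids + special tokens (assumption)
MASK_ID = 32             # index of the mask token (assumption)
IGNORE_INDEX = -100      # positions excluded from the loss


def masked_inpainting_loss(model, seq_tokens, struct_tokens, mask_fraction=0.15):
    """Mask a random fraction of residues and score predictions only there."""
    # Choose positions to inpaint.
    mask = torch.rand_like(seq_tokens, dtype=torch.float) < mask_fraction

    # Replace masked residues with the mask token; keep structure tokens intact
    # so the model can condition on local geometry.
    corrupted = seq_tokens.masked_fill(mask, MASK_ID)

    # Targets: true residue identities at masked sites, ignored elsewhere.
    targets = seq_tokens.masked_fill(~mask, IGNORE_INDEX)

    logits = model(corrupted, struct_tokens)          # (batch, length, VOCAB_SIZE)
    loss = F.cross_entropy(
        logits.reshape(-1, VOCAB_SIZE), targets.reshape(-1),
        ignore_index=IGNORE_INDEX,
    )

    # Multiclass accuracy at masked positions, the metric quoted above.
    with torch.no_grad():
        preds = logits.argmax(dim=-1)
        acc = (preds[mask] == seq_tokens[mask]).float().mean()
    return loss, acc
```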
Leveraging EuroHPC AI Factory Large Scale resources, the project team will parallelise data preprocessing, model training, and hyperparameter exploration across distributed GPU clusters, reducing time-to-solution and enabling extensive ablation studies. The supervised inpainting framework is, to our knowledge, novel within PLM research, combining computer-vision autoencoder principles with protein sequence modelling.
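As a rough illustration of how such training could be spread across a GPU cluster, the sketch below uses PyTorch DistributedDataParallel with a DistributedSampler. The model constructor, dataset class, and hyperparameters (build_model, StructureSequenceDataset, batch size, learning rate) are hypothetical placeholders, not the project's actual job scripts; masked_inpainting_loss refers to the sketch above.

```python
# Sketch of data-parallel training across GPU nodes (illustrative only;
# launched with e.g. `torchrun --nnodes=N --nproc_per_node=G train.py`).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler


def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE in the environment.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = build_model().cuda(local_rank)            # placeholder constructor
    model = DDP(model, device_ids=[local_rank])

    dataset = StructureSequenceDataset(...)           # placeholder dataset
    sampler = DistributedSampler(dataset)             # shards data across ranks
    loader = DataLoader(dataset, batch_size=8, sampler=sampler, num_workers=4)

    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    num_epochs = 10                                   # placeholder
    for epoch in range(num_epochs):
        sampler.set_epoch(epoch)                      # reshuffle shards per epoch
        for seq_tokens, struct_tokens in loader:
            loss, _ = masked_inpainting_loss(
                model,
                seq_tokens.cuda(local_rank),
                struct_tokens.cuda(local_rank),
            )
            optimizer.zero_grad()
            loss.backward()                           # gradients all-reduced by DDP
            optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```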
Expected outcomes include a supervised PLM with state-of-the-art inpainting performance, a benchmark dataset of masked structure-sequence pairs, and training recipes. This work will accelerate protein engineering workflows, enabling rapid prototyping of enzyme variants, therapeutic antibodies, and novel protein scaffolds with direct industrial applications.
Nicanor Zalba, InsAIght S.L., Spain