Genome Foundation Models (FMs) relate to genomes in the same way that Large Language Models (LLMs) relate to human language: they learn the underlying "grammar" through unsupervised pre-training on vast amounts of raw data, producing features that are transferable to supervised tasks. While current state-of-the-art genome FMs successfully map local DNA syntax, they treat sequences in isolation and rely on data-inefficient Masked Language Modeling (MLM), and so remain fundamentally blind to evolutionary history. This project proposes the development of a Phylogenetic Foundation Model (Phylo FM), an FM for eukaryotic genomes that natively embeds cross-species conservation and divergence. Replacing standard 5% MLM with a novel phylogenetic self-supervision loss, based on Continuous-Time Markov Chain (CTMC) posteriors computed from massive multiple sequence alignments (MSAs), allows the model to be trained bidirectionally on 100% of genome sites simultaneously. To realize this, we request 49,200 node-hours on the Leonardo Booster system. The project will be executed in four stages: (1) capacity pre-training of our baseline architecture across 10 distinct additional clades, (2) massive phylogenetic computations of CTMC models, (3) integration of the novel phylogenetic loss, and (4) capability scaling to a sequence-to-sequence model with more than 1 billion parameters. This final stage requires Leonardo's high-bandwidth InfiniBand network to implement Fully Sharded Data Parallel (FSDP) training. The resulting Phylo FM will enable fast, evidence-independent inference to annotate the genomes of millions of species for the Earth BioGenome Project and to accurately predict genetic variant effects.
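To illustrate the idea of a CTMC-posterior training target, the following is a minimal sketch and not the project's actual loss. It makes several simplifying assumptions that are not stated in the abstract: a Jukes-Cantor (JC69) substitution model, a star-shaped phylogeny with known branch lengths, and a cross-entropy between the per-site posterior and the model's predicted base distribution. All function names (`jc69_transition`, `site_posterior`, `phylo_loss`) and the toy alignment columns are hypothetical.

```python
import numpy as np

BASES = "ACGT"

def jc69_transition(t):
    """4x4 transition matrix P(t) of the JC69 CTMC (assumed model);
    t is the branch length in expected substitutions per site."""
    decay = np.exp(-4.0 * t / 3.0)
    return np.full((4, 4), 0.25 * (1.0 - decay)) + np.eye(4) * decay

def site_posterior(observed, branch_lengths):
    """Posterior over the hidden base at one alignment column,
    assuming a star tree with one branch per observed species."""
    post = np.full(4, 0.25)  # uniform stationary distribution of JC69
    for base, t in zip(observed, branch_lengths):
        # Multiply in the likelihood of observing `base` for each hidden state.
        post = post * jc69_transition(t)[:, BASES.index(base)]
    return post / post.sum()

def phylo_loss(pred_probs, targets):
    """Mean cross-entropy between soft posterior targets and predictions.
    Every site contributes, in contrast to MLM, where only masked sites do."""
    return float(-(targets * np.log(pred_probs + 1e-12)).sum(axis=1).mean())

# Toy example: three alignment columns observed in three other species.
columns = ["AAA", "AAG", "CTG"]
branch_lengths = [0.1, 0.2, 0.4]
targets = np.array([site_posterior(col, branch_lengths) for col in columns])

uniform = np.full_like(targets, 0.25)
print(phylo_loss(uniform, targets))  # loss of an uninformed predictor
print(phylo_loss(targets, targets))  # lower bound: entropy of the targets
```

In this sketch every alignment column yields a soft target, which is the sense in which such a loss could use 100% of sites; a realistic implementation would replace the star tree with a full phylogeny and a richer substitution model.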
Principal Investigator, Company and Country
Mario Stanke, University of Greifswald, Germany