The European High Performance Computing Joint Undertaking (EuroHPC JU)

Joint Proof of Concept — Federated Training of a 32B-Parameter LLM Across Heterogeneous European HPCs

800,000 Awarded Resources (in node hours)
LUMI-G System Partition
Allocation Period: November 2025, 3 months

Project Summary: Federated Training of a 32B-Parameter LLM Across Heterogeneous European HPCs using SYNNQ Pulse 

1. Objective

This document outlines the technical architecture and execution plan for the POC with the University of Torino and LUMI-G: asynchronous federated core training of a 32-billion-parameter foundational language model across heterogeneous European HPCs (phase 2(b) of the EURO STACK LLM) using SYNNQ Pulse, a distributed orchestration infrastructure designed for privacy-compliant, scalable AI training. This 32B model forms the next baseline layer of a sovereign European LLM, trained exclusively on curated, audited, and legally compliant datasets across a heterogeneous, decentralised network of compute nodes in Europe.

2. Training Framework Overview

The SYNNQ Pulse system facilitates the training of large-scale models in environments where:

- Data cannot be centralised (due to legal or trust constraints),

- Compute is highly heterogeneous (ranging from HPC clusters to enterprise GPUs),

- Interoperability and fault tolerance are critical.

The goal is to distribute the training process across dozens or hundreds of compute nodes using federated orchestration, while ensuring:

- Training data integrity and version control,

- Hardware-aware workload scheduling,

- Secure training result integration.

3. Dataset Preparation and Sharding

3.1 Dataset Curation

SYNNQ Pulse curates the additional training dataset from high-quality sources that satisfy:

- GDPR and EU AI Act compliance,

- Verified licensing and content provenance,

- Sectoral and linguistic diversity reflecting European use cases.

3.2 Data Sharding

Once curated, the dataset is partitioned into training shards, each a discrete semantic unit (e.g., by topic, language, document type). Shards are defined by:

- Uniform token count (target ~10M tokens per shard),

- Balanced content complexity,

- Consistent domain representation.

Each shard is assigned a unique identifier and is digitally signed to ensure auditability and content traceability.
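The sharding and signing steps above can be sketched as follows. This is a minimal illustration, not the SYNNQ Pulse implementation: the greedy packing, the UUID shard identifier, and the SHA-256 digest standing in for a full asymmetric signature (e.g. Ed25519) are all assumptions made for the example.

```python
import hashlib
import uuid

TARGET_TOKENS = 10_000_000  # ~10M tokens per shard, per the target above

def make_shards(token_streams, target=TARGET_TOKENS):
    """Greedily pack tokenised documents into shards of roughly uniform token count."""
    shards, current, count = [], [], 0
    for doc in token_streams:          # each doc is a list of token ids
        current.append(doc)
        count += len(doc)
        if count >= target:
            shards.append(current)
            current, count = [], 0
    if current:
        shards.append(current)
    return shards

def sign_shard(shard):
    """Assign a unique identifier and a content digest for traceability.
    A production deployment would apply an asymmetric signature; here a
    plain SHA-256 digest stands in for the signing step."""
    payload = b"".join(str(tok).encode() for doc in shard for tok in doc)
    return {
        "shard_id": str(uuid.uuid4()),
        "digest": hashlib.sha256(payload).hexdigest(),
        "n_tokens": sum(len(doc) for doc in shard),
    }
```

The digest recorded here is what the aggregation layer would later recompute to verify that a shard was trained on unmodified content.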

4. Training Orchestration Steps

Step 1: Shard Allocation

Using node capability profiles, SYNNQ Pulse matches data shards to participating nodes. Matching criteria include:

- Architecture compatibility (e.g., optimized for CUDA or ROCm),

- Training batch size vs. memory capacity,

- Energy efficiency and thermal constraints.

Step 2: Local Training Cycle

Each node unpacks the training shard and begins fine-tuning a shared model architecture checkpoint, using containerised training environments (Docker/Singularity) to ensure consistency.

Training duration:

- Target: 5 epochs per shard (~2–5 GPU-hours depending on node class).

- Training environment: PyTorch + DeepSpeed/FSDP optimisation.

Each local model logs:

- Gradient norms,

- Loss curves,

- Token throughput,

- System diagnostics (power draw, utilization).

Step 3: Submission and Integration

Nodes send back the trained shard checkpoint and associated logs to the SYNNQ Aggregation Layer. This layer performs:

- Integrity checks (checksum, signature validation),

- Performance scoring (e.g., convergence, underfitting detection),

- Outlier filtering (detects noise or adversarial gradients),

- Optional differential privacy pass-through (configurable by context).

Step 4: Federated Aggregation

Training outputs are merged using Federated Averaging and Gradient Scaling techniques. Key steps:

- Layer-wise normalisation of updates,

- Time-decay weighting for asynchronous contributors,

- Domain balancing to prevent data dominance from any sector.

A new global checkpoint is constructed and redistributed for the next training round.
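The time-decay weighting for asynchronous contributors can be made concrete with the sketch below, which applies an exponential decay to late-arriving updates before averaging. The flat-list representation and the `half_life` parameter are simplifications for the example; a real run would operate layer-wise on tensors, as noted above.

```python
def federated_average(updates, half_life=3600.0):
    """Merge per-node parameter updates with exponential time-decay weights:
    a contribution arriving `age_seconds` late is down-weighted by
    0.5 ** (age_seconds / half_life), then all updates are averaged."""
    total_w, merged = 0.0, None
    for params, age_seconds in updates:
        w = 0.5 ** (age_seconds / half_life)
        if merged is None:
            merged = [w * p for p in params]
        else:
            merged = [m + w * p for m, p in zip(merged, params)]
        total_w += w
    return [m / total_w for m in merged]
```

Domain balancing would enter as an additional multiplicative factor on `w`, capping the aggregate weight of any one sector's contributions.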

5. Training Cycle Management

Training proceeds in iterative global rounds; within each round, nodes contribute asynchronously and their updates are merged as described in Step 4.

Each round involves:

- New shard-node mappings (avoiding training redundancy),

- Global checkpoint refinement and redistribution,

- Updated learning rate schedule and optimizer states.

Checkpoint evaluation uses:

- Cross-validation on held-out EU test set,

- Perplexity and F1 benchmarks on public NLP tasks,

- Internal SYNNQ compliance test suite (bias, toxicity, explainability).

6. Fault Tolerance and Redundancy

The system incorporates real-time monitoring via SYNNQ Pulse’s Control Plane. Key features:

- Node drop detection and fallback reassignment,

- Checkpoint failover and rollback,

- Shard redistribution upon node failure or underperformance,

- Anomaly detection in training curves.

All interactions are logged immutably for auditing and reproducibility.
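Node-drop detection and fallback reassignment from the feature list above can be sketched with a heartbeat timeout and a round-robin redistribution. The 120-second timeout and the round-robin policy are assumptions for illustration, not Control Plane defaults.

```python
def detect_dropped(last_heartbeat, now, timeout=120.0):
    """Flag nodes whose most recent heartbeat is older than `timeout` seconds."""
    return [node for node, ts in last_heartbeat.items() if now - ts > timeout]

def reassign(dropped, assignments, healthy):
    """Redistribute shards held by dropped nodes round-robin over healthy ones."""
    new, i = dict(assignments), 0
    for shard, node in assignments.items():
        if node in dropped:
            new[shard] = healthy[i % len(healthy)]
            i += 1
    return new
```

In practice, reassignment would reuse the capability-aware matching from Step 1 rather than plain round-robin, and each reassignment event would be appended to the immutable audit log.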

7. Security and Data Handling Protocols

To ensure security and compliance:

- Data shards are encrypted at rest and in transit.

- Only training output (no raw text) is sent back to SYNNQ.

- Participating nodes sign a Federated Compute Participation Agreement, affirming data retention, isolation, and destruction terms.

- All training logs are anonymized prior to central analysis.

8. Expected Output

Upon completion of the iterative training process, SYNNQ Pulse will release:

- A 32B parameter LLM checkpoint trained entirely on audited EU data, with verified provenance.

- Evaluation benchmarks and model cards detailing compliance, use cases, and known limitations.

- Inference APIs for initial use by stakeholders (e.g., government, healthcare, legal).

This model serves as the baseline foundation for follow-on models at other scales (24B, 70B, and beyond), reusing the same federated orchestration layer.

9. Conclusion

The core training of a 32B-parameter foundational LLM using SYNNQ Pulse will demonstrate that asynchronous, federated, privacy-compliant AI training is not only possible but also scalable, secure, and efficient. By intelligently matching curated data with Europe's diverse compute infrastructure, SYNNQ Pulse lays the foundation for a truly sovereign AI capability: built by Europe, for Europe.