
Federated Core Training of Large Language Models

Awarded Resources: 600,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: August 2025 - 3 months

Federated Core Training of a 27B EURO STACK LLM Using SYNNQ Pulse.

Project Summary

1. Objective

This document outlines the technical architecture and execution plan for the federated core training of a 27-billion-parameter foundational language model, the next phase of the EURO STACK LLM, using SYNNQ Pulse, a distributed orchestration infrastructure designed for privacy-compliant, scalable AI training. This 27B model forms the next baseline layer of a sovereign European LLM, trained exclusively on curated, audited, and legally compliant datasets across a heterogeneous, decentralized network of compute nodes in Europe.

2. Training Framework Overview

The SYNNQ Pulse system facilitates the training of large-scale models in environments where:

- Data cannot be centralized (due to legal or trust constraints),

- Compute is highly heterogeneous (ranging from HPC clusters to enterprise GPUs),

- Interoperability and fault tolerance are critical.

The goal is to distribute the training process across dozens or hundreds of compute nodes using federated orchestration, while ensuring:

- Training data integrity and version control,

- Hardware-aware workload scheduling,

- Secure training result integration.

3. Dataset Preparation and Sharding

3.1 Dataset Curation

SYNNQ Pulse curates the additional 20B-token training dataset from high-quality sources that satisfy:

- GDPR and EU AI Act compliance,

- Verified licensing and content provenance,

- Sectoral and linguistic diversity reflecting European use cases.

3.2 Data Sharding

Once curated, the dataset is partitioned into training shards, each a discrete semantic unit (e.g., by topic, language, document type). Shards are defined by:

- Uniform token count (target ~10M tokens per shard),

- Balanced content complexity,

- Consistent domain representation.

Each shard is assigned a unique identifier and is digitally signed to ensure auditability and content traceability.
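
As an illustration of this step, the following minimal Python sketch builds a shard manifest with a unique identifier, a SHA-256 content hash, and an Ed25519 signature. The manifest fields and the use of the third-party cryptography package are assumptions made for illustration, not the actual SYNNQ Pulse format.

    # Hypothetical sketch: building a signed shard manifest (field names are illustrative).
    import hashlib, json, uuid
    from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey

    def build_shard_manifest(token_ids: list[int], domain: str, language: str,
                             signing_key: Ed25519PrivateKey) -> dict:
        """Assign a unique ID to a shard and sign its content hash for traceability."""
        payload = json.dumps(token_ids).encode("utf-8")
        content_hash = hashlib.sha256(payload).hexdigest()
        manifest = {
            "shard_id": str(uuid.uuid4()),      # unique identifier
            "domain": domain,                   # e.g. "legal", "healthcare"
            "language": language,               # e.g. "de", "fr"
            "token_count": len(token_ids),      # target ~10M tokens per shard
            "sha256": content_hash,
        }
        # Sign the canonical manifest so any later modification is detectable.
        canonical = json.dumps(manifest, sort_keys=True).encode("utf-8")
        manifest["signature"] = signing_key.sign(canonical).hex()
        return manifest

    key = Ed25519PrivateKey.generate()
    print(build_shard_manifest(list(range(1000)), "legal", "de", key)["shard_id"])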

4. Training Orchestration Steps

Step 1: Shard Allocation

Using node capability profiles, SYNNQ Pulse matches data shards to participating nodes; a sketch of this matching appears after the list. Matching criteria include:

- Architecture compatibility (e.g., optimized for CUDA or ROCm),

- Training batch size vs. memory capacity,

- Energy efficiency and thermal constraints.
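
The sketch below illustrates capability-based matching with hard constraints (architecture, memory) and a soft energy-efficiency score. The NodeProfile fields and the selection rule are assumptions for illustration, not the actual SYNNQ Pulse scheduler.

    # Illustrative sketch of shard-to-node matching; fields are assumptions.
    from dataclasses import dataclass

    @dataclass
    class NodeProfile:
        node_id: str
        arch: str              # "cuda" or "rocm"
        mem_gib: int           # per-GPU memory
        perf_per_watt: float   # relative energy-efficiency score

    def match_shard(shard: dict, nodes: list[NodeProfile],
                    required_arch: set[str], mem_needed_gib: int) -> NodeProfile | None:
        """Pick the most energy-efficient node that satisfies the hard constraints."""
        eligible = [n for n in nodes
                    if n.arch in required_arch          # architecture compatibility
                    and n.mem_gib >= mem_needed_gib]    # batch fits in memory
        if not eligible:
            return None                                 # shard stays queued
        return max(eligible, key=lambda n: n.perf_per_watt)

    nodes = [NodeProfile("hpc-01", "cuda", 64, 1.8), NodeProfile("ent-07", "rocm", 128, 2.3)]
    print(match_shard({"shard_id": "s-001"}, nodes, {"cuda", "rocm"}, 80).node_id)  # -> ent-07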

Step 2: Local Training Cycle

Each node unpacks its training shard and fine-tunes the shared model checkpoint inside a containerized training environment (Docker/Singularity) to ensure consistency.

Training configuration:

- Target: 5 epochs per shard (~2–5 GPU-hours depending on node class).

- Training environment: PyTorch + DeepSpeed/FSDP optimization.

Each local training run logs (see the sketch after this list):

- Gradient norms,

- Loss curves,

- Token throughput,

- System diagnostics (power draw, utilization).
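
The following sketch shows the shape of one local training cycle and the metrics it records, using plain PyTorch with toy model dimensions and synthetic batches; in the real setup the shared 27B checkpoint would be wrapped in DeepSpeed or FSDP, which is omitted here for brevity.

    # Toy sketch of a local training cycle with metric logging (all sizes illustrative).
    import time
    import torch
    from torch import nn

    model = nn.Sequential(nn.Embedding(1000, 64), nn.Flatten(), nn.Linear(64 * 16, 1000))
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()

    def train_on_shard(batches, epochs: int = 5) -> list[dict]:
        """Fine-tune on one shard and record the per-step metrics described above."""
        logs = []
        for epoch in range(epochs):
            for tokens, labels in batches:
                start = time.time()
                optimizer.zero_grad()
                loss = loss_fn(model(tokens), labels)
                loss.backward()
                # Gradient norm is logged before the optimizer step.
                grad_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
                optimizer.step()
                logs.append({
                    "epoch": epoch,
                    "loss": loss.item(),
                    "grad_norm": grad_norm.item(),
                    "tokens_per_s": tokens.numel() / (time.time() - start),
                })
        return logs

    batches = [(torch.randint(0, 1000, (8, 16)), torch.randint(0, 1000, (8,))) for _ in range(4)]
    print(train_on_shard(batches, epochs=1)[-1])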

Step 3: Submission and Integration

Nodes send the trained shard checkpoint and associated logs back to the SYNNQ Aggregation Layer. This layer performs (see the sketch after this list):

- Integrity checks (checksum, signature validation),

- Performance scoring (e.g., convergence, underfitting detection),

- Outlier filtering (detects noise or adversarial gradients),

- Optional differential privacy pass-through (configurable by context).
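
A minimal sketch of two of these checks, hash validation and a median-based outlier filter, is shown below; the threshold and field names are illustrative assumptions rather than the SYNNQ implementation.

    # Hypothetical ingest-side checks: checksum validation and norm-based outlier filtering.
    import hashlib
    import statistics

    def verify_submission(payload: bytes, expected_sha256: str) -> bool:
        """Reject any checkpoint whose content hash does not match the signed manifest."""
        return hashlib.sha256(payload).hexdigest() == expected_sha256

    def filter_outliers(update_norms: dict[str, float], k: float = 3.0) -> set[str]:
        """Flag contributions whose update norm deviates strongly from the median
        (a crude guard against noisy or adversarial gradients)."""
        median = statistics.median(update_norms.values())
        mad = statistics.median(abs(v - median) for v in update_norms.values()) or 1e-9
        return {node for node, v in update_norms.items() if abs(v - median) / mad > k}

    norms = {"hpc-01": 1.02, "ent-07": 0.98, "edge-13": 9.7}
    print(filter_outliers(norms))  # -> {'edge-13'}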

Step 4: Federated Aggregation

Training outputs are merged using Federated Averaging and Gradient Scaling techniques. Key steps:

- Layer-wise normalization of updates,

- Time-decay weighting for asynchronous contributors,

- Domain balancing to prevent data dominance from any sector.

A new global checkpoint is constructed and redistributed for the next training round.
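
The sketch below illustrates the weighted averaging step only, with hypothetical weighting terms for contribution size, freshness (time decay), and domain balance; it is not the exact SYNNQ aggregation rule.

    # Minimal Federated Averaging sketch; the weighting formulas are assumptions.
    import math
    import torch

    def contributor_weight(tokens: int, delay_s: float, domain_share: float,
                           half_life_s: float = 3600.0) -> float:
        decay = math.exp(-delay_s * math.log(2) / half_life_s)   # time-decay for late updates
        balance = 1.0 / max(domain_share, 1e-3)                  # down-weight dominant domains
        return tokens * decay * balance

    def federated_average(updates: list[dict[str, torch.Tensor]],
                          weights: list[float]) -> dict[str, torch.Tensor]:
        """Weighted, layer-wise average of contributor state dicts."""
        total = sum(weights)
        merged = {}
        for name in updates[0]:
            stacked = torch.stack([w / total * u[name] for u, w in zip(updates, weights)])
            merged[name] = stacked.sum(dim=0)
        return merged

    u1 = {"layer.weight": torch.ones(2, 2)}
    u2 = {"layer.weight": torch.zeros(2, 2)}
    w = [contributor_weight(10_000_000, 0.0, 0.5), contributor_weight(10_000_000, 7200.0, 0.5)]
    print(federated_average([u1, u2], w)["layer.weight"])   # 0.8 everywhere: late update down-weighted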

5. Training Cycle Management

Training follows iterative, synchronized rounds; a toy round-loop sketch appears after the list below.

Each round involves:

- New shard-node mappings (avoiding training redundancy),

- Global checkpoint refinement and redistribution,

- Updated learning rate schedule and optimizer states.
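
The round loop below is a toy sketch of this cycle: it rotates shard-node mappings each round, produces a new global checkpoint, and decays the learning rate. All helpers are stand-ins for the matching, training, and aggregation steps described earlier.

    # Toy round-management loop; all values and helpers are illustrative stand-ins.
    def run_round(round_idx: int, shards: list[str], nodes: list[str],
                  base_lr: float = 1e-4) -> tuple[dict, float]:
        # Rotate shard-node mappings each round to avoid repeating the same pairings.
        mapping = {s: nodes[(i + round_idx) % len(nodes)] for i, s in enumerate(shards)}
        # Stand-ins for: dispatch shards, run local training, validate and aggregate.
        local_losses = [1.0 / (round_idx + 1) for _ in mapping]
        new_ckpt = {"round": round_idx, "mapping": mapping,
                    "mean_loss": sum(local_losses) / len(local_losses)}
        lr = base_lr * 0.5 ** round_idx        # updated learning-rate schedule
        return new_ckpt, lr

    for r in range(3):
        ckpt, lr = run_round(r, ["s1", "s2", "s3"], ["hpc-01", "ent-07"])
        print(r, ckpt["mean_loss"], lr)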

Checkpoint evaluation uses the following (a perplexity sketch appears after this list):

- Cross-validation on a held-out EU test set,

- Perplexity and F1 benchmarks on public NLP tasks,

- Internal SYNNQ compliance test suite (bias, toxicity, explainability).
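
For the perplexity benchmark, the sketch below shows the standard computation: the exponential of the mean token-level cross-entropy on held-out data. The toy tensors stand in for model outputs and reference tokens.

    # Perplexity as exp(mean cross-entropy) over held-out tokens.
    import torch
    from torch import nn

    def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
        """logits: (batch, seq, vocab); targets: (batch, seq) token ids."""
        loss = nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        return torch.exp(loss).item()

    logits = torch.randn(2, 16, 1000)
    targets = torch.randint(0, 1000, (2, 16))
    print(perplexity(logits, targets))   # large (near chance level) for random logits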

6. Fault Tolerance and Redundancy

The system incorporates real-time monitoring via SYNNQ Pulse’s Control Plane. Key features:

- Node drop detection and fallback reassignment,

- Checkpoint failover and rollback,

- Shard redistribution upon node failure or underperformance,

- Anomaly detection in training curves.

All interactions are logged immutably for auditing and reproducibility.
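
A minimal control-plane sketch covering heartbeat-based drop detection and shard reassignment is given below; the timeout value and data structures are illustrative assumptions, not the actual SYNNQ Pulse implementation.

    # Hypothetical control-plane sketch: node-drop detection and shard reassignment.
    import time

    class ControlPlane:
        def __init__(self, timeout_s: float = 300.0):
            self.timeout_s = timeout_s
            self.last_seen: dict[str, float] = {}     # node_id -> last heartbeat time
            self.assignments: dict[str, str] = {}     # shard_id -> node_id

        def heartbeat(self, node_id: str) -> None:
            self.last_seen[node_id] = time.time()

        def dropped_nodes(self) -> set[str]:
            now = time.time()
            return {n for n, t in self.last_seen.items() if now - t > self.timeout_s}

        def reassign(self, healthy_nodes: list[str]) -> dict[str, str]:
            """Move shards off dropped nodes onto healthy ones."""
            dropped = self.dropped_nodes()
            for i, (shard, node) in enumerate(self.assignments.items()):
                if node in dropped and healthy_nodes:
                    self.assignments[shard] = healthy_nodes[i % len(healthy_nodes)]
            return self.assignments

    cp = ControlPlane(timeout_s=0.1)
    cp.assignments = {"s1": "hpc-01", "s2": "ent-07"}
    cp.heartbeat("hpc-01"); cp.heartbeat("ent-07")
    time.sleep(0.2)
    cp.heartbeat("ent-07")                  # hpc-01 misses its heartbeat window
    print(cp.reassign(["ent-07"]))          # -> {'s1': 'ent-07', 's2': 'ent-07'}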

7. Security and Data Handling Protocols

To ensure security and compliance (an encryption sketch appears after this list):

- Data shards are encrypted at rest and in transit.

- Only training output (no raw text) is sent back to SYNNQ.

- Participating nodes sign a Federated Compute Participation Agreement, affirming data retention, isolation, and destruction terms.

- All training logs are anonymized prior to central analysis.
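
As an illustration of encryption at rest, the sketch below uses authenticated symmetric encryption (Fernet from the Python cryptography package); key distribution and transport security (TLS) are out of scope here, and this is an assumed workflow rather than the SYNNQ key-management scheme.

    # Illustrative shard encryption/decryption with authenticated symmetric encryption.
    from cryptography.fernet import Fernet

    def encrypt_shard(shard_bytes: bytes, key: bytes) -> bytes:
        """Encrypt a serialized shard before it is written to disk or shipped to a node."""
        return Fernet(key).encrypt(shard_bytes)

    def decrypt_shard(ciphertext: bytes, key: bytes) -> bytes:
        """Decrypt inside the node's isolated training environment; raises on tampering."""
        return Fernet(key).decrypt(ciphertext)

    key = Fernet.generate_key()
    blob = encrypt_shard(b'{"shard_id": "s-001", "tokens": [1, 2, 3]}', key)
    print(decrypt_shard(blob, key))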

8. Expected Output

Upon completion of the iterative training process, SYNNQ Pulse will release:

- A 27B parameter LLM checkpoint trained entirely on audited EU data, with verified provenance.

- Evaluation benchmarks and model cards detailing compliance, use cases, and known limitations.

- Inference APIs for initial use by stakeholders (e.g., government, healthcare, legal).

This model serves as the baseline foundation for follow-on models at other scales (24B, 70B, and beyond), reusing the same federated orchestration layer.

9. Technical Benefits

Benefit                 | Description
Hardware-agnostic       | Utilizes any CUDA/ROCm-compatible GPU (MI250X, A100, etc.)
Energy-efficient        | Leverages idle infrastructure, reducing carbon footprint
Privacy-preserving      | No raw data leaves the node
Legally compliant       | Fully aligns with GDPR and EU AI Act
Transparent & auditable | Full process logging and shard traceability

10. Conclusion

The core training of a 27B-parameter foundational LLM using SYNNQ Pulse will demonstrate that federated, privacy-compliant AI training is not only possible but also scalable, secure, and efficient. By intelligently matching curated data with Europe's diverse compute infrastructure, SYNNQ Pulse lays the foundation for a truly sovereign AI capability: built by Europe, for Europe.