Abkarino

Awarded Resources (in node hours): 610,000

System Partition: Leonardo BOOSTER

Allocation Period: August 2025 - 12 months

Abkarino is a domain-specialised, text-only AI foundation model for computational chemistry. It supports day-to-day work in organic, inorganic, and physical chemistry while natively integrating regulatory and compliance reasoning. Target use cases include reaction troubleshooting, solvent and catalyst optimisation, greener process design, and early detection of regulatory risk.

The project requested 610,000 GPU-hours on the EuroHPC Leonardo Booster system, along with high-throughput storage for dataset shards and multi-terabyte checkpoint storage, to support large-scale distributed training.

Over a 12-month period, the project will complete the first end-to-end pretraining, expert alignment, and evaluation cycle. The training data consists of a curated corpus of public-domain/licensed chemistry literature (~100 GB raw text), expanded via structured reaction-graph linearisation to a ~900B-token training set.

Project objectives in this phase are threefold. First, to pretrain a 6.7B-parameter transformer optimised for 32–64k-token long-context reasoning across the expanded chemistry corpus. Second, to align the model with chemical-expert human feedback to ensure mechanistic plausibility, realistic yields, safety awareness, and accurate interpretation of regulatory language. Third, to evaluate performance on reaction troubleshooting, greener alternative suggestions, catalyst justification, and REACH-style compliance scenarios spanning all chemistry domains.
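As a rough sense of scale for the first objective, the pretraining cost can be sketched with the widely used rule of thumb that training takes about 6 FLOPs per parameter per token. This is only an illustration: the rule ignores the attention cost that grows with 32–64k-token contexts, and the per-GPU peak throughput and utilisation figures below are assumptions chosen for the example, not values stated by the project.

```python
# Rough pretraining compute estimate using the common C ~ 6 * N * D rule of thumb.
# Hardware figures below are illustrative assumptions, not project values.

PARAMS = 6.7e9   # model parameters (from the project description)
TOKENS = 900e9   # training tokens (from the project description)

ASSUMED_PEAK_FLOPS = 312e12   # assumed per-GPU peak, dense BF16 on an A100-class GPU
ASSUMED_UTILISATION = 0.40    # assumed fraction of peak sustained during training


def training_flops(params: float, tokens: float) -> float:
    """Approximate forward plus backward FLOPs for one pass over `tokens`."""
    return 6.0 * params * tokens


def gpu_hours(flops: float, peak: float, utilisation: float) -> float:
    """Convert a FLOP budget into GPU-hours at a sustained throughput."""
    return flops / (peak * utilisation) / 3600.0


total = training_flops(PARAMS, TOKENS)
hours = gpu_hours(total, ASSUMED_PEAK_FLOPS, ASSUMED_UTILISATION)
print(f"{total:.2e} FLOPs, about {hours:,.0f} GPU-hours for one pass")
```

Changing the assumed utilisation or adding an attention term for long sequences shifts the result considerably, so the figure is best read as an order-of-magnitude check covering a single pass, not alignment, evaluation, or repeated runs.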

Innovation is built into the design. Abkarino performs literature-grounded reasoning that links mechanisms, kinetics, thermodynamics, and safety guidance without relying on multimodal inputs. It is jointly tuned for scientific validity and compliance, so outputs explain both why a step is plausible and whether it risks non-compliance. Generation is guided by green-chemistry heuristics (“greener-by-design” decoding). Recommendations come with transparent, auditable rationales backed by built-in validators such as atom/charge balance checks and incompatible-reagent flags. The training strategy is sample-efficient, drawing strength from a focused, high-quality corpus.
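To make the idea of an atom/charge balance validator concrete, here is a minimal sketch of what such a check might look like. It is not Abkarino's implementation: the function names, the flat formula syntax (for example `SO4^2-`, with no parentheses or hydrates), and the `(coefficient, formula)` input format are all assumptions made for illustration.

```python
import re
from collections import Counter

# Element symbol followed by an optional count, e.g. "C", "H2", "Na".
_ELEMENT = re.compile(r"([A-Z][a-z]?)(\d*)")
# Optional trailing charge written after a caret, e.g. "^+", "^2-".
_CHARGE = re.compile(r"\^(\d*)([+-])$")
# A whole flat formula: one or more element/count pairs and nothing else.
_FLAT_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def parse_species(formula: str) -> tuple[Counter, int]:
    """Parse a flat formula such as 'SO4^2-' into (atom counts, charge)."""
    charge = 0
    match = _CHARGE.search(formula)
    if match:
        magnitude = int(match.group(1) or 1)
        charge = magnitude if match.group(2) == "+" else -magnitude
        formula = formula[: match.start()]
    if _FLAT_FORMULA.fullmatch(formula) is None:
        raise ValueError(f"unsupported formula syntax: {formula!r}")
    atoms = Counter()
    for symbol, count in _ELEMENT.findall(formula):
        atoms[symbol] += int(count or 1)
    return atoms, charge


def side_totals(side: list[tuple[int, str]]) -> tuple[Counter, int]:
    """Sum atoms and charge over one side of a reaction."""
    atoms, charge = Counter(), 0
    for coefficient, formula in side:
        species_atoms, species_charge = parse_species(formula)
        for element, n in species_atoms.items():
            atoms[element] += coefficient * n
        charge += coefficient * species_charge
    return atoms, charge


def is_balanced(reactants: list[tuple[int, str]],
                products: list[tuple[int, str]]) -> bool:
    """True when both sides carry the same atoms and the same net charge."""
    return side_totals(reactants) == side_totals(products)


print(is_balanced([(2, "H2"), (1, "O2")], [(2, "H2O")]))   # balanced
print(is_balanced([(1, "H2"), (1, "O2")], [(1, "H2O")]))   # oxygen mismatch
```

A check of this kind only confirms stoichiometric bookkeeping; it says nothing about whether a step is mechanistically plausible, which is why the description pairs validators with other forms of reasoning.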

Data governance and privacy are treated as first-class requirements. Sources are public-domain or licensed; we maintain a provenance ledger and honour rightsholder opt-out. No confidential data is used for pretraining. Optional fine-tuning or retrieval-augmented generation (RAG) is strictly opt-in and access-controlled, with PII scrubbing, encryption in transit and at rest, and audit logging. Memorisation audits and output filters reduce verbatim regurgitation while still enabling reference citation.
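One simple way a memorisation audit might flag verbatim regurgitation is to measure how many of an output's word n-grams also occur in a source text. The sketch below assumes whitespace tokenisation and an 8-word window; both choices, and the function names, are illustrative and not taken from the project.

```python
def word_ngrams(tokens: list[str], n: int) -> set[tuple[str, ...]]:
    """All contiguous n-word windows in `tokens`, as a set."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def verbatim_overlap(output: str, source: str, n: int = 8) -> float:
    """Fraction of the output's word n-grams that also appear in the source.

    Returns 0.0 when the output is shorter than n words.
    """
    output_grams = word_ngrams(output.split(), n)
    if not output_grams:
        return 0.0
    source_grams = word_ngrams(source.split(), n)
    return len(output_grams & source_grams) / len(output_grams)


source_text = "the catalyst was added slowly to the stirred solution at room temperature"
copied = "the catalyst was added slowly to the stirred solution"
paraphrase = "we slowly introduced the catalyst while stirring the mixture at ambient temperature"

print(verbatim_overlap(copied, source_text))      # every window is copied
print(verbatim_overlap(paraphrase, source_text))  # no shared 8-word window
```

An output filter could apply a threshold on this score, while short quoted spans used for reference citation would fall below the window length and pass through.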

The expected impact is industrial and immediate. With the requested compute and storage, Abkarino enables capabilities suited to a highly regulated sector: faster resolution of reaction failures, improved catalyst/solvent selection, greener process pathways, and early compliance flagging. Together, these outcomes accelerate R&D while reducing operational and regulatory risk.

Principal Investigator, Company and Country