The PDF2XSL project aims to develop and train a foundational model capable of converting PDF documents into their corresponding XSL-FO (Extensible Stylesheet Language Formatting Objects) templates. This technology seeks to streamline the digital transformation of companies operating in the Customer Communications Management (CCM) sector, enabling the automated migration of legacy document templates with minimal human intervention and high fidelity to the original design.
The project is organized into several key phases:
(i) Dataset Construction and Anonymisation – We will curate a large-scale dataset consisting of paired PDFs and their corresponding XSL-FO templates, ensuring the anonymisation of all proprietary and client-specific data. In this phase, the project will also explore strategies for segmenting or chunking PDFs into meaningful visual–structural units and pairing each chunk with the corresponding portion of the original XSL-FO template. This fine-grained alignment will support more accurate model training, enable localised supervision, and improve the system’s ability to learn detailed layout–to–code correspondences, while making long-context sequences manageable.
(ii) Design and Training of a Novel PDF Embedder – A dedicated module will be developed to represent PDFs through a combination of OCR-based textual extraction, graphical feature analysis, and metadata processing, producing rich multimodal embeddings.
(iii) Pretraining and Generalisation Study – The model architecture will be pretrained on the collected dataset to assess its generalisation capabilities across diverse document types and layouts.
(iv) Encoder–Decoder Fine-tuning Stage – The entire model will be fine-tuned to achieve precise alignment between visual and structural document representations, while continuing to monitor its generalisation capabilities.
(v) Scalability and Zero-shot Evaluation – Finally, the scalability of the approach will be tested across different template families, evaluating its zero-shot transfer capabilities and potential for domain adaptation.
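The chunk–template alignment described in phase (i) could, in its simplest form, be driven by text similarity between what a PDF chunk renders and what a template fragment produces. The sketch below is purely illustrative: the function names, data shapes, and the greedy similarity-based strategy are assumptions, not the project's actual algorithm.

```python
from difflib import SequenceMatcher

def align_chunks(pdf_chunks, fo_fragments, threshold=0.3):
    """Pair each PDF chunk with the XSL-FO fragment whose rendered text
    overlaps it the most (greedy, similarity-based; a chunk is paired with
    None when no fragment clears the threshold)."""
    pairs = []
    for chunk in pdf_chunks:
        best, best_score = None, threshold
        for frag in fo_fragments:
            score = SequenceMatcher(None, chunk["text"], frag["text"]).ratio()
            if score > best_score:
                best, best_score = frag, score
        pairs.append((chunk, best))
    return pairs

# Toy data: two extracted PDF chunks and two template fragments.
pdf_chunks = [{"text": "Invoice total: 100 EUR"},
              {"text": "Customer address"}]
fo_fragments = [{"text": "Customer address block",
                 "fo": "<fo:block>Customer address block</fo:block>"},
                {"text": "Invoice total: {amount} EUR",
                 "fo": "<fo:block>Invoice total: {amount} EUR</fo:block>"}]
pairs = align_chunks(pdf_chunks, fo_fragments)
```

Each resulting pair is a localised (chunk, fragment) training example, which is what enables the fine-grained supervision and shorter context windows mentioned above.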
The proposed model will be primarily based on a Transformer architecture that, given a sequence of PDF-derived chunks, autoregressively generates the corresponding XSL-FO code. Both encoder–decoder and decoder-only configurations will be explored: the former due to the intrinsic difference between visual and structural representations (analogous to a machine translation setting), and the latter for its efficiency and proven robustness in large language model design.
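The autoregressive generation loop shared by both configurations can be sketched as follows. Here `encode` and `decode_step` stand in for trained networks and are replaced by trivial stubs, so the snippet only exercises the decoding loop itself.

```python
def generate_fo(encode, decode_step, chunk_embeddings,
                max_len=100, eos="</fo:root>"):
    """Greedy autoregressive decoding: the encoder maps the PDF chunk
    sequence to a memory, and the decoder emits one XSL-FO token at a
    time, conditioned on that memory and the tokens produced so far."""
    memory = encode(chunk_embeddings)
    tokens = ["<fo:root>"]
    while tokens[-1] != eos and len(tokens) < max_len:
        tokens.append(decode_step(memory, tokens))
    return tokens

# Toy stand-ins for the trained networks: the "decoder" replays a fixed
# token script, just to drive the loop to completion.
encode = lambda chunks: {"n_chunks": len(chunks)}
script = ["<fo:block>", "Hello", "</fo:block>", "</fo:root>"]
decode_step = lambda memory, toks: script[len(toks) - 1]
generated = generate_fo(encode, decode_step, [[0.1], [0.2]])
```

In a decoder-only configuration the same loop applies, except that the chunk embeddings are prepended to the token sequence rather than attended to as a separate memory.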
Currently, dataset creation and feature extraction pipelines are under development. An algorithm for aligning PDF segments with their template counterparts has been implemented, and preliminary experiments on embedding extraction are ongoing. Once both components are finalised, a large-scale pretraining phase will follow, culminating in an extensive final training run after model stability is achieved.
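As an illustration of what such an embedding-extraction step might combine, the toy sketch below fuses the three streams named in phase (ii), OCR text, graphical features, and metadata, by simple concatenation. All names and feature choices here are hypothetical; the actual embedder is a learned module.

```python
def embed_chunk(ocr_text, visual_feats, metadata, text_dim=8):
    """Fuse three streams into one vector: a hashed bag-of-words for the
    OCR text, the graphical features as-is, and two numeric metadata
    fields, all concatenated. A trained network would replace each
    hand-crafted part with learned representations."""
    text_vec = [0.0] * text_dim
    for word in ocr_text.lower().split():
        text_vec[hash(word) % text_dim] += 1.0  # hashed word counts
    meta_vec = [float(metadata["page"]), float(metadata["n_fonts"])]
    return text_vec + list(visual_feats) + meta_vec

# One chunk: 8 text dims + 2 visual dims + 2 metadata dims = 12 dims.
emb = embed_chunk("Invoice total", [0.5, 0.25], {"page": 1, "n_fonts": 2})
```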
Subsequent fine-tuning experiments will assess the adaptability of the approach to novel templates and domains.
Robert Dosen, doxee, Italy