
Explainable Internal Representations: Tracing Thought Processes in Multimodal LLMs

Awarded Resources: 50,000 node hours
System Partition: Leonardo BOOSTER
Allocation Period: June 2025 - May 2026

AI Technology: Generative Language Modeling, Deep Learning, Vision (image recognition, image generation, text recognition/OCR, etc.), Audio (speech recognition, speech synthesis, etc.).

This project aims to replicate Anthropic’s recent breakthrough in explainability (Lindsey et al., 2025) and extend it to multimodal LLMs, specifically visual LLMs such as Qwen-VL and Gemma.
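
A concrete starting point for such replication is recording the intermediate activations of a transformer with forward hooks. The sketch below illustrates the pattern on GPT-2 as a lightweight stand-in (loading code for a specific Qwen-VL or Gemma checkpoint is not shown here and would differ); in the actual study the same hooks would be attached to the blocks of the multimodal model.

```python
# Minimal sketch: record per-layer hidden activations with forward hooks.
# GPT-2 is used only as a small stand-in model; the hook pattern is the same
# for a vision-language checkpoint once it is loaded.
import torch
from transformers import AutoModel, AutoTokenizer

model_name = "gpt2"  # stand-in; swap for a multimodal checkpoint in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name).eval()

activations = {}

def make_hook(layer_idx):
    def hook(module, inputs, output):
        # output[0] is the block's hidden state: (batch, seq, hidden)
        activations[layer_idx] = output[0].detach()
    return hook

handles = [block.register_forward_hook(make_hook(i)) for i, block in enumerate(model.h)]

with torch.no_grad():
    batch = tokenizer("A dog barking in a park.", return_tensors="pt")
    model(**batch)

for handle in handles:
    handle.remove()

print({layer: act.shape for layer, act in activations.items()})
```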

The study proposes using explainable AI (XAI) techniques to identify specialized internal representations across distinct modalities: language, vision, and audio.
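
One such technique is a linear probe: a simple classifier trained on frozen hidden activations to test whether a given layer linearly separates the modalities. The sketch below uses scikit-learn with random placeholder features standing in for pooled text-token and image-token activations; the feature dimension, sample counts, and labels are illustrative assumptions, not the project's actual protocol.

```python
# Minimal sketch of a linear modality probe on hidden activations.
# Features are random stand-ins; in practice they would be the pooled
# activations recorded from text tokens vs. image tokens at one layer.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
hidden_dim = 768
text_feats = rng.normal(0.0, 1.0, size=(500, hidden_dim))   # placeholder text activations
image_feats = rng.normal(0.3, 1.0, size=(500, hidden_dim))  # placeholder image activations

X = np.vstack([text_feats, image_feats])
y = np.array([0] * 500 + [1] * 500)  # 0 = language, 1 = vision

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# High held-out accuracy suggests the layer encodes modality-specific information.
print("probe accuracy:", probe.score(X_test, y_test))
```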

Inspired by cognitive neuroscience, the project team will investigate whether transformer architectures exhibit modality-specific computational localization, analogous to the human brain's specialized processing areas (e.g., Broca’s and Wernicke’s areas for language, the occipital cortex for vision, the auditory cortex for sound).
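
One plausible way to quantify such localization, borrowing from the selectivity indices used in neuroscience, is to contrast each hidden unit's mean activation on vision tokens with its mean activation on language tokens. The sketch below is a hypothetical illustration with synthetic statistics; the 0.5 threshold is an arbitrary assumption, not a value taken from the project.

```python
# Minimal sketch of a per-unit "modality selectivity index".
# Inputs are hypothetical mean activation statistics of each hidden unit,
# computed separately over vision tokens and language tokens.
import numpy as np

rng = np.random.default_rng(1)
mean_act_vision = np.abs(rng.normal(1.0, 0.5, size=768))    # placeholder statistics
mean_act_language = np.abs(rng.normal(1.0, 0.5, size=768))  # placeholder statistics

# +1 = responds only to vision tokens, -1 = only to language tokens, 0 = no preference
selectivity = (mean_act_vision - mean_act_language) / (
    mean_act_vision + mean_act_language + 1e-8
)

strongly_selective = np.flatnonzero(np.abs(selectivity) > 0.5)
print(f"{strongly_selective.size} of {selectivity.size} units are strongly modality-selective")
```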

Benchmarking will involve established multimodal datasets such as Flickr-SoundNet, which contains audio-image pairs.
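
A minimal sketch of how such paired data might be iterated over in PyTorch is shown below; the directory layout with matching .jpg/.wav file stems is a hypothetical assumption for illustration, not the official release format of Flickr-SoundNet.

```python
# Minimal sketch of a paired audio-image dataset loader for benchmarking.
# Assumes each image file has an audio file with the same stem next to it.
from pathlib import Path

import torchaudio
from PIL import Image
from torch.utils.data import Dataset

class AudioImagePairs(Dataset):
    def __init__(self, root: str):
        self.image_paths = sorted(Path(root).glob("*.jpg"))

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        audio_path = image_path.with_suffix(".wav")
        image = Image.open(image_path).convert("RGB")
        waveform, sample_rate = torchaudio.load(str(audio_path))
        return image, waveform, sample_rate

# pairs = AudioImagePairs("/path/to/flickr_soundnet")  # hypothetical local copy
```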

The research aims to advance explainability in multimodal LLMs, clarifying how complex reasoning emerges and is structured internally, thereby enhancing the interpretability, transparency, and practical utility of such models.