Invited Talk · Virtual

LatentQA: Teaching LLMs to Decode Activations Into Natural Language

Alexander Pan · Meta
Friday 27 March 2026 · 4:00 PM – 5:00 PM GMT (12:00 PM – 1:00 PM ET)

What is a language model actually representing when it processes text? LatentQA reframes activation interpretation as a QA task: a decoder LLM is trained to answer open-ended questions about the internal representations of a subject model, enabling flexible, scalable probing of beliefs, intentions, and attributes — without fixed concept vocabularies. Alexander will present the method and its implications for interpretability and safety monitoring. Based on ICLR 2026 work with Lijie Chen and Jacob Steinhardt.
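To make the setup concrete, here is a minimal toy sketch of the LatentQA interface described above: a subject model's internal activation is extracted and handed, together with a natural-language question, to a decoder that produces an answer. Everything here is a hypothetical stand-in (the toy models, the `HIDDEN` size, the label set, and all function names are invented for illustration); a real pipeline would hook an intermediate layer of an actual LLM and use a trained decoder LLM.

```python
import numpy as np

HIDDEN = 8  # toy hidden size (hypothetical; real models use thousands of dims)

def subject_model_activations(text: str) -> np.ndarray:
    """Toy 'subject model': maps text deterministically to a hidden vector.
    A real pipeline would capture an intermediate transformer layer's output."""
    seed = sum(ord(c) for c in text)
    return np.random.default_rng(seed).standard_normal(HIDDEN)

def embed_question(question: str) -> np.ndarray:
    """Toy question embedding (hypothetical stand-in for a tokenizer + encoder)."""
    seed = sum(ord(c) for c in question)
    return np.random.default_rng(seed).standard_normal(HIDDEN)

def decoder_answer(activation: np.ndarray, question: str) -> str:
    """Toy 'decoder': consumes the subject model's activation alongside the
    question and emits an answer. Generation is faked with a fixed random
    projection over a small label set; a real decoder is a trained LLM that
    produces open-ended text."""
    labels = ["confident", "uncertain", "deceptive"]
    features = np.concatenate([activation, embed_question(question)])
    w = np.random.default_rng(42).standard_normal((len(labels), features.size))
    return labels[int(np.argmax(w @ features))]

# Ask an open-ended question about the subject model's internal state.
act = subject_model_activations("The capital of France is Paris.")
answer = decoder_answer(act, "Is the model confident in its claim?")
print(answer)
```

The key design point the sketch mirrors is that the question is free-form text rather than a probe tied to a fixed concept vocabulary, so the same decoder can be queried about beliefs, intentions, or attributes without retraining a separate classifier per concept.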

About the Speaker

Alexander Pan is a researcher at Meta working on agentic security and evals. Previously, he led the safety team at xAI and finished his PhD at UC Berkeley, advised by Jacob Steinhardt. He is interested in understanding and mitigating risks from misaligned AI agents.