Interpretability | TSG Lab – Technical Safety & Governance Lab

Feed: https://tsglab.github.io/tag/interpretability/
Last updated: Sun, 01 Mar 2026

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs | Sun, 01 Mar 2026 | https://tsglab.github.io/publication/old-habits-die-hard-conversational-history/
Same Answer, Different Representations: Hidden Instability in VLMs | Sun, 01 Feb 2026 | https://tsglab.github.io/publication/same-answer-different-representations/
The Hitchhiker's Guide to Actionable Interpretability | Thu, 15 Jan 2026 | https://tsglab.github.io/publication/hitchhikers-guide-actionable-interpretability/
Automated Interpretability-Driven Model Auditing and Control: A Research Agenda | Thu, 01 Jan 2026 | https://tsglab.github.io/publication/automated-interpretability-model-auditing/
Interpretability Can Be Actionable | Thu, 01 Jan 2026 | https://tsglab.github.io/publication/interpretability-can-be-actionable/
Quantifying the Effect of Test Set Contamination on Generative Evaluations | Thu, 01 Jan 2026 | https://tsglab.github.io/publication/quantifying-test-set-contamination/
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios | Mon, 01 Dec 2025 | https://tsglab.github.io/publication/context-matters-linear-probing-steering/
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models | Sat, 01 Nov 2025 | https://tsglab.github.io/publication/beyond-linear-steering-multi-attribute-control/
Precise In-Parameter Concept Erasure in Large Language Models | Sat, 01 Nov 2025 | https://tsglab.github.io/publication/precise-concept-erasure-llms/
Query Circuits: Explaining How Language Models Answer User Prompts | Mon, 01 Sep 2025 | https://tsglab.github.io/publication/query-circuits/
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer | Mon, 01 Sep 2025 | https://tsglab.github.io/publication/subliminal-learning-hidden-biases/
Do Sparse Autoencoders Generalize? A Case Study of Answerability | Tue, 01 Jul 2025 | https://tsglab.github.io/publication/sparse-autoencoders-generalize-answerability/
Scaling Sparse Feature Circuit Finding for In-Context Learning | Tue, 01 Jul 2025 | https://tsglab.github.io/publication/scaling-sparse-feature-circuit-finding/
Towards Interpreting Visual Information Processing in Vision-Language Models | Tue, 01 Apr 2025 | https://tsglab.github.io/publication/visual-information-processing-vlms/
Chain-of-Thought Is Not Explainability | Sat, 01 Feb 2025 | https://tsglab.github.io/publication/chain-of-thought-not-explainability/
Interpreting Learned Feedback Patterns in Large Language Models | Sun, 01 Dec 2024 | https://tsglab.github.io/publication/interpreting-feedback-patterns-llms/
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders | Fri, 01 Nov 2024 | https://tsglab.github.io/publication/feature-aligned-sparse-autoencoders/
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions | Fri, 01 Nov 2024 | https://tsglab.github.io/publication/attention-mlp-interactions/
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models | Fri, 01 Nov 2024 | https://tsglab.github.io/publication/interpretable-sequence-continuation/
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders | Tue, 01 Oct 2024 | https://tsglab.github.io/publication/feature-space-universality-sparse-autoencoders/
Mechanistic Interpretability Workshop at ICML 2024 | Mon, 01 Jul 2024 | https://tsglab.github.io/publication/mechanistic-interpretability-workshop-icml-2024/
The Scaling Behavior of Large Language Models | Mon, 01 Jul 2024 | https://tsglab.github.io/publication/scaling-behavior-llms/
Visualizing Neural Network Imagination | Mon, 01 Jul 2024 | https://tsglab.github.io/publication/visualizing-neural-network-imagination/
Understanding Addition in Transformers | Wed, 01 May 2024 | https://tsglab.github.io/publication/understanding-addition-transformers/
Increasing Trust in Language Models Through the Reuse of Verified Circuits | Thu, 01 Feb 2024 | https://tsglab.github.io/publication/verified-circuits-trust/
What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies | Mon, 01 Jan 2024 | https://tsglab.github.io/publication/gpt-mlp-weights-long-range/
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models | Fri, 01 Dec 2023 | https://tsglab.github.io/publication/deepdecipher/
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark | Sat, 01 Jul 2023 | https://tsglab.github.io/publication/detecting-edit-failures-llms/
The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python | Sat, 01 Jul 2023 | https://tsglab.github.io/publication/identifier-swaps-python/
Neuron to Graph: Interpreting Language Model Neurons at Scale | Mon, 01 May 2023 | https://tsglab.github.io/publication/neuron-to-graph/
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small | Sun, 01 Jan 2023 | https://tsglab.github.io/publication/circuit-gendered-pronouns-gpt2/