<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Article | TSG Lab – Technical Safety &amp; Governance Lab</title><link>https://tsglab.github.io/publication-type/article/</link><atom:link href="https://tsglab.github.io/publication-type/article/index.xml" rel="self" type="application/rss+xml"/><description>Article</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Sun, 01 Mar 2026 00:00:00 +0000</lastBuildDate><image><url>https://tsglab.github.io/media/logo.svg</url><title>Article</title><link>https://tsglab.github.io/publication-type/article/</link></image><item><title>AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation</title><link>https://tsglab.github.io/publication/autocontrol-arena/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/autocontrol-arena/</guid><description/></item><item><title>Old Habits Die Hard: How Conversational History Geometrically Traps LLMs</title><link>https://tsglab.github.io/publication/old-habits-die-hard-conversational-history/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/old-habits-die-hard-conversational-history/</guid><description/></item><item><title>Token Taxes: Mitigating AGI's Economic Risks</title><link>https://tsglab.github.io/publication/token-taxes-agi-economic-risks/</link><pubDate>Sun, 01 Mar 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/token-taxes-agi-economic-risks/</guid><description/></item><item><title>Same Answer, Different Representations: Hidden Instability in VLMs</title><link>https://tsglab.github.io/publication/same-answer-different-representations/</link><pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/same-answer-different-representations/</guid><description/></item><item><title>The Hitchhiker's Guide to Actionable Interpretability</title><link>https://tsglab.github.io/publication/hitchhikers-guide-actionable-interpretability/</link><pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/hitchhikers-guide-actionable-interpretability/</guid><description/></item><item><title>Automated Interpretability-Driven Model Auditing and Control: A Research Agenda</title><link>https://tsglab.github.io/publication/automated-interpretability-model-auditing/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/automated-interpretability-model-auditing/</guid><description/></item><item><title>Interpretability Can Be Actionable</title><link>https://tsglab.github.io/publication/interpretability-can-be-actionable/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/interpretability-can-be-actionable/</guid><description/></item><item><title>Quantifying the Effect of Test Set Contamination on Generative Evaluations</title><link>https://tsglab.github.io/publication/quantifying-test-set-contamination/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/quantifying-test-set-contamination/</guid><description/></item><item><title>The Capability Frontier: Benchmarks Miss 82% of Model Performance</title><link>https://tsglab.github.io/publication/capability-frontier-benchmarks/</link><pubDate>Thu, 01 Jan 2026 00:00:00 
+0000</pubDate><guid>https://tsglab.github.io/publication/capability-frontier-benchmarks/</guid><description/></item><item><title>When AI Systems Learn During Deployment, Our Safety Evaluations Break</title><link>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</link><pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</guid><description/></item><item><title>Chain-of-Thought Hijacking</title><link>https://tsglab.github.io/publication/chain-of-thought-hijacking/</link><pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/chain-of-thought-hijacking/</guid><description/></item><item><title>HACK: Hallucinations Along Certainty and Knowledge Axes</title><link>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</link><pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</guid><description/></item><item><title>Val-Bench: Measuring Value Alignment in Language Models</title><link>https://tsglab.github.io/publication/val-bench-value-alignment/</link><pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/val-bench-value-alignment/</guid><description/></item><item><title>Beyond Linear Probes: Dynamic Safety Monitoring for Language Models</title><link>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</guid><description/></item><item><title>Query Circuits: Explaining How Language Models Answer User Prompts</title><link>https://tsglab.github.io/publication/query-circuits/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/query-circuits/</guid><description/></item><item><title>Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer</title><link>https://tsglab.github.io/publication/subliminal-learning-hidden-biases/</link><pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/subliminal-learning-hidden-biases/</guid><description/></item><item><title>Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models</title><link>https://tsglab.github.io/publication/beyond-monoliths-expert-orchestration/</link><pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/beyond-monoliths-expert-orchestration/</guid><description/></item><item><title>The Singapore Consensus on Global AI Safety Research Priorities</title><link>https://tsglab.github.io/publication/singapore-consensus-ai-safety/</link><pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/singapore-consensus-ai-safety/</guid><description/></item><item><title>SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors</title><link>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</link><pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</guid><description/></item><item><title>AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons</title><link>https://tsglab.github.io/publication/ailuminate-mlcommons/</link><pubDate>Sat, 01 Mar 2025 00:00:00 
+0000</pubDate><guid>https://tsglab.github.io/publication/ailuminate-mlcommons/</guid><description/></item><item><title>Chain-of-Thought Is Not Explainability</title><link>https://tsglab.github.io/publication/chain-of-thought-not-explainability/</link><pubDate>Sat, 01 Feb 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/chain-of-thought-not-explainability/</guid><description/></item><item><title>Open Problems in Machine Unlearning for AI Safety</title><link>https://tsglab.github.io/publication/open-problems-machine-unlearning/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/open-problems-machine-unlearning/</guid><description/></item><item><title>Plan B: Training LLMs to Fail Less Severely</title><link>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</guid><description/></item><item><title>Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI</title><link>https://tsglab.github.io/publication/safety-frameworks-standards/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/safety-frameworks-standards/</guid><description/></item><item><title>Toward Resisting AI-Enabled Authoritarianism</title><link>https://tsglab.github.io/publication/resisting-ai-authoritarianism/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/resisting-ai-authoritarianism/</guid><description/></item><item><title>Verification for International AI Governance</title><link>https://tsglab.github.io/publication/verification-international-ai-governance/</link><pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/verification-international-ai-governance/</guid><description/></item><item><title>Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach</title><link>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</link><pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</guid><description/></item><item><title>Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders</title><link>https://tsglab.github.io/publication/feature-aligned-sparse-autoencoders/</link><pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/feature-aligned-sparse-autoencoders/</guid><description/></item><item><title>Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders</title><link>https://tsglab.github.io/publication/feature-space-universality-sparse-autoencoders/</link><pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/feature-space-universality-sparse-autoencoders/</guid><description/></item><item><title>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models</title><link>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</link><pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</guid><description/></item><item><title>Increasing Trust in Language Models Through the Reuse of Verified 
Circuits</title><link>https://tsglab.github.io/publication/verified-circuits-trust/</link><pubDate>Thu, 01 Feb 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/verified-circuits-trust/</guid><description/></item><item><title>Safeguarding AI in Finance: Lessons for Regulated Industries</title><link>https://tsglab.github.io/publication/safeguarding-ai-finance/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/safeguarding-ai-finance/</guid><description/></item><item><title>Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training</title><link>https://tsglab.github.io/publication/sleeper-agents/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/sleeper-agents/</guid><description/></item><item><title>What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies</title><link>https://tsglab.github.io/publication/gpt-mlp-weights-long-range/</link><pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/gpt-mlp-weights-long-range/</guid><description/></item><item><title>AI Systems of Concern</title><link>https://tsglab.github.io/publication/ai-systems-of-concern/</link><pubDate>Sun, 01 Oct 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/ai-systems-of-concern/</guid><description/></item><item><title>Fairness in AI and Its Long-Term Implications on Society</title><link>https://tsglab.github.io/publication/fairness-ai-long-term-implications/</link><pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/fairness-ai-long-term-implications/</guid><description/></item><item><title>Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small</title><link>https://tsglab.github.io/publication/circuit-gendered-pronouns-gpt2/</link><pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/circuit-gendered-pronouns-gpt2/</guid><description/></item></channel></rss>