<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Et Al. | TSG Lab – Technical Safety &amp; Governance Lab</title>
    <link>https://tsglab.github.io/author/et-al./</link>
    <atom:link href="https://tsglab.github.io/author/et-al./index.xml" rel="self" type="application/rss+xml"/>
    <description>Et Al.</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator>
    <language>en-us</language>
    <lastBuildDate>Sun, 01 Feb 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tsglab.github.io/media/logo.svg</url>
      <title>Et Al.</title>
      <link>https://tsglab.github.io/author/et-al./</link>
    </image>
    <item>
      <title>Same Answer, Different Representations: Hidden Instability in VLMs</title>
      <link>https://tsglab.github.io/publication/same-answer-different-representations/</link>
      <pubDate>Sun, 01 Feb 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/same-answer-different-representations/</guid>
      <description/>
    </item>
    <item>
      <title>The Hitchhiker's Guide to Actionable Interpretability</title>
      <link>https://tsglab.github.io/publication/hitchhikers-guide-actionable-interpretability/</link>
      <pubDate>Thu, 15 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/hitchhikers-guide-actionable-interpretability/</guid>
      <description/>
    </item>
    <item>
      <title>Agentic Product Maturity Ladder V0.1</title>
      <link>https://tsglab.github.io/publication/agentic-product-maturity-ladder/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/agentic-product-maturity-ladder/</guid>
      <description/>
    </item>
    <item>
      <title>Interpretability Can Be Actionable</title>
      <link>https://tsglab.github.io/publication/interpretability-can-be-actionable/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/interpretability-can-be-actionable/</guid>
      <description/>
    </item>
    <item>
      <title>Quantifying the Effect of Test Set Contamination on Generative Evaluations</title>
      <link>https://tsglab.github.io/publication/quantifying-test-set-contamination/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/quantifying-test-set-contamination/</guid>
      <description/>
    </item>
    <item>
      <title>The Capability Frontier: Benchmarks Miss 82% of Model Performance</title>
      <link>https://tsglab.github.io/publication/capability-frontier-benchmarks/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/capability-frontier-benchmarks/</guid>
      <description/>
    </item>
    <item>
      <title>Establishing Best Practices for Building Rigorous Agentic Benchmarks</title>
      <link>https://tsglab.github.io/publication/agentic-benchmarks-best-practices/</link>
      <pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/agentic-benchmarks-best-practices/</guid>
      <description/>
    </item>
    <item>
      <title>Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value</title>
      <link>https://tsglab.github.io/publication/full-stack-alignment-institutions/</link>
      <pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/full-stack-alignment-institutions/</guid>
      <description/>
    </item>
    <item>
      <title>HACK: Hallucinations Along Certainty and Knowledge Axes</title>
      <link>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</guid>
      <description/>
    </item>
    <item>
      <title>Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models</title>
      <link>https://tsglab.github.io/publication/beyond-monoliths-expert-orchestration/</link>
      <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/beyond-monoliths-expert-orchestration/</guid>
      <description/>
    </item>
    <item>
      <title>In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?</title>
      <link>https://tsglab.github.io/publication/geopolitical-rivals-ai-safety-cooperation/</link>
      <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/geopolitical-rivals-ai-safety-cooperation/</guid>
      <description/>
    </item>
    <item>
      <title>The Singapore Consensus on Global AI Safety Research Priorities</title>
      <link>https://tsglab.github.io/publication/singapore-consensus-ai-safety/</link>
      <pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/singapore-consensus-ai-safety/</guid>
      <description/>
    </item>
    <item>
      <title>AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons</title>
      <link>https://tsglab.github.io/publication/ailuminate-mlcommons/</link>
      <pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/ailuminate-mlcommons/</guid>
      <description/>
    </item>
    <item>
      <title>Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI</title>
      <link>https://tsglab.github.io/publication/safety-frameworks-standards/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/safety-frameworks-standards/</guid>
      <description/>
    </item>
    <item>
      <title>Verification for International AI Governance</title>
      <link>https://tsglab.github.io/publication/verification-international-ai-governance/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/verification-international-ai-governance/</guid>
      <description/>
    </item>
    <item>
      <title>Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach</title>
      <link>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</guid>
      <description/>
    </item>
    <item>
      <title>Mechanistic Interpretability Workshop at ICML 2024</title>
      <link>https://tsglab.github.io/publication/mechanistic-interpretability-workshop-icml-2024/</link>
      <pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/mechanistic-interpretability-workshop-icml-2024/</guid>
      <description/>
    </item>
    <item>
      <title>Position: Near to Mid-Term Risks and Opportunities of Open-Source Generative AI</title>
      <link>https://tsglab.github.io/publication/open-source-generative-ai-risks/</link>
      <pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/open-source-generative-ai-risks/</guid>
      <description/>
    </item>
    <item>
      <title>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models</title>
      <link>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</link>
      <pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</guid>
      <description/>
    </item>
    <item>
      <title>Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training</title>
      <link>https://tsglab.github.io/publication/sleeper-agents/</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sleeper-agents/</guid>
      <description/>
    </item>
    <item>
      <title>The Alan Turing Institute's Response to the House of Lords Large Language Models Call for Evidence</title>
      <link>https://tsglab.github.io/publication/turing-institute-lords-llm-evidence/</link>
      <pubDate>Sun, 01 Jan 2023 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/turing-institute-lords-llm-evidence/</guid>
      <description/>
    </item>
  </channel>
</rss>