<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Safety &amp; Alignment | TSG Lab – Technical Safety &amp; Governance Lab</title>
    <link>https://tsglab.github.io/tag/safety-alignment/</link>
    <atom:link href="https://tsglab.github.io/tag/safety-alignment/index.xml" rel="self" type="application/rss+xml"/>
    <description>Safety &amp; Alignment</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 01 Jan 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tsglab.github.io/media/logo.svg</url>
      <title>Safety &amp; Alignment</title>
      <link>https://tsglab.github.io/tag/safety-alignment/</link>
    </image>
    <item>
      <title>When AI Systems Learn During Deployment, Our Safety Evaluations Break</title>
      <link>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</guid>
      <description/>
    </item>
    <item>
      <title>Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness</title>
      <link>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</guid>
      <description/>
    </item>
    <item>
      <title>Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs</title>
      <link>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</guid>
      <description/>
    </item>
    <item>
      <title>Chain-of-Thought Hijacking</title>
      <link>https://tsglab.github.io/publication/chain-of-thought-hijacking/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/chain-of-thought-hijacking/</guid>
      <description/>
    </item>
    <item>
      <title>HACK: Hallucinations Along Certainty and Knowledge Axes</title>
      <link>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</guid>
      <description/>
    </item>
    <item>
      <title>Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective</title>
      <link>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</guid>
      <description/>
    </item>
    <item>
      <title>Val-Bench: Measuring Value Alignment in Language Models</title>
      <link>https://tsglab.github.io/publication/val-bench-value-alignment/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/val-bench-value-alignment/</guid>
      <description/>
    </item>
    <item>
      <title>Beyond Linear Probes: Dynamic Safety Monitoring for Language Models</title>
      <link>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</guid>
      <description/>
    </item>
    <item>
      <title>PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning</title>
      <link>https://tsglab.github.io/publication/poisonbench/</link>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/poisonbench/</guid>
      <description/>
    </item>
    <item>
      <title>SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors</title>
      <link>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</link>
      <pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</guid>
      <description/>
    </item>
    <item>
      <title>AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons</title>
      <link>https://tsglab.github.io/publication/ailuminate-mlcommons/</link>
      <pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/ailuminate-mlcommons/</guid>
      <description/>
    </item>
    <item>
      <title>Open Problems in Machine Unlearning for AI Safety</title>
      <link>https://tsglab.github.io/publication/open-problems-machine-unlearning/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/open-problems-machine-unlearning/</guid>
      <description/>
    </item>
    <item>
      <title>Plan B: Training LLMs to Fail Less Severely</title>
      <link>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</guid>
      <description/>
    </item>
    <item>
      <title>Best-of-N Jailbreaking</title>
      <link>https://tsglab.github.io/publication/best-of-n-jailbreaking/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/best-of-n-jailbreaking/</guid>
      <description/>
    </item>
    <item>
      <title>Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach</title>
      <link>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</guid>
      <description/>
    </item>
    <item>
      <title>Large Language Models Relearn Removed Concepts</title>
      <link>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</link>
      <pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</guid>
      <description/>
    </item>
    <item>
      <title>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models</title>
      <link>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</link>
      <pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</guid>
      <description/>
    </item>
    <item>
      <title>Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training</title>
      <link>https://tsglab.github.io/publication/sleeper-agents/</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sleeper-agents/</guid>
      <description/>
    </item>
    <item>
      <title>Measuring Value Alignment</title>
      <link>https://tsglab.github.io/publication/measuring-value-alignment/</link>
      <pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/measuring-value-alignment/</guid>
      <description/>
    </item>
    <item>
      <title>System III: Learning with Domain Knowledge for Safety Constraints</title>
      <link>https://tsglab.github.io/publication/system-iii-safety-constraints/</link>
      <pubDate>Thu, 01 Dec 2022 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/system-iii-safety-constraints/</guid>
      <description/>
    </item>
  </channel>
</rss>