<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Safety &amp; Alignment | TSG Lab – Technical Safety &amp; Governance Lab</title>
    <link>https://tsglab.github.io/tag/safety-alignment/</link>
    <atom:link href="https://tsglab.github.io/tag/safety-alignment/index.xml" rel="self" type="application/rss+xml"/>
    <description>Safety &amp; Alignment</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator>
    <language>en-us</language>
    <lastBuildDate>Thu, 01 Jan 2026 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tsglab.github.io/media/logo.svg</url>
      <title>Safety &amp; Alignment</title>
      <link>https://tsglab.github.io/tag/safety-alignment/</link>
    </image>
    <item>
      <title>When AI Systems Learn During Deployment, Our Safety Evaluations Break</title>
      <link>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</link>
      <pubDate>Thu, 01 Jan 2026 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/safety-evaluations-break-deployment/</guid>
      <description/>
    </item>
    <item>
      <title>Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness</title>
      <link>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</guid>
      <description/>
    </item>
    <item>
      <title>Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs</title>
      <link>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</link>
      <pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</guid>
      <description/>
    </item>
    <item>
      <title>Chain-of-Thought Hijacking</title>
      <link>https://tsglab.github.io/publication/chain-of-thought-hijacking/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/chain-of-thought-hijacking/</guid>
      <description/>
    </item>
    <item>
      <title>HACK: Hallucinations Along Certainty and Knowledge Axes</title>
      <link>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/hack-hallucinations-certainty-knowledge/</guid>
      <description/>
    </item>
    <item>
      <title>Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective</title>
      <link>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</guid>
      <description/>
    </item>
    <item>
      <title>Val-Bench: Measuring Value Alignment in Language Models</title>
      <link>https://tsglab.github.io/publication/val-bench-value-alignment/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/val-bench-value-alignment/</guid>
      <description/>
    </item>
    <item>
      <title>Beyond Linear Probes: Dynamic Safety Monitoring for Language Models</title>
      <link>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</link>
      <pubDate>Mon, 01 Sep 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/dynamic-safety-monitoring-linear-probes/</guid>
      <description/>
    </item>
    <item>
      <title>PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning</title>
      <link>https://tsglab.github.io/publication/poisonbench/</link>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/poisonbench/</guid>
      <description/>
    </item>
    <item>
      <title>SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors</title>
      <link>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</link>
      <pubDate>Thu, 01 May 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/safetynet-deceptive-behaviors/</guid>
      <description/>
    </item>
    <item>
      <title>AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons</title>
      <link>https://tsglab.github.io/publication/ailuminate-mlcommons/</link>
      <pubDate>Sat, 01 Mar 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/ailuminate-mlcommons/</guid>
      <description/>
    </item>
    <item>
      <title>Open Problems in Machine Unlearning for AI Safety</title>
      <link>https://tsglab.github.io/publication/open-problems-machine-unlearning/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/open-problems-machine-unlearning/</guid>
      <description/>
    </item>
    <item>
      <title>Plan B: Training LLMs to Fail Less Severely</title>
      <link>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/plan-b-llms-fail-less-severely/</guid>
      <description/>
    </item>
    <item>
      <title>Best-of-N Jailbreaking</title>
      <link>https://tsglab.github.io/publication/best-of-n-jailbreaking/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/best-of-n-jailbreaking/</guid>
      <description/>
    </item>
    <item>
      <title>Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach</title>
      <link>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/jailbreak-defense-narrow-domain/</guid>
      <description/>
    </item>
    <item>
      <title>Large Language Models Relearn Removed Concepts</title>
      <link>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</link>
      <pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</guid>
      <description/>
    </item>
    <item>
      <title>Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models</title>
      <link>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</link>
      <pubDate>Sat, 01 Jun 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sycophancy-to-subterfuge/</guid>
      <description/>
    </item>
    <item>
      <title>Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training</title>
      <link>https://tsglab.github.io/publication/sleeper-agents/</link>
      <pubDate>Mon, 01 Jan 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/sleeper-agents/</guid>
      <description/>
    </item>
    <item>
      <title>Measuring Value Alignment</title>
      <link>https://tsglab.github.io/publication/measuring-value-alignment/</link>
      <pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/measuring-value-alignment/</guid>
      <description/>
    </item>
    <item>
      <title>System III: Learning with Domain Knowledge for Safety Constraints</title>
      <link>https://tsglab.github.io/publication/system-iii-safety-constraints/</link>
      <pubDate>Thu, 01 Dec 2022 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/system-iii-safety-constraints/</guid>
      <description/>
    </item>
  </channel>
</rss>