<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom"><channel><title>Paper-Conference | TSG Lab – Technical Safety &amp; Governance Lab</title><link>https://tsglab.github.io/publication-type/paper-conference/</link><atom:link href="https://tsglab.github.io/publication-type/paper-conference/index.xml" rel="self" type="application/rss+xml"/><description>Paper-Conference</description><generator>Hugo Blox Builder (https://hugoblox.com)</generator><language>en-us</language><lastBuildDate>Mon, 01 Dec 2025 00:00:00 +0000</lastBuildDate><image><url>https://tsglab.github.io/media/logo.svg</url><title>Paper-Conference</title><link>https://tsglab.github.io/publication-type/paper-conference/</link></image><item><title>Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios</title><link>https://tsglab.github.io/publication/context-matters-linear-probing-steering/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/context-matters-linear-probing-steering/</guid><description/></item><item><title>Emerging Risks from Embodied AI Require Urgent Policy Action</title><link>https://tsglab.github.io/publication/emerging-risks-embodied-ai/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/emerging-risks-embodied-ai/</guid><description/></item><item><title>Establishing Best Practices for Building Rigorous Agentic Benchmarks</title><link>https://tsglab.github.io/publication/agentic-benchmarks-best-practices/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/agentic-benchmarks-best-practices/</guid><description/></item><item><title>Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value</title><link>https://tsglab.github.io/publication/full-stack-alignment-institutions/</link><pubDate>Mon, 01 Dec 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/full-stack-alignment-institutions/</guid><description/></item><item><title>Beyond Linear Steering: Unified Multi-Attribute Control for Language Models</title><link>https://tsglab.github.io/publication/beyond-linear-steering-multi-attribute-control/</link><pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/beyond-linear-steering-multi-attribute-control/</guid><description/></item><item><title>Precise In-Parameter Concept Erasure in Large Language Models</title><link>https://tsglab.github.io/publication/precise-concept-erasure-llms/</link><pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/precise-concept-erasure-llms/</guid><description/></item><item><title>Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness</title><link>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</link><pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/latent-adversarial-prompt-robustness/</guid><description/></item><item><title>Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs</title><link>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</link><pubDate>Sat, 01 Nov 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/trust-me-im-wrong-hallucinations/</guid><description/></item><item><title>Rethinking Safety in LLM Fine-Tuning: An Optimization 
Perspective</title><link>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</link><pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</guid><description/></item><item><title>Do Sparse Autoencoders Generalize? A Case Study of Answerability</title><link>https://tsglab.github.io/publication/sparse-autoencoders-generalize-answerability/</link><pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/sparse-autoencoders-generalize-answerability/</guid><description/></item><item><title>PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning</title><link>https://tsglab.github.io/publication/poisonbench/</link><pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/poisonbench/</guid><description/></item><item><title>Scaling Sparse Feature Circuit Finding for In-Context Learning</title><link>https://tsglab.github.io/publication/scaling-sparse-feature-circuit-finding/</link><pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/scaling-sparse-feature-circuit-finding/</guid><description/></item><item><title>In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?</title><link>https://tsglab.github.io/publication/geopolitical-rivals-ai-safety-cooperation/</link><pubDate>Sun, 01 Jun 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/geopolitical-rivals-ai-safety-cooperation/</guid><description/></item><item><title>Rethinking AI Cultural Alignment</title><link>https://tsglab.github.io/publication/rethinking-ai-cultural-alignment/</link><pubDate>Tue, 01 Apr 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/rethinking-ai-cultural-alignment/</guid><description/></item><item><title>Towards Interpreting Visual Information Processing in Vision-Language Models</title><link>https://tsglab.github.io/publication/visual-information-processing-vlms/</link><pubDate>Tue, 01 Apr 2025 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/visual-information-processing-vlms/</guid><description/></item><item><title>Best-of-N Jailbreaking</title><link>https://tsglab.github.io/publication/best-of-n-jailbreaking/</link><pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/best-of-n-jailbreaking/</guid><description/></item><item><title>Interpreting Learned Feedback Patterns in Large Language Models</title><link>https://tsglab.github.io/publication/interpreting-feedback-patterns-llms/</link><pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/interpreting-feedback-patterns-llms/</guid><description/></item><item><title>Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions</title><link>https://tsglab.github.io/publication/attention-mlp-interactions/</link><pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/attention-mlp-interactions/</guid><description/></item><item><title>Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models</title><link>https://tsglab.github.io/publication/interpretable-sequence-continuation/</link><pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/interpretable-sequence-continuation/</guid><description/></item><item><title>Large Language Models Relearn Removed 
Concepts</title><link>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</link><pubDate>Thu, 01 Aug 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/llms-relearn-removed-concepts/</guid><description/></item><item><title>Mechanistic Interpretability Workshop at ICML 2024</title><link>https://tsglab.github.io/publication/mechanistic-interpretability-workshop-icml-2024/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/mechanistic-interpretability-workshop-icml-2024/</guid><description/></item><item><title>Position: Near to Mid-Term Risks and Opportunities of Open-Source Generative AI</title><link>https://tsglab.github.io/publication/open-source-generative-ai-risks/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/open-source-generative-ai-risks/</guid><description/></item><item><title>The Scaling Behavior of Large Language Models</title><link>https://tsglab.github.io/publication/scaling-behavior-llms/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/scaling-behavior-llms/</guid><description/></item><item><title>Visualizing Neural Network Imagination</title><link>https://tsglab.github.io/publication/visualizing-neural-network-imagination/</link><pubDate>Mon, 01 Jul 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/visualizing-neural-network-imagination/</guid><description/></item><item><title>Understanding Addition in Transformers</title><link>https://tsglab.github.io/publication/understanding-addition-transformers/</link><pubDate>Wed, 01 May 2024 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/understanding-addition-transformers/</guid><description/></item><item><title>DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models</title><link>https://tsglab.github.io/publication/deepdecipher/</link><pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/deepdecipher/</guid><description/></item><item><title>Measuring Value Alignment</title><link>https://tsglab.github.io/publication/measuring-value-alignment/</link><pubDate>Fri, 01 Dec 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/measuring-value-alignment/</guid><description/></item><item><title>Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark</title><link>https://tsglab.github.io/publication/detecting-edit-failures-llms/</link><pubDate>Sat, 01 Jul 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/detecting-edit-failures-llms/</guid><description/></item><item><title>The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python</title><link>https://tsglab.github.io/publication/identifier-swaps-python/</link><pubDate>Sat, 01 Jul 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/identifier-swaps-python/</guid><description/></item><item><title>Neuron to Graph: Interpreting Language Model Neurons at Scale</title><link>https://tsglab.github.io/publication/neuron-to-graph/</link><pubDate>Mon, 01 May 2023 00:00:00 +0000</pubDate><guid>https://tsglab.github.io/publication/neuron-to-graph/</guid><description/></item><item><title>System III: Learning with Domain Knowledge for Safety Constraints</title><link>https://tsglab.github.io/publication/system-iii-safety-constraints/</link><pubDate>Thu, 01 Dec 2022 00:00:00 
+0000</pubDate><guid>https://tsglab.github.io/publication/system-iii-safety-constraints/</guid><description/></item></channel></rss>