<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>D. Krueger | TSG Lab – Technical Safety &amp; Governance Lab</title>
    <link>https://tsglab.github.io/author/d.-krueger/</link>
    <atom:link href="https://tsglab.github.io/author/d.-krueger/index.xml" rel="self" type="application/rss+xml"/>
    <description>D. Krueger</description>
    <generator>Hugo Blox Builder (https://hugoblox.com)</generator>
    <language>en-us</language>
    <lastBuildDate>Wed, 01 Oct 2025 00:00:00 +0000</lastBuildDate>
    <image>
      <url>https://tsglab.github.io/media/logo.svg</url>
      <title>D. Krueger</title>
      <link>https://tsglab.github.io/author/d.-krueger/</link>
    </image>
    <item>
      <title>Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective</title>
      <link>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</link>
      <pubDate>Wed, 01 Oct 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/rethinking-safety-llm-finetuning/</guid>
      <description/>
    </item>
    <item>
      <title>PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning</title>
      <link>https://tsglab.github.io/publication/poisonbench/</link>
      <pubDate>Tue, 01 Jul 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/poisonbench/</guid>
      <description/>
    </item>
    <item>
      <title>Towards Interpreting Visual Information Processing in Vision-Language Models</title>
      <link>https://tsglab.github.io/publication/visual-information-processing-vlms/</link>
      <pubDate>Tue, 01 Apr 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/visual-information-processing-vlms/</guid>
      <description/>
    </item>
    <item>
      <title>Open Problems in Machine Unlearning for AI Safety</title>
      <link>https://tsglab.github.io/publication/open-problems-machine-unlearning/</link>
      <pubDate>Wed, 01 Jan 2025 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/open-problems-machine-unlearning/</guid>
      <description/>
    </item>
    <item>
      <title>Interpreting Learned Feedback Patterns in Large Language Models</title>
      <link>https://tsglab.github.io/publication/interpreting-feedback-patterns-llms/</link>
      <pubDate>Sun, 01 Dec 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/interpreting-feedback-patterns-llms/</guid>
      <description/>
    </item>
    <item>
      <title>Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders</title>
      <link>https://tsglab.github.io/publication/feature-aligned-sparse-autoencoders/</link>
      <pubDate>Fri, 01 Nov 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/feature-aligned-sparse-autoencoders/</guid>
      <description/>
    </item>
    <item>
      <title>Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders</title>
      <link>https://tsglab.github.io/publication/feature-space-universality-sparse-autoencoders/</link>
      <pubDate>Tue, 01 Oct 2024 00:00:00 +0000</pubDate>
      <guid>https://tsglab.github.io/publication/feature-space-universality-sparse-autoencoders/</guid>
      <description/>
    </item>
  </channel>
</rss>