Dr Fazl Barez

Principal Investigator

University of Oxford

Department of Engineering Science

Fazl Barez is a Principal Investigator at the Technical Safety & Governance Lab (TSG), Department of Engineering Science, University of Oxford. His research focuses on understanding how frontier AI systems work internally through interpretability, developing tools for effective AI governance, and analyzing the societal impact of advanced AI.

Interests

Interpretability
AI Safety
AI Governance
Frontier AI Systems

Publications

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation (Preprint)
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs (Preprint)
Token Taxes: Mitigating AGI's Economic Risks (Preprint)
Same Answer, Different Representations: Hidden Instability in VLMs (Preprint)
The Hitchhiker's Guide to Actionable Interpretability (Preprint)
Agentic Product Maturity Ladder V0.1 (Policy)
Automated Interpretability-Driven Model Auditing and Control: A Research Agenda (Preprint)
Interpretability Can Be Actionable (Preprint)
Quantifying the Effect of Test Set Contamination on Generative Evaluations (Preprint)
When AI Systems Learn During Deployment, Our Safety Evaluations Break (Preprint)
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios (Workshop)
Emerging Risks from Embodied AI Require Urgent Policy Action (Conference)
Establishing Best Practices for Building Rigorous Agentic Benchmarks (Conference)
Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value (Workshop)
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models (Conference)
Precise In-Parameter Concept Erasure in Large Language Models (Conference)
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness (Conference)
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs (Conference)
Chain-of-Thought Hijacking (Preprint)
HACK: Hallucinations Along Certainty and Knowledge Axes (Preprint)
Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective (Conference)
Val-Bench: Measuring Value Alignment in Language Models (Preprint)
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models (Preprint)
Query Circuits: Explaining How Language Models Answer User Prompts (Preprint)
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer (Preprint)
Do Sparse Autoencoders Generalize? A Case Study of Answerability (Workshop)
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning (Conference)
Scaling Sparse Feature Circuit Finding for In-Context Learning (Conference)
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models (Preprint)
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate? (Conference)
The Singapore Consensus on Global AI Safety Research Priorities (Preprint)
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors (Preprint)
Rethinking AI Cultural Alignment (Workshop)
Towards Interpreting Visual Information Processing in Vision-Language Models (Conference)
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons (Preprint)
Chain-of-Thought Is Not Explainability (Preprint)
Open Problems in Machine Unlearning for AI Safety (Preprint)
Plan B: Training LLMs to Fail Less Severely (Preprint)
Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI (Preprint)
Toward Resisting AI-Enabled Authoritarianism (Preprint)
Best-of-N Jailbreaking (Conference)
Interpreting Learned Feedback Patterns in Large Language Models (Conference)
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach (Preprint)
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders (Preprint)
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions (Conference)
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models (Conference)
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders (Preprint)
Large Language Models Relearn Removed Concepts (Conference)
Mechanistic Interpretability Workshop at ICML 2024 (Workshop)
Position: Near to Mid-Term Risks and Opportunities of Open-Source Generative AI (Conference)
The Scaling Behavior of Large Language Models (Workshop)
Visualizing Neural Network Imagination (Workshop)
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models (Preprint)
Understanding Addition in Transformers (Conference)
Increasing Trust in Language Models Through the Reuse of Verified Circuits (Preprint)
Safeguarding AI in Finance: Lessons for Regulated Industries (Policy)
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training (Preprint)
What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies (Preprint)
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models (Workshop)
Measuring Value Alignment (Workshop)
AI Systems of Concern (Preprint)
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark (Conference)
The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python (Conference)
Neuron to Graph: Interpreting Language Model Neurons at Scale (Workshop)
Fairness in AI and Its Long-Term Implications on Society (Journal)
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small (Preprint)
The Alan Turing Institute's Response to the House of Lords Large Language Models Call for Evidence (Policy)
System III: Learning with Domain Knowledge for Safety Constraints (Workshop)
