Preprint
Token Taxes: Mitigating AGI's Economic Risks
L. Irwin, T.-Y. Wu, F. Barez
Preprint
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
A. Simhi, F. Barez, M. Tutek, Y. Belinkov, S. B. Cohen
Preprint
AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
C. Li, P. Lu, X. Pan, F. Barez, M. Yang
Preprint
Same Answer, Different Representations: Hidden Instability in VLMs
F. A. Wani, A. Suglia, R. Saxena, A. P. Gema, W. C. Kwan, F. Barez, et al.
Preprint
The Hitchhiker's Guide to Actionable Interpretability
H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.
Preprint
When AI Systems Learn During Deployment, Our Safety Evaluations Break
F. Barez
Preprint
The Capability Frontier: Benchmarks Miss 82% of Model Performance
B. Fowler, R. Smith, D. T. Graviet, W. Myers, J. Greaves, N. F. Oozeer, A. García, et al.
Preprint
Quantifying the Effect of Test Set Contamination on Generative Evaluations
R. Schaeffer, J. Kazdan, B. Abbasi, K. Z. Liu, B. Miranda, A. Ahmed, F. Barez, et al.
Preprint
Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
F. Barez
MLCommons
Agentic Product Maturity Ladder V0.1
S. McGregor, D. Nathani, L. Saouma, F. Barez, A. Foundjem, et al.
NeurIPS 2025 WS
Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value
R. Lowe, J. Edelman, T. Zhi-Xuan, O. Klingefjord, E. Hain, V. Wang, A. Sarkar, F. Barez, et al.
NeurIPS 2025
Establishing Best Practices for Building Rigorous Agentic Benchmarks
Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, F. Barez, et al.
NeurIPS 2025
Emerging Risks from Embodied AI Require Urgent Policy Action
J. Perlo, A. Robey, F. Barez, J. Mökander
NeurIPS 2025 WS
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
I. Agarwal, S. Navani, F. Barez
EMNLP 2025
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, Y. Belinkov
EMNLP 2025
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
T. Fu, F. Barez
EMNLP 2025
Precise In-Parameter Concept Erasure in Large Language Models
Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva
EMNLP 2025
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
N. Oozeer, L. Marks, F. Barez, A. Abdullah
Preprint
Val-Bench: Measuring Value Alignment in Language Models
A. Gupta, D. O'Shea, F. Barez
COLM 2025
Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective
M. Kim, J. M. Kwak, L. Alssum, B. Ghanem, P. Torr, D. Krueger, F. Barez†, A. Bibi†
Preprint
HACK: Hallucinations Along Certainty and Knowledge Axes
A. Simhi, J. Herzig, I. Itzhak, D. Arad, Z. Gekhman, R. Reichart, F. Barez, et al.
Preprint
Chain-of-Thought Hijacking
J. Zhao, T. Fu, R. Schaeffer, M. Sharma, F. Barez
Preprint
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
S. Schrodi, E. Kempf, F. Barez, T. Brox
Preprint
Query Circuits: Explaining How Language Models Answer User Prompts
T.-Y. Wu, F. Barez
Preprint
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
J. Oldfield, P. Torr, I. Patras, A. Bibi, F. Barez
ICML 2025
Scaling Sparse Feature Circuit Finding for In-Context Learning
D. Kharlapenko, S. Shabalin, F. Barez, A. Conmy, N. Nanda
ICML 2025
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
T. Fu*, M. Sharma, P. Torr, S. B. Cohen, D. Krueger, F. Barez*
ICML 2025 WS
Do Sparse Autoencoders Generalize? A Case Study of Answerability
L. Heindrich, P. Torr, F. Barez, V. Thost
Preprint
The Singapore Consensus on Global AI Safety Research Priorities
Y. Bengio, T. Maharaj, L. Ong, S. Russell, D. Song, M. Tegmark, L. Xue, F. Barez, et al.
FAccT 2025
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
B. Bucknall, S. Siddiqui, L. Thurnherr, C. McGurk, B. Harack, A. Reuel, F. Barez, et al.
Preprint
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
P. Quirke, N. Oozeer, C. Bandi, A. Abdullah, J. Hoelscher-Obermaier, F. Barez, et al.
Preprint
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
M. Chaudhary, F. Barez
ICLR 2025
Towards Interpreting Visual Information Processing in Vision-Language Models
C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, F. Barez
ICLR 2025 WS
Rethinking AI Cultural Alignment
M. Bravansky, F. Trhlík, F. Barez
Preprint
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al.
Preprint
Chain-of-Thought Is Not Explainability
F. Barez, T.-Y. Wu, I. Arcuschin, M. Lan, V. Wang, N. Siegel, N. Collignon, C. Neo, I. Lee, A. Paren, A. Bibi, R. Trager, D. Fornasiere, J. Yan, Y. Elazar, Y. Bengio
OGI Oxford
Verification for International AI Governance
B. Harack, R. Trager, A. Reuel, D. Manheim, M. Brundage, O. Aarne, et al.
Preprint
Toward Resisting AI-Enabled Authoritarianism
F. Barez, I. Friend, K. Reid, I. Krawczuk, V. Wang, J. Mökander, P. Torr, J. Morse, R. Trager
Preprint
Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI
M. Ziosi, J. Gealy, M. Plueckebaum, D. Kossack, S. Campos, L. Saouma, F. Barez, et al.
Preprint
Plan B: Training LLMs to Fail Less Severely
J. Stastny, N. Warncke, D. Xu, A. Lynch, F. Barez, H. Sleight, E. Perez
Preprint
Open Problems in Machine Unlearning for AI Safety
F. Barez, T. Fu, A. Prabhu, S. Casper, A. Sanyal, A. Bibi, A. O'Gara, R. Kirk, B. Bucknall, T. Fist, L. Ong, P. Torr, K.-Y. Lam, R. Trager, D. Krueger, S. Mindermann, J. Hernández-Orallo, M. Geva, Y. Gal
Preprint
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang, J. Hughes, H. Sleight, R. Schaeffer, R. Agrawal, F. Barez, et al.
NeurIPS 2024
Interpreting Learned Feedback Patterns in Large Language Models
L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*
NeurIPS 2025
Best-of-N Jailbreaking
J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez
EMNLP 2024
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
M. Lan, P. Torr, F. Barez
EMNLP 2024
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
C. Neo*, S. B. Cohen, F. Barez*
Preprint
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
L. Marks, A. Paren, D. Krueger, F. Barez
Preprint
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
M. Lan, P. Torr, A. Meek, D. Krueger, F. Barez
ACL 2024
Large Language Models Relearn Removed Concepts
M. Lo*, S. B. Cohen, F. Barez*
ICML 2024 WS
Visualizing Neural Network Imagination
N. Wichers, V. Tao, R. Volpato, F. Barez
WS 2024
The Scaling Behavior of Large Language Models
A. V. Miceli-Barone, F. Barez, S. B. Cohen, E. Voita, U. Germann, M. Lukasik
ICML 2024
Position: Near to Mid-Term Risks and Opportunities of Open-Source Generative AI
F. Eiras, A. Petrov, B. Vidgen, C. S. de Witt, F. Pizzati, K. Elkins, F. Barez, et al.
ICML 2024
Mechanistic Interpretability Workshop at ICML 2024
F. Barez, M. Geva, L. Chan, A. Geiger, K. Yin, N. Nanda, et al.
Preprint
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, et al.
ICLR 2024
Understanding Addition in Transformers
P. Quirke, F. Barez
Preprint
Increasing Trust in Language Models Through the Reuse of Verified Circuits
P. Quirke, C. Neo, F. Barez
Preprint
What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies
T. Clark, S. B. Cohen, F. Barez
Preprint
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, F. Barez, et al.
SSRN
Safeguarding AI in Finance: Lessons for Regulated Industries
F. Barez, L. Marks
NeurIPS 2023 WS
Measuring Value Alignment
F. Barez, P. Torr
NeurIPS 2023 WS
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
A. Garde, E. Kran, F. Barez
Preprint
AI Systems of Concern
K. Matteucci, S. Avin, F. Barez, S. Ó hÉigeartaigh
ACL 2023
The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python
A. V. Miceli-Barone*, F. Barez*, I. Konstas, S. B. Cohen
ACL 2023
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark
J. Hoelscher-Obermaier*, J. Persson*, E. Kran, I. Konstas, F. Barez*
ICLR 2023 WS
Neuron to Graph: Interpreting Language Model Neurons at Scale
A. Foote*, N. Nanda, E. Kran, I. Konstas, S. B. Cohen, F. Barez*
Policy
The Alan Turing Institute's Response to the House of Lords Large Language Models Call for Evidence
F. Barez, P. Torr, A. Petrov, C. Ashurst, J. Ding, A. Janjeva, A. Babuta, et al.
Preprint
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
C. Mathwin, G. Corlouer, E. Kran, F. Barez, N. Nanda
ESJ
Fairness in AI and Its Long-Term Implications on Society
O. Bohdal*, T. Hospedales, P. Torr, F. Barez*
NeurIPS 2022 WS
System III: Learning with Domain Knowledge for Safety Constraints
F. Barez, H. Hasanbeig, A. Abate