Publications

Preprint Token Taxes: Mitigating AGI's Economic Risks
L. Irwin, T.-Y. Wu, F. Barez
Preprint Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
A. Simhi, F. Barez, M. Tutek, Y. Belinkov, S. B. Cohen
Preprint AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation
C. Li, P. Lu, X. Pan, F. Barez, M. Yang
Preprint Same Answer, Different Representations: Hidden Instability in VLMs
F. A. Wani, A. Suglia, R. Saxena, A. P. Gema, W. C. Kwan, F. Barez, et al.
Preprint The Hitchhiker's Guide to Actionable Interpretability
H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.
Preprint When AI Systems Learn During Deployment, Our Safety Evaluations Break
F. Barez
Preprint The Capability Frontier: Benchmarks Miss 82% of Model Performance
B. Fowler, R. Smith, D. T. Graviet, W. Myers, J. Greaves, N. F. Oozeer, A. García, et al.
Preprint Quantifying the Effect of Test Set Contamination on Generative Evaluations
R. Schaeffer, J. Kazdan, B. Abbasi, K. Z. Liu, B. Miranda, A. Ahmed, F. Barez, et al.
Preprint Interpretability Can Be Actionable
H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.
Preprint Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
F. Barez
MLCommons Agentic Product Maturity Ladder V0.1
S. McGregor, D. Nathani, L. Saouma, F. Barez, A. Foundjem, et al.
NeurIPS 2025 WS Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value
R. Lowe, J. Edelman, T. Zhi-Xuan, O. Klingefjord, E. Hain, V. Wang, A. Sarkar, F. Barez, et al.
NeurIPS 2025 Establishing Best Practices for Building Rigorous Agentic Benchmarks
Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, F. Barez, et al.
NeurIPS 2025 Emerging Risks from Embodied AI Require Urgent Policy Action
J. Perlo, A. Robey, F. Barez, J. Mökander
NeurIPS 2025 WS Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
I. Agarwal, S. Navani, F. Barez
EMNLP 2025 Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, Y. Belinkov
EMNLP 2025 Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
T. Fu, F. Barez
EMNLP 2025 Precise In-Parameter Concept Erasure in Large Language Models
Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva
EMNLP 2025 Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
N. Oozeer, L. Marks, F. Barez, A. Abdullah
Preprint Val-Bench: Measuring Value Alignment in Language Models
A. Gupta, D. O'Shea, F. Barez
COLM 2025 Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective
M. Kim, J. M. Kwak, L. Alssum, B. Ghanem, P. Torr, D. Krueger, F. Barez†, A. Bibi†
Preprint HACK: Hallucinations Along Certainty and Knowledge Axes
A. Simhi, J. Herzig, I. Itzhak, D. Arad, Z. Gekhman, R. Reichart, F. Barez, et al.
Preprint Chain-of-Thought Hijacking
J. Zhao, T. Fu, R. Schaeffer, M. Sharma, F. Barez
Preprint Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
S. Schrodi, E. Kempf, F. Barez, T. Brox
Preprint Query Circuits: Explaining How Language Models Answer User Prompts
T.-Y. Wu, F. Barez
Preprint Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
J. Oldfield, P. Torr, I. Patras, A. Bibi, F. Barez
ICML 2025 Scaling Sparse Feature Circuit Finding for In-Context Learning
D. Kharlapenko, S. Shabalin, F. Barez, A. Conmy, N. Nanda
ICML 2025 PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
T. Fu*, M. Sharma, P. Torr, S. B. Cohen, D. Krueger, F. Barez*
ICML 2025 WS Do Sparse Autoencoders Generalize? A Case Study of Answerability
L. Heindrich, P. Torr, F. Barez, V. Thost
Preprint The Singapore Consensus on Global AI Safety Research Priorities
Y. Bengio, T. Maharaj, L. Ong, S. Russell, D. Song, M. Tegmark, L. Xue, F. Barez, et al.
FAccT 2025 In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
B. Bucknall, S. Siddiqui, L. Thurnherr, C. McGurk, B. Harack, A. Reuel, F. Barez, et al.
Preprint Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
P. Quirke, N. Oozeer, C. Bandi, A. Abdullah, J. Hoelscher-Obermaier, F. Barez, et al.
Preprint SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
M. Chaudhary, F. Barez
ICLR 2025 Towards Interpreting Visual Information Processing in Vision-Language Models
C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, F. Barez
ICLR 2025 WS Rethinking AI Cultural Alignment
M. Bravansky, F. Trhlík, F. Barez
Preprint AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al.
Preprint Chain-of-Thought Is Not Explainability
F. Barez, T.-Y. Wu, I. Arcuschin, M. Lan, V. Wang, N. Siegel, N. Collignon, C. Neo, I. Lee, A. Paren, A. Bibi, R. Trager, D. Fornasiere, J. Yan, Y. Elazar, Y. Bengio
OGI Oxford Verification for International AI Governance
B. Harack, R. Trager, A. Reuel, D. Manheim, M. Brundage, O. Aarne, et al.
Preprint Toward Resisting AI-Enabled Authoritarianism
F. Barez, I. Friend, K. Reid, I. Krawczuk, V. Wang, J. Mökander, P. Torr, J. Morse, R. Trager
Preprint Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI
M. Ziosi, J. Gealy, M. Plueckebaum, D. Kossack, S. Campos, L. Saouma, F. Barez, et al.
Preprint Plan B: Training LLMs to Fail Less Severely
J. Stastny, N. Warncke, D. Xu, A. Lynch, F. Barez, H. Sleight, E. Perez
Preprint Open Problems in Machine Unlearning for AI Safety
F. Barez, T. Fu, A. Prabhu, S. Casper, A. Sanyal, A. Bibi, A. O'Gara, R. Kirk, B. Bucknall, T. Fist, L. Ong, P. Torr, K.-Y. Lam, R. Trager, D. Krueger, S. Mindermann, J. Hernández-Orallo, M. Geva, Y. Gal
Preprint Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang, J. Hughes, H. Sleight, R. Schaeffer, R. Agrawal, F. Barez, et al.
NeurIPS 2024 Interpreting Learned Feedback Patterns in Large Language Models
L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*
NeurIPS 2025 Best-of-N Jailbreaking
J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez
EMNLP 2024 Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
M. Lan, P. Torr, F. Barez
EMNLP 2024 Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
C. Neo*, S. B. Cohen, F. Barez*
Preprint Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
L. Marks, A. Paren, D. Krueger, F. Barez
Preprint Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
M. Lan, P. Torr, A. Meek, D. Krueger, F. Barez
ACL 2024 Large Language Models Relearn Removed Concepts
M. Lo*, S. B. Cohen, F. Barez*
ICML 2024 WS Visualizing Neural Network Imagination
N. Wichers, V. Tao, R. Volpato, F. Barez
WS 2024 The Scaling Behavior of Large Language Models
A. V. Miceli-Barone, F. Barez, S. B. Cohen, E. Voita, U. Germann, M. Lukasik
ICML 2024 Position: Near to Mid-Term Risks and Opportunities of Open-Source Generative AI
F. Eiras, A. Petrov, B. Vidgen, C. S. de Witt, F. Pizzati, K. Elkins, F. Barez, et al.
ICML 2024 Mechanistic Interpretability Workshop at ICML 2024
F. Barez, M. Geva, L. Chan, A. Geiger, K. Yin, N. Nanda, et al.
Preprint Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, et al.
ICLR 2024 Understanding Addition in Transformers
P. Quirke, F. Barez
Preprint Increasing Trust in Language Models Through the Reuse of Verified Circuits
P. Quirke, C. Neo, F. Barez
Preprint What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies
T. Clark, S. B. Cohen, F. Barez
Preprint Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, F. Barez, et al.
SSRN Safeguarding AI in Finance: Lessons for Regulated Industries
F. Barez, L. Marks
NeurIPS 2023 WS Measuring Value Alignment
F. Barez, P. Torr
NeurIPS 2023 WS DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
A. Garde, E. Kran, F. Barez
Preprint AI Systems of Concern
K. Matteucci, S. Avin, F. Barez, S. Ó hÉigeartaigh
ACL 2023 The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python
A. V. Miceli-Barone*, F. Barez*, I. Konstas, S. B. Cohen
ACL 2023 Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark
J. Hoelscher-Obermaier*, J. Persson*, E. Kran, I. Konstas, F. Barez*
ICLR 2023 WS Neuron to Graph: Interpreting Language Model Neurons at Scale
A. Foote*, N. Nanda, E. Kran, I. Konstas, S. B. Cohen, F. Barez*
Policy The Alan Turing Institute's Response to the House of Lords Large Language Models Call for Evidence
F. Barez, P. H. S. Torr, A. Petrov, C. Ashurst, J. Ding, A. Janjeva, A. Babuta, et al.
Preprint Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
C. Mathwin, G. Corlouer, E. Kran, F. Barez, N. Nanda
ESJ Fairness in AI and Its Long-Term Implications on Society
O. Bohdal*, T. Hospedales, P. H. S. Torr, F. Barez*
NeurIPS 2022 WS System III: Learning with Domain Knowledge for Safety Constraints
F. Barez, H. Hasanbeig, A. Abate