Article

AutoControl Arena: Synthesizing Executable Test Environments for Frontier AI Risk Evaluation

C. Li, P. Lu, X. Pan, F. Barez, M. Yang

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

A. Simhi, F. Barez, M. Tutek, Y. Belinkov, S. B. Cohen

Token Taxes: Mitigating AGI's Economic Risks

L. Irwin, T.-Y. Wu, F. Barez

Same Answer, Different Representations: Hidden Instability in VLMs

F. A. Wani, A. Suglia, R. Saxena, A. P. Gema, W. C. Kwan, F. Barez, et al.

The Hitchhiker's Guide to Actionable Interpretability

H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

Interpretability Can Be Actionable

H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.

Quantifying the Effect of Test Set Contamination on Generative Evaluations

R. Schaeffer, J. Kazdan, B. Abbasi, K. Z. Liu, B. Miranda, A. Ahmed, F. Barez, et al.

The Capability Frontier: Benchmarks Miss 82% of Model Performance

B. Fowler, R. Smith, D. T. Graviet, W. Myers, J. Greaves, N. F. Oozeer, A. García, et al.

When AI Systems Learn During Deployment, Our Safety Evaluations Break