Paper-Conference

Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

I. Agarwal, S. Navani, F. Barez

Emerging Risks from Embodied AI Require Urgent Policy Action

J. Perlo, A. Robey, F. Barez, J. Mökander

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, F. Barez, et al.

Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value

R. Lowe, J. Edelman, T. Zhi-Xuan, O. Klingefjord, E. Hain, V. Wang, A. Sarkar, F. Barez, et al.

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

N. Oozeer, L. Marks, F. Barez, A. Abdullah

Precise In-Parameter Concept Erasure in Large Language Models

Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

T. Fu, F. Barez

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, Y. Belinkov

Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective

M. Kim, J. M. Kwak, L. Alssum, B. Ghanem, P. Torr, D. Krueger, F. Barez†, A. Bibi†

Do Sparse Autoencoders Generalize? A Case Study of Answerability

L. Heindrich, P. Torr, F. Barez, V. Thost