NeurIPS

Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

I. Agarwal, S. Navani, F. Barez

Emerging Risks from Embodied AI Require Urgent Policy Action

J. Perlo, A. Robey, F. Barez, J. Mökander

Establishing Best Practices for Building Rigorous Agentic Benchmarks

Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, F. Barez, et al.

Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value

R. Lowe, J. Edelman, T. Zhi-Xuan, O. Klingefjord, E. Hain, V. Wang, A. Sarkar, F. Barez, et al.

Best-of-N Jailbreaking

J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez

Interpreting Learned Feedback Patterns in Large Language Models

L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*

DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models

A. Garde, E. Kran, F. Barez

Measuring Value Alignment

F. Barez, P. Torr

System III: Learning with Domain Knowledge for Safety Constraints

F. Barez, H. Hasanbeig, A. Abate