NeurIPS
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
I. Agarwal, S. Navani, F. Barez
Emerging Risks from Embodied AI Require Urgent Policy Action
J. Perlo, A. Robey, F. Barez, J. Mökander
Establishing Best Practices for Building Rigorous Agentic Benchmarks
Y. Zhu, T. Jin, Y. Pruksachatkun, A. Zhang, S. Liu, S. Cui, S. Kapoor, F. Barez, et al.
Full-Stack Alignment: Co-Aligning AI and Institutions with Thicker Models of Value
R. Lowe, J. Edelman, T. Zhi-Xuan, O. Klingefjord, E. Hain, V. Wang, A. Sarkar, F. Barez, et al.
Best-of-N Jailbreaking
J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez
Interpreting Learned Feedback Patterns in Large Language Models
L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
A. Garde, E. Kran, F. Barez
Measuring Value Alignment
F. Barez, P. Torr
System III: Learning with Domain Knowledge for Safety Constraints
F. Barez, H. Hasanbeig, A. Abbate