Article
Chain-of-Thought Is Not Explainability
F. Barez, T.-Y. Wu, I. Arcuschin, M. Lan, V. Wang, N. Siegel, N. Collignon, C. Neo, I. Lee, A. Paren, A. Bibi, R. Trager, D. Fornasiere, J. Yan, Y. Elazar, Y. Bengio
Open Problems in Machine Unlearning for AI Safety
F. Barez, T. Fu, A. Prabhu, S. Casper, A. Sanyal, A. Bibi, A. O'Gara, R. Kirk, B. Bucknall, T. Fist, L. Ong, P. Torr, K.-Y. Lam, R. Trager, D. Krueger, S. Mindermann, J. Hernández-Orallo, M. Geva, Y. Gal
Plan B: Training LLMs to Fail Less Severely
J. Stastny, N. Warncke, D. Xu, A. Lynch, F. Barez, H. Sleight, E. Perez
Safety Frameworks and Standards: A Comparative Analysis to Advance Risk Management of Frontier AI
M. Ziosi, J. Gealy, M. Plueckebaum, D. Kossack, S. Campos, L. Saouma, F. Barez, et al.
Toward Resisting AI-Enabled Authoritarianism
F. Barez, I. Friend, K. Reid, I. Krawczuk, V. Wang, J. Mökander, P. Torr, J. Morse, R. Trager
Verification for International AI Governance
B. Harack, R. Trager, A. Reuel, D. Manheim, M. Brundage, O. Aarne, et al.
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang, J. Hughes, H. Sleight, R. Schaeffer, R. Agrawal, F. Barez, et al.
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
L. Marks, A. Paren, D. Krueger, F. Barez
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
M. Lan, P. Torr, A. Meek, D. Krueger, F. Barez
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, et al.