Safety & Alignment
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al.
Open Problems in Machine Unlearning for AI Safety
F. Barez, T. Fu, A. Prabhu, S. Casper, A. Sanyal, A. Bibi, A. O'Gara, R. Kirk, B. Bucknall, T. Fist, L. Ong, P. Torr, K.-Y. Lam, R. Trager, D. Krueger, S. Mindermann, J. Hernández-Orallo, M. Geva, Y. Gal
Plan B: Training LLMs to Fail Less Severely
J. Stastny, N. Warncke, D. Xu, A. Lynch, F. Barez, H. Sleight, E. Perez
Best-of-N Jailbreaking
J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez
Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang, J. Hughes, H. Sleight, R. Schaeffer, R. Agrawal, F. Barez, et al.
Large Language Models Relearn Removed Concepts
M. Lo*, S. B. Cohen, F. Barez*
Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, et al.
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, F. Barez, et al.
Measuring Value Alignment
F. Barez, P. Torr
System III: Learning with Domain Knowledge for Safety Constraints
F. Barez, H. Hasanbeig, A. Abbate