Article
Increasing Trust in Language Models Through the Reuse of Verified Circuits
P. Quirke, C. Neo, F. Barez
Safeguarding AI in Finance: Lessons for Regulated Industries
F. Barez, L. Marks
Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training
E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, F. Barez, et al.
What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies
T. Clark, S. B. Cohen, F. Barez
AI Systems of Concern
K. Matteucci, S. Avin, F. Barez, S. Ó hÉigeartaigh
Fairness in AI and Its Long-Term Implications on Society
O. Bohdal*, T. Hospedales, P. H. S. Torr, F. Barez*
Identifying a Preliminary Circuit for Predicting Gendered Pronouns in GPT-2 Small
C. Mathwin, G. Corlouer, E. Kran, F. Barez, N. Nanda