Publications
Paper-Conference
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
T. Fu*, M. Sharma, P. Torr, S. B. Cohen, D. Krueger, F. Barez*
Scaling Sparse Feature Circuit Finding for In-Context Learning
D. Kharlapenko, S. Shabalin, F. Barez, A. Conmy, N. Nanda
In Which Areas of Technical AI Safety Could Geopolitical Rivals Cooperate?
B. Bucknall, S. Siddiqui, L. Thurnherr, C. McGurk, B. Harack, A. Reuel, F. Barez, et al.
Rethinking AI Cultural Alignment
M. Bravansky, F. Trhlík, F. Barez
Towards Interpreting Visual Information Processing in Vision-Language Models
C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, F. Barez
Best-of-N Jailbreaking
J. Hughes, S. Price, A. Lynch, R. Schaeffer, F. Barez, S. Koyejo, H. Sleight, E. Jones, E. Perez
Interpreting Learned Feedback Patterns in Large Language Models
L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
C. Neo*, S. B. Cohen, F. Barez*
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
M. Lan, P. Torr, F. Barez
Large Language Models Relearn Removed Concepts
M. Lo*, S. B. Cohen, F. Barez*