Interpretability
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
S. Schrodi, E. Kempf, F. Barez, T. Brox
Do Sparse Autoencoders Generalize? A Case Study of Answerability
L. Heindrich, P. Torr, F. Barez, V. Thost
Scaling Sparse Feature Circuit Finding for In-Context Learning
D. Kharlapenko, S. Shabalin, F. Barez, A. Conmy, N. Nanda
Towards Interpreting Visual Information Processing in Vision-Language Models
C. Neo, L. Ong, P. Torr, M. Geva, D. Krueger, F. Barez
Chain-of-Thought Is Not Explainability
F. Barez, T.-Y. Wu, I. Arcuschin, M. Lan, V. Wang, N. Siegel, N. Collignon, C. Neo, I. Lee, A. Paren, A. Bibi, R. Trager, D. Fornasiere, J. Yan, Y. Elazar, Y. Bengio
Interpreting Learned Feedback Patterns in Large Language Models
L. Marks*, A. Abdullah*, C. Neo, R. Arike, D. Krueger, P. Torr, F. Barez*
Enhancing Neural Network Interpretability with Feature-Aligned Sparse Autoencoders
L. Marks, A. Paren, D. Krueger, F. Barez
Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions
C. Neo*, S. B. Cohen, F. Barez*
Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models
M. Lan, P. Torr, F. Barez
Quantifying Feature Space Universality Across Large Language Models via Sparse Autoencoders
M. Lan, P. Torr, A. Meek, D. Krueger, F. Barez