Interpretability
Old Habits Die Hard: How Conversational History Geometrically Traps LLMs
A. Simhi, F. Barez, M. Tutek, Y. Belinkov, S. B. Cohen
Same Answer, Different Representations: Hidden Instability in VLMs
F. A. Wani, A. Suglia, R. Saxena, A. P. Gema, W. C. Kwan, F. Barez, et al.
The Hitchhiker's Guide to Actionable Interpretability
H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.
Automated Interpretability-Driven Model Auditing and Control: A Research Agenda
F. Barez
Interpretability Can Be Actionable
H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.
Quantifying the Effect of Test Set Contamination on Generative Evaluations
R. Schaeffer, J. Kazdan, B. Abbasi, K. Z. Liu, B. Miranda, A. Ahmed, F. Barez, et al.
Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios
I. Agarwal, S. Navani, F. Barez
Beyond Linear Steering: Unified Multi-Attribute Control for Language Models
N. Oozeer, L. Marks, F. Barez, A. Abdullah
Precise In-Parameter Concept Erasure in Large Language Models
Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva
Query Circuits: Explaining How Language Models Answer User Prompts
T.-Y. Wu, F. Barez