Interpretability

Old Habits Die Hard: How Conversational History Geometrically Traps LLMs

A. Simhi, F. Barez, M. Tutek, Y. Belinkov, S. B. Cohen

Same Answer, Different Representations: Hidden Instability in VLMs

F. A. Wani, A. Suglia, R. Saxena, A. P. Gema, W. C. Kwan, F. Barez, et al.

The Hitchhiker's Guide to Actionable Interpretability

H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.

Automated Interpretability-Driven Model Auditing and Control: A Research Agenda

Interpretability Can Be Actionable

H. Orgad, F. Barez, T. Haklay, I. Lee, M. Mosbach, A. Reusch, N. Saphra, et al.

Quantifying the Effect of Test Set Contamination on Generative Evaluations

R. Schaeffer, J. Kazdan, B. Abbasi, K. Z. Liu, B. Miranda, A. Ahmed, F. Barez, et al.

Context Matters: Analyzing the Generalizability of Linear Probing and Steering Across Diverse Scenarios

I. Agarwal, S. Navani, F. Barez

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

N. Oozeer, L. Marks, F. Barez, A. Abdullah

Precise In-Parameter Concept Erasure in Large Language Models

Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva

Query Circuits: Explaining How Language Models Answer User Prompts

T.-Y. Wu, F. Barez