Interpretability
Mechanistic Interpretability Workshop at ICML 2024
F. Barez, M. Geva, L. Chan, A. Geiger, K. Yin, N. Nanda, et al.
The Scaling Behavior of Large Language Models
A. v. Miceli-Barone, F. Barez, S. B. Cohen, E. Voita, U. Germann, M. Lukasik
Visualizing Neural Network Imagination
N. Wichers, V. Tao, R. Volpato, F. Barez
Understanding Addition in Transformers
P. Quirke, F. Barez
Increasing Trust in Language Models Through the Reuse of Verified Circuits
P. Quirke, C. Neo, F. Barez
What Does GPT Store in Its MLP Weights? A Case Study of Long-Range Dependencies
T. Clark, S. B. Cohen, F. Barez
DeepDecipher: Accessing and Investigating Neuron Activation in Large Language Models
A. Garde, E. Kran, F. Barez
Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark
J. Hoelscher-Obermaier*, J. Persson*, E. Kran, I. Konstas, F. Barez*
The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python
A. v. Miceli-Barone*, F. Barez*, I. Konstas, S. B. Cohen
Neuron to Graph: Interpreting Language Model Neurons at Scale
A. Foote*, N. Nanda, E. Kran, I. Konstas, S. Cohen, F. Barez*