ACL/EMNLP

Beyond Linear Steering: Unified Multi-Attribute Control for Language Models

N. Oozeer, L. Marks, F. Barez, A. Abdullah

Precise In-Parameter Concept Erasure in Large Language Models

Y. Gur-Arieh, C. Suslik, Y. Hong, F. Barez, M. Geva

Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness

T. Fu, F. Barez

Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs

A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, Y. Belinkov

Interpreting Context Look-ups in Transformers: Investigating Attention-MLP Interactions

C. Neo*, S. B. Cohen, F. Barez*

Towards Interpretable Sequence Continuation: Analyzing Shared Circuits in Large Language Models

M. Lan, P. Torr, F. Barez

Large Language Models Relearn Removed Concepts

M. Lo*, S. B. Cohen, F. Barez*

Detecting Edit Failures in Large Language Models: An Improved Specificity Benchmark

J. Hoelscher-Obermaier*, J. Persson*, E. Kran, I. Konstas, F. Barez*

The Larger They Are, the Harder They Fail: Language Models Do Not Recognize Identifier Swaps in Python

A. V. M. Barone*, F. Barez*, I. Konstas, S. B. Cohen