Safety & Alignment
When AI Systems Learn During Deployment, Our Safety Evaluations Break
F. Barez
Same Question, Different Words: A Latent Adversarial Framework for Prompt Robustness
T. Fu, F. Barez
Trust Me, I'm Wrong: High-Certainty Hallucinations in LLMs
A. Simhi, I. Itzhak, F. Barez, G. Stanovsky, Y. Belinkov
Chain-of-Thought Hijacking
J. Zhao, T. Fu, R. Schaeffer, M. Sharma, F. Barez
HACK: Hallucinations Along Certainty and Knowledge Axes
A. Simhi, J. Herzig, I. Itzhak, D. Arad, Z. Gekhman, R. Reichart, F. Barez, et al.
Rethinking Safety in LLM Fine-Tuning: An Optimization Perspective
M. Kim, J. M. Kwak, L. Alssum, B. Ghanem, P. Torr, D. Krueger, F. Barez†, A. Bibi†
Val-Bench: Measuring Value Alignment in Language Models
A. Gupta, D. O'Shea, F. Barez
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
J. Oldfield, P. Torr, I. Patras, A. Bibi, F. Barez
PoisonBench: Assessing Large Language Model Vulnerability to Data Poisoning
T. Fu*, M. Sharma, P. Torr, S. B. Cohen, D. Krueger, F. Barez*
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
M. Chaudhary, F. Barez