Publications
Chain-of-Thought Hijacking
J. Zhao, T. Fu, R. Schaeffer, M. Sharma, F. Barez
HACK: Hallucinations Along Certainty and Knowledge Axes
A. Simhi, J. Herzig, I. Itzhak, D. Arad, Z. Gekhman, R. Reichart, F. Barez, et al.
Val-Bench: Measuring Value Alignment in Language Models
A. Gupta, D. O'Shea, F. Barez
Beyond Linear Probes: Dynamic Safety Monitoring for Language Models
J. Oldfield, P. Torr, I. Patras, A. Bibi, F. Barez
Query Circuits: Explaining How Language Models Answer User Prompts
T.-Y. Wu, F. Barez
Towards Understanding Subliminal Learning: When and How Hidden Biases Transfer
S. Schrodi, E. Kempf, F. Barez, T. Brox
Beyond Monoliths: Expert Orchestration for More Capable, Democratic, and Safe Language Models
P. Quirke, N. Oozeer, C. Bandi, A. Abdullah, J. Hoelscher-Obermaier, F. Barez, et al.
The Singapore Consensus on Global AI Safety Research Priorities
Y. Bengio, T. Maharaj, L. Ong, S. Russell, D. Song, M. Tegmark, L. Xue, F. Barez, et al.
SafetyNet: Detecting Harmful Outputs in LLMs by Modeling and Monitoring Deceptive Behaviors
M. Chaudhary, F. Barez
AILuminate: Introducing v1.0 of the AI Risk and Reliability Benchmark from MLCommons
S. Ghosh, H. Frase, A. Williams, S. Luger, P. Röttger, F. Barez, S. McGregor, et al.