Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models
C. Denison, M. MacDiarmid, F. Barez, D. Duvenaud, S. Kravec, S. Marks, et al.
June 2024
Type: Preprint
Publication: arXiv:2406.10162
Safety & Alignment