Sycophancy to Subterfuge: Investigating Reward-Tampering in Large Language Models

Publication
arXiv:2406.10162