Jailbreak Defense in a Narrow Domain: Limitations of Existing Methods and a New Transcript-Classifier Approach
T. T. Wang, J. Hughes, H. Sleight, R. Schaeffer, R. Agrawal, F. Barez, et al.
December 2024
Type: Preprint
Publication: arXiv:2412.02159
Safety & Alignment