The Capability Frontier: Benchmarks Miss 82% of Model Performance

Publication: Preprint