Hard2Verify: A Step-Level Verification Benchmark for Open-Ended Frontier Math
Shrey Pandit, Austin Xu, Xuan-Phi Nguyen, Yifei Ming, Caiming Xiong, and Shafiq Joty. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL-26), 2026.