J4R: Learning to Judge with Equivalent Initial State Group Relative Policy Optimization

Austin Xu, Yilun Zhou, Xuan-Phi Nguyen, Caiming Xiong, and Shafiq Joty. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (ACL-26), 2026.