While large language models have the potential to benefit society greatly, it is important that we use them responsibly. The recently released ChatGPT has drawn wide attention, with ongoing debate over whether its impressive performance stems from memorization of its massive training data or from genuine complex reasoning capability for solving challenging tasks. This paper aims to evaluate these questions systematically. In particular, it investigates ChatGPT's performance on academic benchmark datasets, assesses the quality of its generated text, and explores its open-domain knowledge, commonsense reasoning, and emergent capabilities. In addition, we study its potential limitations, such as biases, misinformation generation, and ethical concerns. Our extensive evaluation shows that although ChatGPT can perform a wide variety of tasks and may achieve impressive results on several benchmark datasets, it is still far from reliably solving many language processing tasks in challenging scenarios.
A Systematic Study of ChatGPT on Benchmark Datasets
Md Tahmid Rahman Laskar, M Saiful Bari, Mizanur Rahman, Md Amran Hossen Bhuiyan, Shafiq Joty, and Jimmy Xiangji Huang. In Findings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL'23 Findings), 2023.