In the rapidly evolving field of machine learning (ML), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of Large Language Models (LLMs) on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. We provide a comprehensive overview of methods leveraging LLMs for DA, including a novel exploration of learning paradigms where LLM-generated data is used for further training, thus enhancing model robustness and performance. Additionally, this paper delineates the primary challenges faced in this domain, ranging from controllable data augmentation to multimodal data augmentation. By highlighting the paradigm shift introduced by LLMs in DA, this survey aims to serve as a foundational guide for researchers and practitioners in the field.
Data Augmentation using LLMs: Methods, Learning Paradigms and Challenges
Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh-Tuan Luu, and Shafiq Joty. In Findings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL 2024 Findings), 2024.
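To make the learning paradigm described in the abstract concrete, the sketch below shows one common LLM-based DA pattern: paraphrasing each labeled example to diversify the training set while preserving labels. This is a minimal illustration, not the paper's method; `call_llm`, its prompt format, and the deterministic placeholder response are assumptions standing in for a real LLM API call.

```python
# Minimal sketch of LLM-based data augmentation via paraphrasing.
# `call_llm` is a hypothetical stand-in for a real chat-completion API;
# here it applies a trivial deterministic rewrite so the sketch runs offline.

def call_llm(prompt: str) -> str:
    # Placeholder: a real system would query an LLM with this prompt.
    text = prompt.split("Paraphrase: ", 1)[1]
    return "In other words, " + text

def augment(dataset, n_variants=1):
    """Return the original (text, label) pairs plus LLM paraphrases,
    each paraphrase inheriting its source example's label."""
    augmented = []
    for text, label in dataset:
        augmented.append((text, label))
        for _ in range(n_variants):
            paraphrase = call_llm(f"Paraphrase: {text}")
            augmented.append((paraphrase, label))
    return augmented

train = [("the movie was great", "pos"), ("terrible acting", "neg")]
augmented = augment(train)
# With n_variants=1 the dataset doubles in size: 2 originals + 2 paraphrases.
```

The augmented pairs can then be fed back into standard fine-tuning, which is the "LLM-generated data used for further training" paradigm the survey examines; in practice one would also filter low-quality generations before training.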