TY - JOUR
T1 - Text augmentation method with adjustable manipulation intensity based on in-context learning
AU - Cha, Yuho
AU - Lee, Younghoon
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2025.
PY - 2025/7
Y1 - 2025/7
N2 - Text augmentation, a technique for generating new samples through various combinations, noise, and manipulations of small datasets, is essential in natural language processing research. This methodology enables the construction of robust models during the training step by enhancing data diversity. However, determining the manipulation level remains a significant challenge. When the manipulation intensity is too low, insufficient data diversity is generated, leading to suboptimal augmentation effects. Conversely, excessive manipulation can compromise label reliability, resulting in a degradation of model performance. To address the challenge of “manipulation level,” we propose a text augmentation technique that supports systematic adjustment. In particular, we introduce a method for flexibly resetting the range of the candidate pool for manipulations, ensuring an optimal level of randomness during the augmentation process. We also introduce an advanced sentence embedding that supports reliable pseudo-labeling across different manipulation levels. Additionally, we utilize the ChatGPT model in the final stage to enhance the coherence and expressiveness of the generated text, thereby improving the quality of the output. To evaluate the effectiveness of our approach, we performed comparisons with existing text augmentation approaches. The experimental results show significant performance improvements in almost all test datasets.
AB - Text augmentation, a technique for generating new samples through various combinations, noise, and manipulations of small datasets, is essential in natural language processing research. This methodology enables the construction of robust models during the training step by enhancing data diversity. However, determining the manipulation level remains a significant challenge. When the manipulation intensity is too low, insufficient data diversity is generated, leading to suboptimal augmentation effects. Conversely, excessive manipulation can compromise label reliability, resulting in a degradation of model performance. To address the challenge of “manipulation level,” we propose a text augmentation technique that supports systematic adjustment. In particular, we introduce a method for flexibly resetting the range of the candidate pool for manipulations, ensuring an optimal level of randomness during the augmentation process. We also introduce an advanced sentence embedding that supports reliable pseudo-labeling across different manipulation levels. Additionally, we utilize the ChatGPT model in the final stage to enhance the coherence and expressiveness of the generated text, thereby improving the quality of the output. To evaluate the effectiveness of our approach, we performed comparisons with existing text augmentation approaches. The experimental results show significant performance improvements in almost all test datasets.
KW - Adjustable manipulation intensity
KW - Advanced sentence embedding
KW - In-context learning
KW - Reliable pseudo-labels
KW - Text augmentation
UR - https://www.scopus.com/pages/publications/105002345218
U2 - 10.1007/s10115-025-02413-6
DO - 10.1007/s10115-025-02413-6
M3 - Article
AN - SCOPUS:105002345218
SN - 0219-1377
VL - 67
SP - 5901
EP - 5923
JO - Knowledge and Information Systems
JF - Knowledge and Information Systems
IS - 7
ER -