TY - JOUR
T1 - Regulating the level of manipulation in text augmentation with systematic adjustment and advanced sentence embedding
AU - Cha, Yuho
AU - Lee, Younghoon
N1 - Publisher Copyright:
© The Author(s), under exclusive licence to Springer-Verlag London Ltd., part of Springer Nature 2024.
PY - 2025/2
Y1 - 2025/2
N2 - Text augmentation, a method for generating samples by applying combinations, noise, and other manipulations to a small dataset, is a crucial technique in natural language processing research. It introduces diversity into the training process, thereby enabling the construction of robust models. The level of manipulation is the most important issue in text augmentation; low-level manipulation generates data similar to the original, resulting in inefficient augmentation because it cannot ensure diversity, whereas high-level manipulation causes reliability issues for labels and degrades the model’s performance. Therefore, this paper proposes a systematically adjustable text augmentation technique to address the “level of manipulation” issue. Specifically, it proposes a method for systematically adjusting the data candidate pool for manipulation to provide an appropriate level of randomness during the augmentation process. Furthermore, we propose an advanced sentence embedding methodology to achieve robust pseudo-labeling at the manipulation level. In other words, we leverage combined sentence embedding, which incorporates sentence embedding, document embedding, and XAI information from the original data to assign reliable pseudo-labels. We conducted performance comparisons with existing text augmentation approaches to validate the effectiveness of our proposed methodology. The experimental results demonstrate that the proposed method achieves the highest performance improvement across all the experimental datasets.
AB - Text augmentation, a method for generating samples by applying combinations, noise, and other manipulations to a small dataset, is a crucial technique in natural language processing research. It introduces diversity into the training process, thereby enabling the construction of robust models. The level of manipulation is the most important issue in text augmentation; low-level manipulation generates data similar to the original, resulting in inefficient augmentation because it cannot ensure diversity, whereas high-level manipulation causes reliability issues for labels and degrades the model’s performance. Therefore, this paper proposes a systematically adjustable text augmentation technique to address the “level of manipulation” issue. Specifically, it proposes a method for systematically adjusting the data candidate pool for manipulation to provide an appropriate level of randomness during the augmentation process. Furthermore, we propose an advanced sentence embedding methodology to achieve robust pseudo-labeling at the manipulation level. In other words, we leverage combined sentence embedding, which incorporates sentence embedding, document embedding, and XAI information from the original data to assign reliable pseudo-labels. We conducted performance comparisons with existing text augmentation approaches to validate the effectiveness of our proposed methodology. The experimental results demonstrate that the proposed method achieves the highest performance improvement across all the experimental datasets.
KW - Advanced sentence embedding
KW - Reliable pseudo-labels
KW - Text augmentation
KW - The level of manipulation
UR - http://www.scopus.com/inward/record.url?scp=85212080894&partnerID=8YFLogxK
U2 - 10.1007/s00521-024-10663-8
DO - 10.1007/s00521-024-10663-8
M3 - Article
AN - SCOPUS:85212080894
SN - 0941-0643
VL - 37
SP - 3473
EP - 3487
JO - Neural Computing and Applications
JF - Neural Computing and Applications
IS - 5
M1 - 107732
ER -