TY - JOUR
T1 - An efficient method to determine sample size in oversampling based on classification complexity for imbalanced data
AU - Lee, Dohyun
AU - Kim, Kyoungok
N1 - Publisher Copyright:
© 2021 Elsevier Ltd
PY - 2021/12/1
Y1 - 2021/12/1
N2 - Resampling, one of the approaches to handling class imbalance, is widely used alone or in combination with other approaches such as cost-sensitive learning and ensemble learning, because of its simplicity and independence from the learning algorithm. Oversampling methods, in particular, alleviate class imbalance by increasing the size of the minority class. However, previous studies on oversampling have generally focused on where to add new samples, how to generate new samples, and how to prevent noise, and have rarely investigated how much sampling is sufficient. In many cases, the oversampling size is set so that the minority class reaches the same size as the majority class. This setting considers only the class sizes when determining the sample size, and the resulting balanced training set can induce overfitting through the addition of too many minority samples. Moreover, the effectiveness of oversampling can be improved by adding synthetic samples at appropriate locations. To address this issue, this study proposes a method that determines an oversampling size smaller than the sample size needed to balance the classes, considering not only the absolute imbalance but also the difficulty of classification in a dataset, on the basis of classification complexity. The effectiveness of the proposed sample size in oversampling is evaluated using several boosting algorithms with different oversampling methods on 16 imbalanced datasets. The results show that the proposed sample size achieves better classification performance than the sample size required to attain class balance.
KW - Adaptive boosting
KW - Class imbalance
KW - Ensemble learning
KW - Oversampling
KW - Sampling size
UR - http://www.scopus.com/inward/record.url?scp=85108998812&partnerID=8YFLogxK
U2 - 10.1016/j.eswa.2021.115442
DO - 10.1016/j.eswa.2021.115442
M3 - Article
AN - SCOPUS:85108998812
SN - 0957-4174
VL - 184
JO - Expert Systems with Applications
JF - Expert Systems with Applications
M1 - 115442
ER -