TY - JOUR
T1 - Comparative study of term-weighting schemes for environmental big data using machine learning
AU - Kim, Jung Jin
AU - Kim, Han Ul
AU - Adamowski, Jan
AU - Hatami, Shadi
AU - Jeong, Hanseok
N1 - Publisher Copyright: © 2022 Elsevier Ltd
PY - 2022/11
Y1 - 2022/11
N2 - Widely-used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF–inverse document frequency (TF-IDF), Best Match 25 (BM25), TF–inverse gravity moment (TF-IGM), and TF–IDF–inverse class frequency (TF-IDF-ICF)] and five different ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal text-classification scheme and classifier were TF-IDF-ICF and LR, respectively. Based on evaluation criteria, their combination resulted in the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance being achieved for climate, and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
AB - Widely-used term-weighting schemes and machine learning (ML) classifiers with default parameter settings were assessed for their performance when applied to environmental big data analysis. Five term-weighting schemes [term frequency (TF), TF–inverse document frequency (TF-IDF), Best Match 25 (BM25), TF–inverse gravity moment (TF-IGM), and TF–IDF–inverse class frequency (TF-IDF-ICF)] and five different ML classifiers [support vector machine (SVM), Naive Bayes (NB), logistic regression (LR), random forest (RF), and extreme gradient boosting (XGBoost)] were tested. The optimal text-classification scheme and classifier were TF-IDF-ICF and LR, respectively. Based on evaluation criteria, their combination resulted in the best performance of all scheme and classifier combinations for the full environmental data analysis. Category classification performance differed according to the environmental section (climate, air, water, or waste/garbage), with the best performance being achieved for climate, and the poorest for water. This demonstrated the importance of selecting term-weighting schemes and ML classifiers in human-generated environmental big data analysis.
KW - Environmental digital news
KW - Feature selection
KW - Term-weighting schemes
KW - Text classification
UR - https://www.scopus.com/pages/publications/85139046885
U2 - 10.1016/j.envsoft.2022.105536
DO - 10.1016/j.envsoft.2022.105536
M3 - Article
AN - SCOPUS:85139046885
SN - 1364-8152
VL - 157
JO - Environmental Modelling and Software
JF - Environmental Modelling and Software
M1 - 105536
ER -