TY - JOUR
T1 - Trigonometric comparison measure
T2 - A feature selection method for text categorization
AU - Kim, Kyoungok
AU - Zzang, See Young
N1 - Publisher Copyright: © 2018 Elsevier B.V.
PY - 2019/1
Y1 - 2019/1
N2 - Text data represented using the vector space model is high-dimensional, since the number of words can easily grow to tens of thousands for a moderate-sized dataset. Such data may contain many redundant or irrelevant features that degrade the performance of a classifier for text categorization. To address this problem, feature selection can be applied for dimensionality reduction; it aims to find a set of highly distinguishing features. Most filter feature selection methods for text categorization are based on document frequencies in positive and negative classes. Considering only document frequencies favors terms frequently used in a larger class and ignores relative document frequencies in the classes. In this paper, we present a new filter feature selection method, named Trigonometric Comparison Measure (TCM), that considers relative document frequencies. The proposed method uses the true positive rate and false positive rate to determine a better subset of features for text categorization and prefers terms that appear only in documents of one class with high probability. In order to assign a higher rank to terms that are frequently used in one class and rarely appear in another class, TCM calculates the off-axis angles of a vector represented as (tpr, fpr) and, using sin and cos functions, gives a larger score to terms with a small angle. The proposed method is compared with eight well-known filter feature selection methods, including balanced accuracy measure (ACC2), information gain (IG), chi-squared (CHI), odds ratio (OR), Gini index (Gini), deviation from a Poisson distribution (DP), distinguishing feature selector (DFS) and normalized difference measure (NDM), on ten datasets using multinomial naïve Bayes and support vector machines. The experimental results show that TCM achieves significantly better performance for text categorization.
AB - Text data represented using the vector space model is high-dimensional, since the number of words can easily grow to tens of thousands for a moderate-sized dataset. Such data may contain many redundant or irrelevant features that degrade the performance of a classifier for text categorization. To address this problem, feature selection can be applied for dimensionality reduction; it aims to find a set of highly distinguishing features. Most filter feature selection methods for text categorization are based on document frequencies in positive and negative classes. Considering only document frequencies favors terms frequently used in a larger class and ignores relative document frequencies in the classes. In this paper, we present a new filter feature selection method, named Trigonometric Comparison Measure (TCM), that considers relative document frequencies. The proposed method uses the true positive rate and false positive rate to determine a better subset of features for text categorization and prefers terms that appear only in documents of one class with high probability. In order to assign a higher rank to terms that are frequently used in one class and rarely appear in another class, TCM calculates the off-axis angles of a vector represented as (tpr, fpr) and, using sin and cos functions, gives a larger score to terms with a small angle. The proposed method is compared with eight well-known filter feature selection methods, including balanced accuracy measure (ACC2), information gain (IG), chi-squared (CHI), odds ratio (OR), Gini index (Gini), deviation from a Poisson distribution (DP), distinguishing feature selector (DFS) and normalized difference measure (NDM), on ten datasets using multinomial naïve Bayes and support vector machines. The experimental results show that TCM achieves significantly better performance for text categorization.
KW - Dimension reduction
KW - Feature selection
KW - Text categorization
KW - Text classification
UR - http://www.scopus.com/inward/record.url?scp=85056630976&partnerID=8YFLogxK
U2 - 10.1016/j.datak.2018.10.003
DO - 10.1016/j.datak.2018.10.003
M3 - Article
AN - SCOPUS:85056630976
SN - 0169-023X
VL - 119
SP - 1
EP - 21
JO - Data and Knowledge Engineering
JF - Data and Knowledge Engineering
ER -