Abstract
Although outlier detection has received significant attention from practitioners and researchers alike, applying it to datasets that contain both categorical and numerical attributes remains a challenge. This paper proposes a novel approach to that challenge, based on the local outlier factor (LOF) and a similarity measure. Occurrence frequency similarity is adopted to measure the closeness of categorical data, and a continuous distance is derived from it. The distances from the categorical and numerical attributes are then merged and fed into the LOF calculation to identify outliers. Test results on various datasets confirm that the proposed approach outperforms the simple numerical approach in all cases. This consistent superiority over the benchmark validates that the similarity measure successfully captures the characteristics of categorical data.
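The pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so several choices here are assumptions rather than the authors' method: the occurrence frequency similarity follows the commonly cited form (1 for a match, otherwise 1 / (1 + log(N/f(x)) · log(N/f(y)))), the similarity is turned into a distance as 1 − similarity and averaged over categorical attributes, numerical attributes are assumed to be pre-scaled and compared with Euclidean distance, and the two distances are merged by a plain sum. All function names are hypothetical, and the sample data is made up for illustration.

```python
import math
from collections import Counter


def of_distance_matrix(cat_rows):
    """Pairwise categorical distance from occurrence frequency similarity.

    Assumption: distance = 1 - similarity, averaged over attributes.
    """
    n, m = len(cat_rows), len(cat_rows[0])
    # Frequency of every value, counted separately per attribute.
    freqs = [Counter(row[k] for row in cat_rows) for k in range(m)]
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            total = 0.0
            for k in range(m):
                a, b = cat_rows[i][k], cat_rows[j][k]
                if a != b:  # matching values have similarity 1, distance 0
                    sim = 1.0 / (1.0 + math.log(n / freqs[k][a])
                                 * math.log(n / freqs[k][b]))
                    total += 1.0 - sim
            dist[i][j] = dist[j][i] = total / m
    return dist


def euclidean_distance_matrix(num_rows):
    """Pairwise Euclidean distance; inputs are assumed to be pre-scaled."""
    n = len(num_rows)
    return [[math.dist(num_rows[i], num_rows[j]) for j in range(n)]
            for i in range(n)]


def merge_distances(d_num, d_cat):
    """Assumption: the two distances are merged by an unweighted sum."""
    n = len(d_num)
    return [[d_num[i][j] + d_cat[i][j] for j in range(n)] for i in range(n)]


def lof_scores(dist, k):
    """Local outlier factor computed from a precomputed distance matrix."""
    n = len(dist)
    neighbors, k_dist = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i),
                       key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]
        k_dist.append(kd)
        # The k-neighborhood includes every point tied at the k-distance.
        neighbors.append([j for j in order if dist[i][j] <= kd])
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        # Guard against division by zero when points coincide.
        lrd.append(1.0 / max(sum(reach) / len(reach), 1e-12))
    return [sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
            for i in range(n)]


# Made-up sample: five similar records and one that differs in both parts.
num = [[0.10], [0.12], [0.11], [0.13], [0.09], [0.95]]
cat = [["a"], ["a"], ["a"], ["a"], ["a"], ["b"]]

d_cat = of_distance_matrix(cat)
combined = merge_distances(euclidean_distance_matrix(num), d_cat)
scores = lof_scores(combined, k=2)
most_outlying = max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the last record receives by far the largest LOF score, since it is distant from the others in both the numerical and the categorical part. A different similarity-to-distance conversion or a weighted merge would change the scores, so these choices should be checked against the full paper.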
| Original language | English |
|---|---|
| Pages (from-to) | 2155-2160 |
| Number of pages | 6 |
| Journal | ICIC Express Letters, Part B: Applications |
| Volume | 7 |
| Issue number | 10 |
| State | Published - 1 Oct 2016 |
Keywords
- Categorical data
- Local outlier factor
- Mixed type data
- Outlier detection
- Similarity