A scalable feature based clustering algorithm for sequences with many distinct items

Sangheum Hwang, Dohyun Kim

Research output: Contribution to journal › Article › peer-review

1 Scopus citation

Abstract

Sequence data have grown explosively in recent years. As more of such data become available, clustering is needed to understand their structure. However, existing clustering algorithms for sequence data are computationally demanding. A feature-based clustering algorithm has been proposed to avoid this problem, but it uses only a subset of all possible frequent sequential patterns as features. This may distort the similarities between sequences in practice, especially for sequence data with a large number of distinct items, such as customer transaction data. This article develops a feature-based clustering algorithm that uses the complete set of frequent sequential patterns as features. It handles sequences of sets of items as well as sequences of single items, both of which may contain many distinct items. The proposed algorithm projects the sequence data into a feature space whose dimensions correspond to the complete set of frequent sequential patterns, and then applies the K-means clustering algorithm. Experimental results show that the proposed algorithm generates more meaningful clusters than the compared algorithms, regardless of the dataset and of parameters such as the minimum support value of frequent sequential patterns and the number of clusters. Moreover, the proposed algorithm can be applied to large sequence databases because it scales linearly with the number of sequences.
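The pipeline described in the abstract (mine frequent sequential patterns, project each sequence onto them, then run K-means) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes sequences of single items only, binary "pattern occurs in sequence" features, a brute-force level-wise miner capped at a hypothetical `max_len`, and a small hand-written K-means with farthest-first initialization. All function names and the toy data are invented for illustration.

```python
def is_subsequence(pattern, seq):
    """True if the items of pattern occur in seq in order (gaps allowed)."""
    it = iter(seq)
    return all(item in it for item in pattern)

def frequent_patterns(seqs, min_support, max_len=3):
    """Level-wise search for every pattern (up to max_len items) whose
    support, the fraction of sequences containing it, is >= min_support."""
    n = len(seqs)
    level = [(x,) for x in sorted({x for s in seqs for x in s})]
    frequent, freq_items = [], []
    for length in range(1, max_len + 1):
        kept = [p for p in level
                if sum(is_subsequence(p, s) for s in seqs) / n >= min_support]
        frequent.extend(kept)
        if length == 1:
            freq_items = [p[0] for p in kept]
        # Apriori property: only frequent patterns can grow into frequent ones.
        level = [p + (x,) for p in kept for x in freq_items]
        if not level:
            break
    return frequent

def project(seqs, patterns):
    """One dimension per pattern; 1.0 if the sequence contains it (assumed encoding)."""
    return [[1.0 if is_subsequence(p, s) else 0.0 for p in patterns] for s in seqs]

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20):
    """Lloyd's algorithm with deterministic farthest-first initialization."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(sq_dist(p, c) for c in centers)))
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: sq_dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def cluster_sequences(seqs, min_support, k):
    patterns = frequent_patterns(seqs, min_support)
    return kmeans(project(seqs, patterns), k), patterns

# Toy data (hypothetical): two groups of sequences over disjoint item sets.
toy = [list("abcab"), list("abcb"), list("aabc"),
       list("xyzx"), list("xyyz"), list("zxyz")]
labels, patterns = cluster_sequences(toy, min_support=0.5, k=2)
```

Once the pattern set is fixed, `project` touches each sequence once, which is in the spirit of the linear scalability claimed in the abstract. The mining step here is brute force and would need a dedicated frequent-sequence miner on real data.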

Original language: English
Pages (from-to): 316-325
Number of pages: 10
Journal: International Journal of Fuzzy Logic and Intelligent Systems
Volume: 18
Issue number: 4
State: Published - 2018

Keywords

  • Feature-based clustering
  • Frequent sequential patterns
  • Sequence data
