Re-Clustering Documents to Enhance Search Accuracy with Imbalanced Abbreviation Data

Woon Kyo Lee, Ja Hee Kim

Research output: Contribution to journal, Article, peer-review

Abstract

Abbreviation ambiguity poses significant challenges when searching academic literature. This study evaluated the accuracy of clustering algorithms on imbalanced datasets with varying ratios of target groups. The corpus consisted of 1052 papers focused on the study of abbreviations. The "MSA" dataset was clustered using TF-IDF, cosine similarity, and k-means. Clustering performance declined as the target-group ratio deviated from balanced thresholds. A re-clustering method was introduced that selectively excludes non-target clusters. Re-clustering improved accuracy and F1 scores in most scenarios and was particularly stable with higher cluster counts. In comparisons, re-clustering performed better than k-means and self-adaptive methods. The study highlights issues stemming from data imbalance and presents an effective strategy for enhancing abbreviation search efficiency.
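
The pipeline named in the abstract (TF-IDF weighting, cosine similarity, k-means, then re-clustering after excluding non-target clusters) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the smoothed IDF weighting, the cosine-based (spherical) form of k-means, the toy documents, and the rule for choosing which cluster to drop are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def normalise(vec):
    """Scale a sparse vector (dict) to unit length."""
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs):
    """Return one unit-length sparse TF-IDF vector (a dict) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    doc_freq = {}
    for tokens in tokenised:
        for term in set(tokens):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    vectors = []
    for tokens in tokenised:
        vec = {}
        for term in tokens:
            vec[term] = vec.get(term, 0.0) + 1.0  # raw term frequency
        for term in vec:
            # Smoothed IDF: an assumed weighting, not taken from the paper.
            vec[term] *= math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
        vectors.append(normalise(vec))
    return vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (their dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iters=20, seed=0):
    """Cosine-based k-means: assign each vector to its most similar centroid."""
    k = min(k, len(vectors))
    rng = random.Random(seed)
    centroids = rng.sample(vectors, k)
    labels = []
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c]))
                  for v in vectors]
        for c in range(k):
            total = {}
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        total[t] = total.get(t, 0.0) + w
            if total:  # an empty cluster keeps its previous centroid
                centroids[c] = normalise(total)
    return labels


def recluster(vectors, labels, drop_clusters, k, seed=0):
    """Drop documents in the given non-target clusters, then cluster the rest.

    Returns {original document index: new cluster label}. The vectors are
    reused as-is; recomputing TF-IDF on the reduced corpus is another option.
    """
    kept = [i for i, label in enumerate(labels) if label not in drop_clusters]
    new_labels = kmeans([vectors[i] for i in kept], k, seed=seed)
    return dict(zip(kept, new_labels))


# Hypothetical toy corpus: three possible senses of "MSA".
docs = [
    "msa multiple sequence alignment protein genome",
    "msa sequence alignment genome tool",
    "msa multiple system atrophy patient brain",
    "msa system atrophy patient clinical",
    "msa modern standard arabic language corpus",
    "msa standard arabic language dialect",
]
vecs = tfidf_vectors(docs)
labels = kmeans(vecs, k=3)
# Assume the cluster holding document 4 has been judged non-target.
second_pass = recluster(vecs, labels, drop_clusters={labels[4]}, k=2)
```

In this sketch the caller decides which clusters are non-target; how the study makes that selection is not described in the abstract. Removing a large non-target cluster before the second pass changes the class ratio among the remaining documents, which is the imbalance effect the abstract refers to.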

Original language: English
Pages (from-to): 1845-1858
Number of pages: 14
Journal: Tehnicki Vjesnik
Volume: 31
Issue number: 6
State: Published - 2024

Keywords

  • K-means algorithm
  • Re-clustering
  • imbalanced data
  • word sense disambiguation
