Abstract
In this paper, we propose a framework for measuring and comparing the performance of text similarity measurement methods between long and short text documents. We measure text similarity using four text feature extraction methods: 1) TF-IDF, 2) Rake, 3) Word2Vec, and 4) Doc2Vec. Using real news data and SNS data, we analyze the top-100 SNS posts that show the highest similarity to news describing events that occurred in the same region as the SNS user's residence. As a result, Rake, which identified 48% of the tweets associated with the news, performed better than Word2Vec (22%), Doc2Vec (22 to 24%), and TF-IDF (3%). Furthermore, applying the proposed framework to the problem of SNS user location prediction, we compare the top-100 SNS posts most similar to the news for the same region with those for another region. The results indicate that Rake shows a marked difference, with a 24-fold gap between the similarity for the same region and that for another region, compared to Word2Vec (2.2 times), Doc2Vec (2.4 to 4.4 times), and TF-IDF. The proposed framework reveals meaningful performance differences between the methods and shows the possibility of applying it to SNS user location prediction. We expect that it can be used in a variety of applications that need to analyze long and short data, such as news data and SNS data, comprehensively.
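To make the ranking step concrete, the following is a minimal sketch of how the TF-IDF variant of such a comparison could look: one long document (a news article) is scored against many short documents (SNS posts), and the top-k most similar posts are kept. The abstract does not state which similarity measure or TF-IDF weighting the authors used, so cosine similarity, the smoothed IDF formula, the function names (`tfidf_vectors`, `cosine`, `top_k_similar`), and the sample strings are all assumptions made for illustration only.

```python
import math
import re
from collections import Counter


def tokenize(text):
    # Lowercase and split on word characters (assumed preprocessing).
    return re.findall(r"\w+", text.lower())


def tfidf_vectors(docs):
    # Build sparse TF-IDF vectors (dicts) over a shared vocabulary.
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF; an assumed weighting, not taken from the paper.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        vectors.append({t: (cnt / len(toks)) * idf[t] for t, cnt in tf.items()})
    return vectors


def cosine(u, v):
    # Cosine similarity between two sparse vectors.
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(news, posts, k=100):
    # Rank short posts by similarity to one long news article.
    vectors = tfidf_vectors([news] + posts)
    news_vec, post_vecs = vectors[0], vectors[1:]
    scored = [(cosine(news_vec, pv), i) for i, pv in enumerate(post_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, score) for score, i in scored[:k]]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    news = "Heavy rain caused flooding in Seoul on Monday, closing several subway stations."
    posts = [
        "flooding near my subway station in seoul",
        "great coffee this morning",
        "seoul rain again",
    ]
    for index, score in top_k_similar(news, posts, k=2):
        print(index, round(score, 3), posts[index])
```

Swapping `tfidf_vectors` for a keyword extractor or an embedding model while keeping `top_k_similar` unchanged is one way the four methods could be compared under the same top-100 protocol; this structure is a design suggestion, not a description of the authors' implementation.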
| Translated title of the contribution | Performance Comparison of Similarity Measurement Methods between Long and Short Text Documents and Its Application to the Prediction of the SNS User Location |
|---|---|
| Original language | Korean |
| Pages (from-to) | 76-90 |
| Number of pages | 15 |
| Journal | 데이타베이스연구 |
| Volume | 36 |
| Issue number | 3 |
| State | Published - 2020 |