장문과 단문 간 유사도 측정 방법의 성능 비교 및 SNS 사용자 위치 예측 문제로의 적용

Translated title of the contribution: Performance Comparison of Similarity Measurement Methods between Long and Short Text Documents and Its Application to the Prediction of the SNS User Location

Research output: Contribution to journal › Article › peer-review

Abstract

In this paper, we propose a framework for measuring and comparing the performance of text similarity measurement methods between long and short text documents. We use four text feature extraction methods to measure text similarity: 1) TF-IDF, 2) Rake, 3) Word2Vec, and 4) Doc2Vec. Using real news data and SNS data, we analyze the top-100 SNS posts that show the highest similarity to news articles describing events that occurred in the same region as the SNS user's residence. As a result, Rake, which identified 48% of the tweets associated with the news, performed better than Word2Vec (22%), Doc2Vec (22-24%), and TF-IDF (3%). Furthermore, we apply the proposed framework to the problem of SNS user location prediction by analyzing the top-100 SNS posts most similar to news from the same region and those most similar to news from another region. The results indicate that Rake separates the two cases most clearly, with a 24-fold difference between the similarity for the same region and that for another region, compared with Word2Vec (2.2 times), Doc2Vec (2.4-4.4 times), and TF-IDF. The proposed framework reveals meaningful performance differences between the methods and shows that it may be applicable to SNS user location prediction. We expect that it can be used in a variety of applications that need to analyze long and short texts together, such as news data and SNS data.
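
The ranking step described in the abstract (scoring short SNS posts against a long news article and keeping the most similar ones) can be illustrated with a minimal sketch for the TF-IDF case. This is not the authors' implementation: the whitespace tokenizer, the smoothed IDF formula, the use of cosine similarity, and the names `tfidf_vectors`, `cosine`, and `top_k_similar` are all assumptions made for illustration, and the sample texts are invented.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: lowercase + whitespace split; a real pipeline would use
    # a language-appropriate tokenizer.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict: term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1
        vectors.append({
            # Term frequency times a smoothed inverse document frequency.
            term: (count / total)
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0)
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(news, posts, k=100):
    """Rank short posts by similarity to one long news text; keep the top k."""
    vectors = tfidf_vectors([news] + posts)
    news_vec, post_vecs = vectors[0], vectors[1:]
    scored = [(cosine(news_vec, pv), post) for pv, post in zip(post_vecs, posts)]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    # Invented sample data, for illustration only.
    news = "flood warning issued in seoul after heavy rain"
    posts = ["heavy rain in seoul today", "i love pizza", "flood in seoul"]
    for score, post in top_k_similar(news, posts, k=2):
        print(f"{score:.3f}  {post}")
```

Swapping `tfidf_vectors` for keyword, Word2Vec, or Doc2Vec features while keeping `top_k_similar` unchanged is one way the four methods could be compared on the same top-100 ranking.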
Original language: Korean
Pages (from-to): 76-90
Number of pages: 15
Journal: 데이타베이스연구
Volume: 36
Issue number: 3
State: Published - 2020
