A complete and fast scraping method for collecting tweets

Jaebeom You, Jaekyu Lee, Hyuk Yoon Kwon

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution, peer-review

8 Scopus citations

Abstract

In this paper, we propose a scraping method for collecting tweets, which we call DeepScrap. DeepScrap completely scrapes the recent tweets that can be viewed on a specific user's page and crawls at a speed that overcomes the rate limits of the Twitter APIs. In particular, to improve the crawling speed of DeepScrap, we devise a multiprocessing architecture that assigns a different IP to each process in order to follow Twitter's robots.txt. This allows us to maximize the parallelism of crawling on a single machine. By analyzing the tweets of 97 users, we show that DeepScrap can crawl all of the tweets that are crawled by the Twitter standard APIs. Through extensive experiments, we show that DeepScrap can crawl all the tweets of the 97 users, which amounts to 222,194 tweets, while the Twitter standard API can crawl only 12,586 of them because of its constraints. We also show that, with four processes running simultaneously, multiprocessing DeepScrap improves on single-process DeepScrap by a factor of 2.97 when crawling the 222,194 tweets of the 97 users.
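The multiprocessing arrangement described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the proxy endpoints in `PROXIES`, the round-robin `partition` helper, and the `crawl_user` placeholder (which performs no network I/O) are all assumptions introduced here only to show one way of giving each process its own outbound address.

```python
from multiprocessing import Pool

# Hypothetical proxy endpoints, one per worker process (for example, separate
# local SOCKS ports). These values are placeholders, not taken from the paper.
PROXIES = [f"socks5h://127.0.0.1:{9050 + i}" for i in range(4)]


def partition(users, n_workers):
    """Split users round-robin into n_workers lists of near-equal size."""
    return [users[i::n_workers] for i in range(n_workers)]


def crawl_user(user, proxy):
    """Placeholder for the per-user scraping step; performs no network I/O."""
    return []


def worker(task):
    """Crawl every user in one partition through that partition's proxy."""
    proxy, users = task
    return {user: crawl_user(user, proxy) for user in users}


def crawl_all(users, proxies=PROXIES):
    """Run one process per proxy and merge the per-process results."""
    tasks = list(zip(proxies, partition(users, len(proxies))))
    merged = {}
    with Pool(processes=len(proxies)) as pool:
        for result in pool.map(worker, tasks):
            merged.update(result)
    return merged


def parallel_efficiency(speedup, n_workers):
    """Fraction of ideal linear speedup actually achieved."""
    return speedup / n_workers


if __name__ == "__main__":
    users = [f"user{i}" for i in range(97)]  # stand-in user names
    tweets = crawl_all(users)
    print(len(tweets))
    print(parallel_efficiency(2.97, 4))
```

Reading the reported factor of 2.97 as a speedup, four processes correspond to a parallel efficiency of roughly 0.74, that is, about three quarters of the ideal fourfold gain.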

Original language: English
Title of host publication: Proceedings - 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021
Editors: Herwig Unger, Jinho Kim, U Kang, Chakchai So-In, Junping Du, Walid Saad, Young-guk Ha, Christian Wagner, Julien Bourgeois, Chanboon Sathitwiriyawong, Hyuk-Yoon Kwon, Carson Leung
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 24-27
Number of pages: 4
ISBN (Electronic): 9781728189246
State: Published - Jan 2021
Event: 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021 - Jeju Island, Korea, Republic of
Duration: 17 Jan 2021 to 20 Jan 2021

Publication series

Name: Proceedings - 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021

Conference

Conference: 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021
Country/Territory: Korea, Republic of
City: Jeju Island
Period: 17/01/21 to 20/01/21

Keywords

  • Crawling
  • Multiprocessing
  • Tor Network
  • Tweets
