TY - GEN
T1 - A complete and fast scraping method for collecting tweets
AU - You, Jaebeom
AU - Lee, Jaekyu
AU - Kwon, Hyuk Yoon
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021/1
Y1 - 2021/1
AB - We propose DeepScrap, a scraping method for collecting tweets. DeepScrap completely scrapes the recent tweets that can be viewed on a specific user's page, and it crawls fast enough to overcome the rate limits of the Twitter APIs. In particular, to improve the crawling speed, we devise a multiprocessing architecture that assigns a different IP to each process so as to follow Twitter's robots.txt. This allows us to maximize the parallelism of crawling on a single machine. By analyzing the tweets of 97 users, we show that DeepScrap can crawl all the tweets that the Twitter standard APIs crawl. Through extensive experiments, we show that DeepScrap can crawl all the tweets of the 97 users, 222,194 tweets in total, whereas the Twitter standard API can crawl only 12,586 of them because of its constraints. We also show that, with four processes running simultaneously, multiprocessing DeepScrap improves on single-process DeepScrap by a factor of 2.97 when crawling the 222,194 tweets of the 97 users.
KW - Crawling
KW - Multiprocessing
KW - Tor Network
KW - Tweets
UR - http://www.scopus.com/inward/record.url?scp=85102972485&partnerID=8YFLogxK
U2 - 10.1109/BigComp51126.2021.00014
DO - 10.1109/BigComp51126.2021.00014
M3 - Conference contribution
AN - SCOPUS:85102972485
T3 - Proceedings - 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021
SP - 24
EP - 27
BT - Proceedings - 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021
A2 - Unger, Herwig
A2 - Kim, Jinho
A2 - Kang, U
A2 - So-In, Chakchai
A2 - Du, Junping
A2 - Saad, Walid
A2 - Ha, Young-guk
A2 - Wagner, Christian
A2 - Bourgeois, Julien
A2 - Sathitwiriyawong, Chanboon
A2 - Kwon, Hyuk-Yoon
A2 - Leung, Carson
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2021 IEEE International Conference on Big Data and Smart Computing, BigComp 2021
Y2 - 17 January 2021 through 20 January 2021
ER -