TY - GEN
T1 - Performance evaluation of spatial data management systems using GeoSpark
AU - Shin, Hansub
AU - Lee, Kisung
AU - Kwon, Hyuk Yoon
N1 - Publisher Copyright: © 2020 IEEE.
PY - 2020/2
Y1 - 2020/2
N2 - In this paper, we evaluate the performance of spatial data management systems in distributed computing environments. Given that GeoSpark outperforms other spatial systems in many scenarios as reported in several studies, we choose spatial data management systems using GeoSpark for this evaluation. Even though GeoSpark supports various storage engines as its underlying data store, the effects of these storage engines on spatial data processing have not been well studied. To address this limitation, we evaluate the performance of GeoSpark using two underlying data stores: 1) HDFS and 2) MongoDB. We first design and build distributed experimental environments based on Amazon EC2 and EMR using up to 10 nodes. Through extensive experiments on three synthetic and real data sets, we show that the overall performance of both HDFS- and MongoDB-based GeoSpark improves as we increase the number of nodes. We also show that HDFS-based GeoSpark generally outperforms MongoDB-based GeoSpark, especially for large-scale data sets. In addition, we demonstrate that the proper use of caching on HDFS-based GeoSpark can improve the overall query processing performance by up to three orders of magnitude.
AB - In this paper, we evaluate the performance of spatial data management systems in distributed computing environments. Given that GeoSpark outperforms other spatial systems in many scenarios as reported in several studies, we choose spatial data management systems using GeoSpark for this evaluation. Even though GeoSpark supports various storage engines as its underlying data store, the effects of these storage engines on spatial data processing have not been well studied. To address this limitation, we evaluate the performance of GeoSpark using two underlying data stores: 1) HDFS and 2) MongoDB. We first design and build distributed experimental environments based on Amazon EC2 and EMR using up to 10 nodes. Through extensive experiments on three synthetic and real data sets, we show that the overall performance of both HDFS- and MongoDB-based GeoSpark improves as we increase the number of nodes. We also show that HDFS-based GeoSpark generally outperforms MongoDB-based GeoSpark, especially for large-scale data sets. In addition, we demonstrate that the proper use of caching on HDFS-based GeoSpark can improve the overall query processing performance by up to three orders of magnitude.
KW - Distributed environments
KW - GeoSpark
KW - Large-scale spatial data
KW - Performance evaluation
UR - https://www.scopus.com/pages/publications/85084373440
U2 - 10.1109/BigComp48618.2020.00-75
DO - 10.1109/BigComp48618.2020.00-75
M3 - Conference contribution
AN - SCOPUS:85084373440
T3 - Proceedings - 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020
SP - 197
EP - 200
BT - Proceedings - 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020
A2 - Lee, Wookey
A2 - Chen, Luonan
A2 - Moon, Yang-Sae
A2 - Bourgeois, Julien
A2 - Bennis, Mehdi
A2 - Li, Yu-Feng
A2 - Ha, Young-Guk
A2 - Kwon, Hyuk-Yoon
A2 - Cuzzocrea, Alfredo
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2020 IEEE International Conference on Big Data and Smart Computing, BigComp 2020
Y2 - 19 February 2020 through 22 February 2020
ER -