Abstract
This paper presents Sphinx, a full-fledged open-source system for big spatial data that overcomes the limitations of existing systems by adopting a standard SQL interface and by providing a highly efficient core built inside the Apache Impala system. Sphinx is composed of four main layers: query parser, indexer, query planner, and query executor. The query parser injects spatial data types and functions into the SQL interface of Sphinx. The indexer creates spatial indexes in Sphinx by adopting a two-layered index design. The query planner utilizes these indexes to construct efficient query plans for range query and spatial join operations. Finally, the query executor carries out these plans on big spatial datasets in a distributed cluster. A system prototype of Sphinx running on real datasets shows up to three orders of magnitude performance improvement over plain-vanilla Impala, SpatialHadoop, and PostGIS.
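To make the idea of a two-layered index concrete, the following is a minimal single-machine sketch in Python. It is not Sphinx's implementation: the abstract does not say which partitioning or local index structures are used, so this sketch assumes a uniform grid as the global layer and an x-sorted list per cell as the local layer. The class name `TwoLayerGridIndex` and its parameters are hypothetical.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict


class TwoLayerGridIndex:
    """Toy two-layer index over 2-D points (illustrative assumption only).

    Global layer: a uniform grid that assigns each point to a partition.
    Local layer: each partition keeps its points sorted by (x, y).
    """

    def __init__(self, points, cell_size):
        self.cell_size = cell_size
        cells = defaultdict(list)
        for x, y in points:
            cells[self._cell(x, y)].append((x, y))
        # Build the local layer: sort the points inside every partition.
        self.partitions = {cell: sorted(pts) for cell, pts in cells.items()}

    def _cell(self, x, y):
        # Map a coordinate to the grid cell (partition) that contains it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def range_query(self, x1, y1, x2, y2):
        """Return all points inside the closed rectangle [x1, x2] x [y1, y2]."""
        cx1, cy1 = self._cell(x1, y1)
        cx2, cy2 = self._cell(x2, y2)
        result = []
        for (cx, cy), pts in self.partitions.items():
            # Global pruning: skip partitions that cannot overlap the query.
            if not (cx1 <= cx <= cx2 and cy1 <= cy <= cy2):
                continue
            # Local search: binary search on x, then filter on y.
            lo = bisect_left(pts, (x1, float("-inf")))
            hi = bisect_right(pts, (x2, float("inf")))
            result.extend(p for p in pts[lo:hi] if y1 <= p[1] <= y2)
        return sorted(result)


points = [(1, 1), (2, 8), (5, 5), (9, 9), (12, 3)]
index = TwoLayerGridIndex(points, cell_size=5)
print(index.range_query(0, 0, 6, 6))  # → [(1, 1), (5, 5)]
```

In this sketch the global layer decides which partitions a range query must touch at all, and the local layer limits the work done inside each surviving partition. In a distributed setting the partitions would be spread across cluster nodes, which this example does not model.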
Notes
1. The idea of Sphinx was first introduced as a poster [13].
2. The project home page is http://www.spatialworx.com/sphinx/ and the source code is available at https://github.com/gistic/SpatialImpala.
References
Markram, H.: The blue brain project. Nat. Rev. Neurosci. 7(2), 153–160 (2006)
Auchincloss, A., et al.: A review of spatial methods in epidemiology: 2000–2010. Annu. Rev. Public Health 33, 107–122 (2012)
Faghmous, J., Kumar, V.: Spatio-temporal data mining for climate data: advances, challenges, and opportunities. In: Chu, W. (ed.) Data Mining and Knowledge Discovery for Big Data. Studies in Big Data, vol. 1, pp. 83–116. Springer, Heidelberg (2014). doi:10.1007/978-3-642-40837-3_3
Sankaranarayanan, J., Samet, H., Teitler, B.E., Sperling, M.: TwitterStand: news in tweets. In: SIGSPATIAL (2009)
Aji, A., et al.: Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. In: VLDB (2013)
Eldawy, A., Mokbel, M.F.: SpatialHadoop: a MapReduce framework for spatial data. In: ICDE (2015)
Nishimura, S., et al.: \({\cal{MD}}\)-HBase: design and implementation of an elastic data infrastructure for cloud-scale location services. DAPD 31(2), 289–319 (2013)
Nidzwetzki, J.K., Güting, R.H.: Distributed SECONDO: a highly available and scalable system for spatial data processing. In: Claramunt, C., Schneider, M., Wong, R.C.-W., Xiong, L., Loh, W.-K., Shahabi, C., Li, K.-J. (eds.) SSTD 2015. LNCS, vol. 9239, pp. 491–496. Springer, Cham (2015). doi:10.1007/978-3-319-22363-6_28
Fox, A., et al.: Spatio-temporal indexing in non-relational distributed databases. In: International Conference on Big Data (2013)
Yu, J., et al.: A demonstration of GeoSpark: a cluster computing framework for processing big spatial data. In: ICDE (2016)
Xie, D., et al.: Simba: efficient in-memory spatial analytics. In: SIGMOD (2016)
Whitman, R.T., et al.: Spatial indexing and analytics on Hadoop. In: SIGSPATIAL (2014)
Eldawy, A., et al.: Sphinx: distributed execution of interactive SQL queries on big spatial data (Poster). In: SIGSPATIAL (2015)
Kornacker, M., et al.: Impala: a modern, open-source SQL engine for Hadoop. In: CIDR (2015)
Wanderman-Milne, S., Li, N.: Runtime code generation in Cloudera Impala. IEEE Data Eng. Bull. 37(1), 31–37 (2014)
Floratou, A., et al.: SQL-on-Hadoop: full circle back to shared-nothing database architectures. In: PVLDB (2014)
Thusoo, A., et al.: Hive: a warehousing solution over a map-reduce framework. In: PVLDB (2009)
Armbrust, M., et al.: Spark SQL: relational data processing in Spark. In: SIGMOD (2015)
Schnitzer, B., Leutenegger, S.T.: Master-client R-trees: a new parallel R-tree architecture. In: SSDBM (1999)
DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. In: CACM (1992)
Eldawy, A., Alarabi, L., Mokbel, M.F.: Spatial partitioning techniques in SpatialHadoop. In: PVLDB (2015)
Yu, J., et al.: GeoSpark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL (2015)
Leutenegger, S., et al.: STR: a simple and efficient algorithm for R-tree packing. In: ICDE (1997)
den Bercken, J.V., et al.: The bulk index join: a generic approach to processing non-equijoins. In: ICDE (1999)
Patel, J., DeWitt, D.: Partition based spatial-merge join. In: SIGMOD (1996)
Dittrich, J.P., Seeger, B.: Data redundancy and duplicate detection in spatial join processing. In: ICDE (2000)
Brinkhoff, T., Kriegel, H., Seeger, B.: Efficient processing of spatial joins using R-trees. In: SIGMOD, pp. 237–246 (1993)
Arge, L., et al.: Scalable sweeping-based spatial join. In: VLDB (1998)
Zhang, S., et al.: SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER, pp. 1–8 (2009)
Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
Olston, C., et al.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD (2008)
Zaharia, M., et al.: Spark: cluster computing with working sets. In: HotCloud (2010)
Cary, A., Sun, Z., Hristidis, V., Rishe, N.: Experiences on processing spatial data with MapReduce. In: Winslett, M. (ed.) SSDBM 2009. LNCS, vol. 5566, pp. 302–319. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02279-1_24
Zhang, S., et al.: Spatial queries evaluation with MapReduce. In: GCC, pp. 287–292 (2009)
Ma, Q., Yang, B., Qian, W., Zhou, A.: Query processing of massive trajectory data based on MapReduce. In: CLOUDDB (2009)
Akdogan, A., et al.: Voronoi-based geospatial query processing with MapReduce. In: CLOUDCOM (2010)
You, S., et al.: Large-scale spatial join query processing in cloud. In: CLOUDDM (2015)
Stonebraker, M., et al.: SciDB: a database management system for applications with complex analytics. Comput. Sci. Eng. 15(3), 54–62 (2013)
Wang, G., et al.: Behavioral simulations in MapReduce. In: PVLDB (2010)
Lu, J., Güting, R.H.: Parallel SECONDO: boosting database engines with Hadoop. In: ICPADS (2012)
Acknowledgement
This work is supported in part by the National Science Foundation under Grants IIS-1525953, CNS-1512877, IIS-0952977, and IIS-1218168.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Eldawy, A., Sabek, I., Elganainy, M., Bakeer, A., Abdelmotaleb, A., Mokbel, M.F. (2017). Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Gertz, M., et al. (eds.) Advances in Spatial and Temporal Databases. SSTD 2017. Lecture Notes in Computer Science, vol. 10411. Springer, Cham. https://doi.org/10.1007/978-3-319-64367-0_4
DOI: https://doi.org/10.1007/978-3-319-64367-0_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64366-3
Online ISBN: 978-3-319-64367-0
eBook Packages: Computer Science (R0)