Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data

Eldawy, Ahmed; Sabek, Ibrahim; Elganainy, Mostafa; Bakeer, Ammar; Abdelmotaleb, Ahmed; Mokbel, Mohamed F.

doi:10.1007/978-3-319-64367-0_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10411)

Included in the following conference series:

International Symposium on Spatial and Temporal Databases

Abstract

This paper presents Sphinx, a full-fledged open-source system for big spatial data that overcomes the limitations of existing systems by adopting a standard SQL interface and by providing a highly efficient core built inside the core of the Apache Impala system. Sphinx is composed of four main layers: query parser, indexer, query planner, and query executor. The query parser injects spatial data types and functions into the SQL interface of Sphinx. The indexer creates spatial indexes in Sphinx by adopting a two-layered index design. The query planner utilizes these indexes to construct efficient query plans for range query and spatial join operations. Finally, the query executor carries out these plans on big spatial datasets in a distributed cluster. A system prototype of Sphinx running on real datasets shows up to three orders of magnitude performance improvement over plain-vanilla Impala, SpatialHadoop, and PostGIS.
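
To make the idea of a two-layered index concrete, the following is a minimal single-machine sketch in Python. It is not the Sphinx implementation: the uniform grid used for the global layer, the x-sorted list used for the local layer, and all names (`Partition`, `build_index`, `range_query`) are assumptions chosen for illustration only. It shows how a range query can first prune whole partitions by their bounding rectangles and then search only inside the partitions that remain.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if two closed rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Partition:
    mbr: Rect                                          # entry in the global layer
    points: List[Point] = field(default_factory=list)  # local layer, sorted by x


def build_index(points: List[Point], cell_size: float) -> List[Partition]:
    """Hypothetical indexer: assign points to grid cells, then sort each cell."""
    cells: Dict[Tuple[int, int], List[Point]] = {}
    for p in points:
        key = (int(p[0] // cell_size), int(p[1] // cell_size))
        cells.setdefault(key, []).append(p)
    partitions = []
    for (cx, cy), pts in cells.items():
        mbr = (cx * cell_size, cy * cell_size,
               (cx + 1) * cell_size, (cy + 1) * cell_size)
        partitions.append(Partition(mbr, sorted(pts)))
    return partitions


def range_query(partitions: List[Partition], query: Rect) -> List[Point]:
    """Return all indexed points that fall inside the closed query rectangle."""
    result: List[Point] = []
    for part in partitions:
        # Global layer: skip partitions whose rectangle misses the query.
        if not intersects(part.mbr, query):
            continue
        # Local layer: binary search on x, then filter the candidates on y.
        lo = bisect_left(part.points, (query[0], float("-inf")))
        hi = bisect_right(part.points, (query[2], float("inf")))
        result.extend(p for p in part.points[lo:hi]
                      if query[1] <= p[1] <= query[3])
    return sorted(result)


if __name__ == "__main__":
    pts = [(1.0, 1.0), (2.0, 5.0), (7.0, 3.0), (12.0, 12.0), (15.0, 2.0)]
    index = build_index(pts, cell_size=10.0)
    print(range_query(index, (0.0, 0.0, 8.0, 4.0)))
```

In a distributed setting, each partition would typically live on a different node, so pruning at the global layer avoids touching most of the data; the sketch keeps everything in one process only to stay short.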

Notes

1. The idea of Sphinx was first introduced as a poster [13].
2. The project home page is http://www.spatialworx.com/sphinx/ and the source code is available at https://github.com/gistic/SpatialImpala.

References

1. Markram, H.: The blue brain project. Nat. Rev. Neurosci. 7(2), 153–160 (2006)
2. Auchincloss, A., et al.: A review of spatial methods in epidemiology: 2000–2010. Annu. Rev. Public Health 33, 107–122 (2012)
3. Faghmous, J., Kumar, V.: Spatio-temporal data mining for climate data: advances, challenges, and opportunities. In: Chu, W. (ed.) Data Mining and Knowledge Discovery for Big Data. Studies in Big Data, vol. 1, pp. 83–116. Springer, Heidelberg (2014). doi:10.1007/978-3-642-40837-3_3
4. Sankaranarayanan, J., Samet, H., Teitler, B.E., Sperling, M.: TwitterStand: news in tweets. In: SIGSPATIAL (2009)
5. Aji, A., et al.: Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. In: VLDB (2013)
6. Eldawy, A., Mokbel, M.F.: SpatialHadoop: a MapReduce framework for spatial data. In: ICDE (2015)
7. Nishimura, S., et al.: MD-HBase: design and implementation of an elastic data infrastructure for cloud-scale location services. DAPD 31(2), 289–319 (2013)
8. Nidzwetzki, J.K., Güting, R.H.: Distributed SECONDO: a highly available and scalable system for spatial data processing. In: Claramunt, C., Schneider, M., Wong, R.C.-W., Xiong, L., Loh, W.-K., Shahabi, C., Li, K.-J. (eds.) SSTD 2015. LNCS, vol. 9239, pp. 491–496. Springer, Cham (2015). doi:10.1007/978-3-319-22363-6_28
9. Fox, A., et al.: Spatio-temporal indexing in non-relational distributed databases. In: International Conference on Big Data (2013)
10. Yu, J., et al.: A demonstration of GeoSpark: a cluster computing framework for processing big spatial data. In: ICDE (2016)
11. Xie, D., et al.: Simba: efficient in-memory spatial analytics. In: SIGMOD, San Francisco, CA, June 2016
12. Whitman, R.T., et al.: Spatial indexing and analytics on Hadoop. In: SIGSPATIAL (2014)
13. Eldawy, A., et al.: Sphinx: distributed execution of interactive SQL queries on big spatial data (Poster). In: SIGSPATIAL (2015)
14. Kornacker, M., et al.: Impala: a modern, open-source SQL engine for Hadoop. In: CIDR (2015)
15. Wanderman-Milne, S., Li, N.: Runtime code generation in Cloudera Impala. IEEE Data Eng. Bull. 37(1), 31–37 (2014)
16. Floratou, A., et al.: SQL-on-Hadoop: full circle back to shared-nothing database architectures. In: PVLDB (2014)
17. Thusoo, A., et al.: Hive: a warehousing solution over a map-reduce framework. In: PVLDB (2009)
18. Armbrust, M., et al.: Spark SQL: relational data processing in Spark. In: SIGMOD (2015)
19. Schnitzer, B., Leutenegger, S.T.: Master-client R-trees: a new parallel R-tree architecture. In: SSDBM (1999)
20. DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. In: CACM (1992)
21. Eldawy, A., Alarabi, L., Mokbel, M.F.: Spatial partitioning techniques in SpatialHadoop. In: PVLDB (2015)
22. Yu, J., et al.: GeoSpark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL (2015)
23. Leutenegger, S., et al.: STR: a simple and efficient algorithm for R-tree packing. In: ICDE (1997)
24. den Bercken, J.V., et al.: The bulk index join: a generic approach to processing non-equijoins. In: ICDE (1999)
25. Patel, J., DeWitt, D.: Partition based spatial-merge join. In: SIGMOD (1996)
26. Dittrich, J.P., Seeger, B.: Data redundancy and duplicate detection in spatial join processing. In: ICDE (2000)
27. Brinkhoff, T., Kriegel, H., Seeger, B.: Efficient processing of spatial joins using R-trees. In: SIGMOD, pp. 237–246 (1993)
28. Arge, L., et al.: Scalable sweeping-based spatial join. In: VLDB (1998)
29. Zhang, S., et al.: SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER, pp. 1–8 (2009)
30. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)
31. Olston, C., et al.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD (2008)
32. Zaharia, M., et al.: Spark: cluster computing with working sets. In: HotCloud (2010)
33. Cary, A., Sun, Z., Hristidis, V., Rishe, N.: Experiences on processing spatial data with MapReduce. In: Winslett, M. (ed.) SSDBM 2009. LNCS, vol. 5566, pp. 302–319. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02279-1_24
34. Zhang, S., et al.: Spatial queries evaluation with MapReduce. In: GCC, pp. 287–292 (2009)
35. Ma, Q., Yang, B., Qian, W., Zhou, A.: Query processing of massive trajectory data based on MapReduce. In: CLOUDDB (2009)
36. Akdogan, A., et al.: Voronoi-based geospatial query processing with MapReduce. In: CLOUDCOM (2010)
37. You, S., et al.: Large-scale spatial join query processing in cloud. In: CLOUDDM (2015)
38. Stonebraker, M., et al.: SciDB: a database management system for applications with complex analytics. Comput. Sci. Eng. 15(3), 54–62 (2013)
39. Wang, G., et al.: Behavioral simulations in MapReduce. In: PVLDB (2010)
40. Lu, J., Güting, R.H.: Parallel SECONDO: boosting database engines with Hadoop. In: ICPADS (2012)

Acknowledgement

This work is supported in part by the National Science Foundation under Grants IIS-1525953, CNS-1512877, IIS-0952977, and IIS-1218168.

Author information

Authors and Affiliations

University of California, Riverside, USA
Ahmed Eldawy
University of Minnesota, Twin Cities, USA
Ibrahim Sabek & Mohamed F. Mokbel
KACST GIS Technology Innovation Center, Mecca, Saudi Arabia
Mostafa Elganainy, Ammar Bakeer & Ahmed Abdelmotaleb

Corresponding author

Correspondence to Ahmed Eldawy.

Editor information

Editors and Affiliations

Institute of Computer Science, Heidelberg University, Heidelberg, Germany
Michael Gertz
George Mason University, Fairfax, Virginia, USA
Matthias Renz
University of Queensland, Brisbane, Queensland, Australia
Xiaofang Zhou
ESRI, University of Minnesota, Minneapolis, Minnesota, USA
Erik Hoel
Auburn University, Auburn, Alabama, USA
Wei-Shinn Ku
Free University of Berlin, Dahlem, Berlin, Germany
Agnes Voisard
Microsoft, Redmond, Washington, USA
Chengyang Zhang
California State University, Sacramento, California, USA
Haiquan Chen
LinkedIn, Sunnyvale, California, USA
Liang Tang
University of North Texas, Denton, Texas, USA
Yan Huang
Virginia Tech, Falls Church, Virginia, USA
Chang-Tien Lu
Oracle, Redwood Shores, California, USA
Siva Ravada

About this paper

Cite this paper

Eldawy, A., Sabek, I., Elganainy, M., Bakeer, A., Abdelmotaleb, A., Mokbel, M.F. (2017). Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Gertz, M., et al. Advances in Spatial and Temporal Databases. SSTD 2017. Lecture Notes in Computer Science, vol 10411. Springer, Cham. https://doi.org/10.1007/978-3-319-64367-0_4

DOI: https://doi.org/10.1007/978-3-319-64367-0_4
Published: 22 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-64366-3
Online ISBN: 978-3-319-64367-0
eBook Packages: Computer Science, Computer Science (R0)
