Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data

  • Conference paper
Advances in Spatial and Temporal Databases (SSTD 2017)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10411)

Abstract

This paper presents Sphinx, a full-fledged open-source system for big spatial data that overcomes the limitations of existing systems by adopting a standard SQL interface and by providing a highly efficient core built inside the core of the Apache Impala system. Sphinx is composed of four main layers: query parser, indexer, query planner, and query executor. The query parser injects spatial data types and functions into the SQL interface of Sphinx. The indexer creates spatial indexes in Sphinx by adopting a two-layered index design. The query planner utilizes these indexes to construct efficient query plans for range query and spatial join operations. Finally, the query executor carries out these plans on big spatial datasets in a distributed cluster. A system prototype of Sphinx running on real datasets shows up to three orders of magnitude performance improvement over plain-vanilla Impala, SpatialHadoop, and PostGIS.
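
As a rough illustration of how a two-layered index can speed up a range query, the Python sketch below partitions points with a global layer and organizes each partition with a local layer. It is not Sphinx's implementation: the uniform grid used for the global layer, the x-sorted list used for the local layer, and every name in the code (`TwoLayerIndex`, `Partition`, `range_query`) are assumptions made for illustration only. Points are assumed to lie inside the given extent.

```python
import bisect
from dataclasses import dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]
Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Partition:
    mbr: Rect                                          # boundary kept in the global layer
    points: List[Point] = field(default_factory=list)  # local layer: sorted by x, then y


class TwoLayerIndex:
    """Hypothetical two-layer index: a uniform grid over sorted partitions."""

    def __init__(self, points: List[Point], extent: Rect, cells_per_side: int) -> None:
        xmin, ymin, xmax, ymax = extent
        n = cells_per_side
        w = (xmax - xmin) / n
        h = (ymax - ymin) / n
        # Global layer: one partition per grid cell, stored row by row.
        self.partitions = [
            Partition((xmin + i * w, ymin + j * h, xmin + (i + 1) * w, ymin + (j + 1) * h))
            for j in range(n)
            for i in range(n)
        ]
        # Assign each point to exactly one cell (clamped at the upper boundary).
        for p in points:
            i = min(int((p[0] - xmin) / w), n - 1)
            j = min(int((p[1] - ymin) / h), n - 1)
            self.partitions[j * n + i].points.append(p)
        # Local layer: sort each partition so x-ranges can be found by binary search.
        for part in self.partitions:
            part.points.sort()

    def range_query(self, q: Rect) -> Tuple[List[Point], int]:
        """Return the points inside q and the number of partitions scanned."""
        results: List[Point] = []
        scanned = 0
        for part in self.partitions:
            if not intersects(part.mbr, q):
                continue  # global layer prunes partitions that cannot match
            scanned += 1
            lo = bisect.bisect_left(part.points, (q[0], float("-inf")))
            hi = bisect.bisect_right(part.points, (q[2], float("inf")))
            results.extend(p for p in part.points[lo:hi] if q[1] <= p[1] <= q[3])
        return results, scanned
```

In this sketch the global layer decides which partitions are read at all, and the local layer limits the work done inside each one; a distributed system would apply the same pruning before scheduling work on cluster nodes.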

Notes

  1. The idea of Sphinx was first introduced as a poster in [13].

  2. The project home page is http://www.spatialworx.com/sphinx/ and the source code is available at https://github.com/gistic/SpatialImpala.

References

  1. Markram, H.: The blue brain project. Nat. Rev. Neurosci. 7(2), 153–160 (2006)

  2. Auchincloss, A., et al.: A review of spatial methods in epidemiology: 2000–2010. Annu. Rev. Public Health 33, 107–122 (2012)

  3. Faghmous, J., Kumar, V.: Spatio-temporal data mining for climate data: advances, challenges, and opportunities. In: Chu, W. (ed.) Data Mining and Knowledge Discovery for Big Data. Studies in Big Data, vol. 1, pp. 83–116. Springer, Heidelberg (2014). doi:10.1007/978-3-642-40837-3_3

  4. Sankaranarayanan, J., Samet, H., Teitler, B.E., Sperling, M.: TwitterStand: news in tweets. In: SIGSPATIAL (2009)

  5. Aji, A., et al.: Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. In: VLDB (2013)

  6. Eldawy, A., Mokbel, M.F.: SpatialHadoop: a MapReduce framework for spatial data. In: ICDE (2015)

  7. Nishimura, S., et al.: \({\cal{MD}}\)-HBase: design and implementation of an elastic data infrastructure for cloud-scale location services. DAPD 31(2), 289–319 (2013)

  8. Nidzwetzki, J.K., Güting, R.H.: Distributed SECONDO: a highly available and scalable system for spatial data processing. In: Claramunt, C., Schneider, M., Wong, R.C.-W., Xiong, L., Loh, W.-K., Shahabi, C., Li, K.-J. (eds.) SSTD 2015. LNCS, vol. 9239, pp. 491–496. Springer, Cham (2015). doi:10.1007/978-3-319-22363-6_28

  9. Fox, A., et al.: Spatio-temporal indexing in non-relational distributed databases. In: International Conference on Big Data (2013)

  10. Yu, J., et al.: A demonstration of GeoSpark: a cluster computing framework for processing big spatial data. In: ICDE (2016)

  11. Xie, D., et al.: Simba: efficient in-memory spatial analytics. In: SIGMOD, San Francisco, CA, June 2016

  12. Whitman, R.T., et al.: Spatial indexing and analytics on Hadoop. In: SIGSPATIAL (2014)

  13. Eldawy, A., et al.: Sphinx: distributed execution of interactive SQL queries on big spatial data (Poster). In: SIGSPATIAL (2015)

  14. Kornacker, M., et al.: Impala: a modern, open-source SQL engine for Hadoop. In: CIDR (2015)

  15. Wanderman-Milne, S., Li, N.: Runtime code generation in Cloudera Impala. IEEE Data Eng. Bull. 37(1), 31–37 (2014)

  16. Floratou, A., et al.: SQL-on-Hadoop: full circle back to shared-nothing database architectures. In: PVLDB (2014)

  17. Thusoo, A., et al.: Hive: a warehousing solution over a map-reduce framework. In: PVLDB (2009)

  18. Armbrust, M., et al.: Spark SQL: relational data processing in Spark. In: SIGMOD (2015)

  19. Schnitzer, B., Leutenegger, S.T.: Master-client R-trees: a new parallel R-tree architecture. In: SSDBM (1999)

  20. DeWitt, D., Gray, J.: Parallel database systems: the future of high performance database systems. In: CACM (1992)

  21. Eldawy, A., Alarabi, L., Mokbel, M.F.: Spatial partitioning techniques in SpatialHadoop. In: PVLDB (2015)

  22. Yu, J., et al.: GeoSpark: a cluster computing framework for processing large-scale spatial data. In: SIGSPATIAL (2015)

  23. Leutenegger, S., et al.: STR: a simple and efficient algorithm for R-tree packing. In: ICDE (1997)

  24. den Bercken, J.V., et al.: The bulk index join: a generic approach to processing non-equijoins. In: ICDE (1999)

  25. Patel, J., DeWitt, D.: Partition based spatial-merge join. In: SIGMOD (1996)

  26. Dittrich, J.P., Seeger, B.: Data redundancy and duplicate detection in spatial join processing. In: ICDE (2000)

  27. Brinkhoff, T., Kriegel, H., Seeger, B.: Efficient processing of spatial joins using R-trees. In: SIGMOD, pp. 237–246 (1993)

  28. Arge, L., et al.: Scalable sweeping-based spatial join. In: VLDB (1998)

  29. Zhang, S., et al.: SJMR: parallelizing spatial join with MapReduce on clusters. In: CLUSTER, pp. 1–8 (2009)

  30. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. Commun. ACM 51, 107–113 (2008)

  31. Olston, C., et al.: Pig Latin: a not-so-foreign language for data processing. In: SIGMOD (2008)

  32. Zaharia, M., et al.: Spark: cluster computing with working sets. In: HotCloud (2010)

  33. Cary, A., Sun, Z., Hristidis, V., Rishe, N.: Experiences on processing spatial data with MapReduce. In: Winslett, M. (ed.) SSDBM 2009. LNCS, vol. 5566, pp. 302–319. Springer, Heidelberg (2009). doi:10.1007/978-3-642-02279-1_24

  34. Zhang, S., et al.: Spatial queries evaluation with MapReduce. In: GCC, pp. 287–292 (2009)

  35. Ma, Q., Yang, B., Qian, W., Zhou, A.: Query processing of massive trajectory data based on MapReduce. In: CLOUDDB (2009)

  36. Akdogan, A., et al.: Voronoi-based geospatial query processing with MapReduce. In: CLOUDCOM (2010)

  37. You, S., et al.: Large-scale spatial join query processing in cloud. In: CLOUDDM (2015)

  38. Stonebraker, M., et al.: SciDB: a database management system for applications with complex analytics. Comput. Sci. Eng. 15(3), 54–62 (2013)

  39. Wang, G., et al.: Behavioral simulations in MapReduce. In: PVLDB (2010)

  40. Lu, J., Güting, R.H.: Parallel SECONDO: boosting database engines with Hadoop. In: ICPADS (2012)

Acknowledgement

This work is supported in part by the National Science Foundation under Grants IIS-1525953, CNS-1512877, IIS-0952977, and IIS-1218168.

Author information

Corresponding author

Correspondence to Ahmed Eldawy.

Copyright information

© 2017 Springer International Publishing AG

About this paper

Cite this paper

Eldawy, A., Sabek, I., Elganainy, M., Bakeer, A., Abdelmotaleb, A., Mokbel, M.F. (2017). Sphinx: Empowering Impala for Efficient Execution of SQL Queries on Big Spatial Data. In: Gertz, M., et al. Advances in Spatial and Temporal Databases. SSTD 2017. Lecture Notes in Computer Science, vol 10411. Springer, Cham. https://doi.org/10.1007/978-3-319-64367-0_4

  • DOI: https://doi.org/10.1007/978-3-319-64367-0_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-64366-3

  • Online ISBN: 978-3-319-64367-0

  • eBook Packages: Computer Science, Computer Science (R0)
