research-article

Distributed Spatial and Spatio-Temporal Join on Apache Spark

Published: 27 June 2019

Abstract

The need to process extremely large volumes of spatial data effectively has led many organizations to adopt distributed processing frameworks. Apache Spark is one such open-source framework that is enjoying widespread adoption. Within this data space, it is important to note that most observational data (i.e., data collected by sensors, either moving or stationary) has a temporal component or timestamp. To perform advanced analytics and gain insights, the temporal component becomes as important as the spatial and attribute components. In this article, we detail several variants of a spatial join operation that address spatial, temporal, and attribute-based joins. Our spatial join technique differs from other approaches in that it combines spatial, temporal, and attribute predicates in the join operator. In addition, our spatio-temporal join algorithm and implementation differ from others in that they run in a commercial off-the-shelf (COTS) application. The users of this functionality are assumed to be GIS analysts with little if any knowledge of the implementation details of spatio-temporal joins or distributed processing. They are comfortable using simple tools that do not provide the ability to tweak the configuration of the algorithm or processing environment. The spatio-temporal join algorithm behind the tool must always succeed, regardless of the characteristics of the input data (e.g., the data can be highly irregularly distributed, contain large numbers of coincident points, or be extremely large). These factors combine to place additional requirements on the algorithm that are not commonly found in the traditional research environment. Our spatio-temporal join algorithm was shipped as part of the GeoAnalytics Server [12], part of the ArcGIS Enterprise platform from version 10.5 onward.
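To make the idea of a single join operator with combined predicates concrete, the following is a minimal sketch in plain Python. It is not the authors' algorithm and says nothing about how the Spark implementation partitions or indexes data: it is a naive nested-loop join, and the `Observation` type, the `matches` and `naive_join` functions, the distance and time thresholds, and the sample records are all hypothetical names and values chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Observation:
    """Hypothetical record: planar coordinates, a timestamp, one attribute."""
    x: float
    y: float
    t: float        # timestamp in seconds (assumed unit)
    category: str   # attribute used by the attribute predicate


def matches(a, b, max_dist, max_dt):
    """Return True only if all three predicates hold for the pair (a, b)."""
    near_in_space = hypot(a.x - b.x, a.y - b.y) <= max_dist  # spatial predicate
    near_in_time = abs(a.t - b.t) <= max_dt                  # temporal predicate
    same_attribute = a.category == b.category                # attribute predicate
    return near_in_space and near_in_time and same_attribute


def naive_join(left, right, max_dist, max_dt):
    """Nested-loop join: compares every pair, so it is O(len(left) * len(right))."""
    return [(a, b) for a in left for b in right if matches(a, b, max_dist, max_dt)]


left = [
    Observation(0.0, 0.0, 100.0, "taxi"),
    Observation(5.0, 5.0, 100.0, "taxi"),
]
right = [
    Observation(0.5, 0.0, 130.0, "taxi"),  # close in space and time, same attribute
    Observation(0.5, 0.0, 130.0, "bus"),   # fails the attribute predicate
    Observation(0.5, 0.0, 900.0, "taxi"),  # fails the temporal predicate
]

pairs = naive_join(left, right, max_dist=1.0, max_dt=60.0)
for a, b in pairs:
    print(a, b)
```

A pair is emitted only when the spatial, temporal, and attribute conditions all hold. A distributed implementation would avoid the all-pairs comparison, which is impractical for the data volumes the abstract describes.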

References

  1. David J. Abel, Beng Chin Ooi, Kian-Lee Tan, Robert Power, and Jeffrey X. Yu. 1995. Spatial join strategies in distributed spatial DBMS. In Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD’95). Springer-Verlag, London, UK, 348--367. http://dl.acm.org/citation.cfm?id=647224.718929
  2. Accumulo. 2012. Apache Accumulo. Retrieved May 17, 2018 from https://accumulo.apache.org.
  3. Hoang Vo, Ablimit Aji, and Fusheng Wang. 2014. SATO: A spatial data partitioning framework for scalable query processing. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, 545--548.
  4. Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. 2013. Hadoop GIS: A high performance spatial data warehousing system over mapreduce. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1009--1020.
  5. Furqan Baig, Mudit Mehrotra, Hoang Vo, Fusheng Wang, Joel Saltz, and Tahsin Kurc. 2015. SparkGIS: Efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Proceedings of the VLDB Workshop on Biomedical Data Management and Graph Online Querying (DMAH’15), Vol. 9579. Springer.
  6. Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, and Fusheng Wang. 2017. SparkGIS: Resource aware efficient in-memory spatial query processing. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 28, 10 pages.
  7. Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1996. Parallel processing of spatial joins using R-trees. In Proceedings of the 12th International Conference on Data Engineering (ICDE’96). IEEE Computer Society, Los Alamitos, CA, 258--265. http://dl.acm.org/citation.cfm?id=645481.655583
  8. J. P. Dittrich and Bernhard Seeger. 2000. Data redundancy and duplicate detection in spatial join processing. In Proceedings of the 16th International Conference on Data Engineering (ICDE’00). IEEE Computer Society, Los Alamitos, CA, 535. http://dl.acm.org/citation.cfm?id=846219.847395.
  9. Zhenhong Du, Xianwei Zhao, Xinyue Ye, Jingwei Zhou, Feng Zhang, and Renyi Liu. 2017. An effective high-performance multiway spatial join algorithm with Spark. ISPRS Int. J. Geo-Inf. 6, 4 (Mar. 2017), 96.
  10. Ahmed Eldawy and Mohamed F. Mokbel. 2015. SpatialHadoop: A MapReduce framework for spatial data. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering (ICDE’15), Vol. 2015-May. IEEE Computer Society, 1352--1363.
  11. Esri. 2013. GIS Tools for Hadoop. Retrieved April 11, 2018 from https://github.com/Esri/gis-tools-for-hadoop.
  12. Esri. 2016. ArcGIS GeoAnalytics Server. Retrieved April 11, 2018 from http://server.arcgis.com/en/server/latest/get-started/windows/what-is-arcgis-geoanalytics-server-.htm.
  13. Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226--231. http://dl.acm.org/citation.cfm?id=3001460.3001507.
  14. R. A. Finkel and J. L. Bentley. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4, 1 (Mar. 1974), 1--9.
  15. A. Fox, C. Eichelberger, J. Hughes, and S. Lyon. 2013. Spatio-temporal indexing in non-relational distributed databases. In Proceedings of the 2013 IEEE International Conference on Big Data. 291--299.
  16. Irene Gargantini. 1982. An effective way to represent quadtrees. Commun. ACM 25, 12 (Dec. 1982), 905--910.
  17. GeoMesa. 2017. GeoMesa Spark: Aggregating and Visualizing Data. Retrieved May 17, 2018 from http://www.geomesa.org/documentation/tutorials/shallow-join.html.
  18. Lars George. 2011. HBase: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
  19. Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD’84). ACM, New York, NY, 47--57.
  20. Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. 2017. The STARK framework for spatio-temporal data analytics on Spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and the Web (BTW’17).
  21. Gisli R. Hjaltason and Hanan Samet. 2002. Speeding up construction of PMR quadtree-based spatial indexes. VLDB J. 11, 2 (Oct. 2002), 109--137.
  22. Erik G. Hoel and Hanan Samet. 1994. Data-parallel spatial join algorithms. In Proceedings of the 1994 International Conference on Parallel Processing (ICPP’94), Vol. 3. IEEE Computer Society, Los Alamitos, CA, 227--234.
  23. Edwin H. Jacox and Hanan Samet. 2007. Spatial join techniques. ACM Trans. Database Syst. 32, 1, Article 7 (Mar. 2007).
  24. M. Kornacker and J. Erickson. 2012. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. Retrieved April 11, 2018 from http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real.
  25. Kalev Leetaru and Philip A. Schrodt. 2013. GDELT: Global data on events, language, and tone, 1979--2012. In Proceedings of the International Studies Association Annual Conference (2013).
  26. Nikos Mamoulis and Dimitris Papadias. 2001. Multiway spatial joins. ACM Trans. Database Syst. 26, 4 (Dec. 2001), 424--475.
  27. Randal C. Nelson and Hanan Samet. 1986. A consistent hierarchical representation for vector data. SIGGRAPH Comput. Graph. 20, 4 (Aug. 1986), 197--206.
  28. Gustavo Niemeyer. 2008. Geohash. Retrieved June 6, 2018 from https://en.wikipedia.org/wiki/Geohash.
  29. J. Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 38--71.
  30. OpenStreetMap. 2004. OpenStreetMap. Retrieved May 17, 2018 from https://openstreetmap.org.
  31. Jack A. Orenstein. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4 (1982), 150--157.
  32. GDELT Project. 2014. The GDELT Project. Retrieved May 3, 2018 from https://www.gdeltproject.org/.
  33. Mansour Raad. 2013. BigData Spatial Joins. Retrieved April 11, 2018 from http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html.
  34. Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos. 1987. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87). Morgan Kaufmann Publishers Inc., San Francisco, CA, 507--518. http://dl.acm.org/citation.cfm?id=645914.671636
  35. R. Sriharsha. 2015. Magellan: Geospatial Analytics on Spark. Retrieved May 1, 2018 from https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark.
  36. Mingjie Tang, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, and Walid G. Aref. 2016. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow. 9, 13 (Sep. 2016), 1565--1568.
  37. New York City Taxi and Limousine Commission. 2016. TLC Trip Record Data. Retrieved May 3, 2018 from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
  38. Patrick Valduriez and Georges Gardarin. 1984. Join and semijoin algorithms for a multiprocessor database machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133--161.
  39. Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
  40. Randall T. Whitman, Michael B. Park, Sarah M. Ambrose, and Erik G. Hoel. 2014. Spatial indexing and analytics on Hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, New York, NY, 73--82.
  41. Randall T. Whitman, Michael B. Park, Bryan G. Marsh, and Erik G. Hoel. 2017. Spatio-temporal join on Apache Spark. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 20, 10 pages.
  42. Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York, NY, 1071--1085.
  43. Simin You, Jianting Zhang, and Le Gruenwald. 2015. Large-scale spatial join query processing in Cloud. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (2015), 34--41.
  44. Jia Yu, Jinxuan Wu, and Mohamed Sarwat. 2015. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’15). ACM, New York, NY, Article 70, 4 pages.
  45. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association, Berkeley, CA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113
  46. Renyi Liu, Feng Zhang, Zhenhong Du, Jingwei Zhou, and Xinyue Ye. 2016. A new design of high-performance large-scale GIS computing at a finer spatial granularity: A case study of spatial join with spark for sustainability. Sustainability (2071-1050) 8, 9 (2016).
  47. S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. 2009. SJMR: Parallelizing spatial join with mapreduce on clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’09). IEEE, 1--8.
  48. Yunqin Zhong, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang, and Guihai Chen. 2012. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW’12). IEEE Computer Society, Los Alamitos, CA, 2085--2094.

        • Published in

          ACM Transactions on Spatial Algorithms and Systems  Volume 5, Issue 1
          Special Issue on SIGSPATIAL 2017
          March 2019
          146 pages
          ISSN:2374-0353
          EISSN:2374-0361
          DOI:10.1145/3336122

          Copyright © 2019 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States

          Publication History

          • Published: 27 June 2019
          • Revised: 1 March 2019
          • Accepted: 1 March 2019
          • Received: 1 June 2018

          Qualifiers

          • research-article
          • Research
          • Refereed
