
Distributed Spatial and Spatio-Temporal Join on Apache Spark

Authors:
Randall T. Whitman (Esri, CA, USA), Bryan G. Marsh (Esri, CA, USA), Michael B. Park (Esri, CA, USA), Erik G. Hoel (Esri, CA, USA)

ACM Transactions on Spatial Algorithms and Systems, Volume 5, Issue 1, Article No. 6, pp. 1–28. https://doi.org/10.1145/3325135

Published: 27 June 2019


Abstract

The need to process extremely large volumes of spatial data effectively has led many organizations to adopt distributed processing frameworks. Apache Spark is one such open-source framework that is enjoying widespread adoption. Within this data space, most observational data (i.e., data collected by sensors, either moving or stationary) has a temporal component, or timestamp. For advanced analytics and insight, the temporal component becomes as important as the spatial and attribute components. In this article, we detail several variants of a spatial join operation that address spatial, temporal, and attribute-based joins. Our spatial join technique differs from other approaches in that it combines spatial, temporal, and attribute predicates in the join operator. In addition, our spatio-temporal join algorithm and implementation differ from others in that they run in a commercial off-the-shelf (COTS) application. The users of this functionality are assumed to be GIS analysts with little, if any, knowledge of the implementation details of spatio-temporal joins or distributed processing. They are comfortable using simple tools that do not provide the ability to tweak the configuration of the algorithm or processing environment. The spatio-temporal join algorithm behind the tool must always succeed, regardless of the characteristics of the input data (e.g., the data can be highly irregularly distributed, contain large numbers of coincident points, or be extremely large). These factors combine to place additional requirements on the algorithm that are rarely found in the traditional research environment. Our spatio-temporal join algorithm has shipped as part of GeoAnalytics Server [12], a component of the ArcGIS Enterprise platform, from version 10.5 onward.
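As a rough illustration of what it means to combine spatial, temporal, and attribute predicates in one join operator, the sketch below evaluates all three conditions together for each candidate pair. It is not the algorithm described in the article: it is a single-machine nested-loop join in plain Python, and the record type, field names, distance measure, and thresholds (`Observation`, `kind`, `max_dist`, `max_dt`) are assumptions made only for this example.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Observation:
    """A point observation; every field name here is hypothetical."""
    id: str
    x: float      # planar x coordinate, arbitrary units
    y: float      # planar y coordinate, arbitrary units
    t: float      # timestamp, in seconds
    kind: str     # an attribute value to join on


def matches(a, b, max_dist, max_dt):
    """Return True only when all three predicates hold for the pair."""
    near_in_space = hypot(a.x - b.x, a.y - b.y) <= max_dist
    near_in_time = abs(a.t - b.t) <= max_dt
    same_attribute = a.kind == b.kind
    return near_in_space and near_in_time and same_attribute


def naive_join(left, right, max_dist, max_dt):
    """Nested-loop join: compare every left record with every right record."""
    return [(a.id, b.id) for a in left for b in right
            if matches(a, b, max_dist, max_dt)]


left = [Observation("a1", 0.0, 0.0, 100.0, "bus")]
right = [
    Observation("b1", 3.0, 4.0, 130.0, "bus"),    # distance 5, 30 s apart, same kind
    Observation("b2", 3.0, 4.0, 500.0, "bus"),    # fails the temporal predicate
    Observation("b3", 30.0, 40.0, 110.0, "bus"),  # fails the spatial predicate
    Observation("b4", 1.0, 1.0, 100.0, "tram"),   # fails the attribute predicate
]

pairs = naive_join(left, right, max_dist=5.0, max_dt=60.0)
print(pairs)
```

Only the pair (a1, b1) satisfies all three predicates. A nested loop compares every pair of records, so it serves only to show the combined predicate; it says nothing about how a distributed implementation would partition or index the data.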

References

[1] David J. Abel, Beng Chin Ooi, Kian-Lee Tan, Robert Power, and Jeffrey X. Yu. 1995. Spatial join strategies in distributed spatial DBMS. In Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD’95). Springer-Verlag, London, UK, 348--367. http://dl.acm.org/citation.cfm?id=647224.718929
[2] Accumulo. 2012. Apache Accumulo. Retrieved May 17, 2018 from https://accumulo.apache.org.
[3] Hoang Vo, Ablimit Aji, and Fusheng Wang. 2014. SATO: A spatial data partitioning framework for scalable query processing. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, 545--548.
[4] Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. 2013. Hadoop GIS: A high performance spatial data warehousing system over mapreduce. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1009--1020.
[5] Furqan Baig, Mudit Mehrotra, Hoang Vo, Fusheng Wang, Joel Saltz, and Tahsin Kurc. 2015. SparkGIS: Efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Proceedings of the VLDB Workshop on Biomedical Data Management and Graph Online Querying (DMAH’15), Vol. 9579. Springer.
[6] Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, and Fusheng Wang. 2017. SparkGIS: Resource aware efficient in-memory spatial query processing. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 28, 10 pages.
[7] Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1996. Parallel processing of spatial joins using R-trees. In Proceedings of the 12th International Conference on Data Engineering (ICDE’96). IEEE Computer Society, Los Alamitos, CA, 258--265. http://dl.acm.org/citation.cfm?id=645481.655583
[8] J. P. Dittrich and Bernhard Seeger. 2000. Data redundancy and duplicate detection in spatial join processing. In Proceedings of the 16th International Conference on Data Engineering (ICDE’00). IEEE Computer Society, Los Alamitos, CA, 535. http://dl.acm.org/citation.cfm?id=846219.847395
[9] Zhenhong Du, Xianwei Zhao, Xinyue Ye, Jingwei Zhou, Feng Zhang, and Renyi Liu. 2017. An effective high-performance multiway spatial join algorithm with Spark. ISPRS Int. J. Geo-Inf. 6, 4 (Mar. 2017), 96.
[10] Ahmed Eldawy and Mohamed F. Mokbel. 2015. SpatialHadoop: A MapReduce framework for spatial data. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering (ICDE’15), Vol. 2015-May. IEEE Computer Society, 1352--1363.
[11] Esri. 2013. GIS Tools for Hadoop. Retrieved April 11, 2018 from https://github.com/Esri/gis-tools-for-hadoop.
[12] Esri. 2016. ArcGIS GeoAnalytics Server. Retrieved April 11, 2018 from http://server.arcgis.com/en/server/latest/get-started/windows/what-is-arcgis-geoanalytics-server-.htm.
[13] Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226--231. http://dl.acm.org/citation.cfm?id=3001460.3001507
[14] R. A. Finkel and J. L. Bentley. 1974. Quad trees: A data structure for retrieval on composite keys. Acta Inf. 4, 1 (Mar. 1974), 1--9.
[15] A. Fox, C. Eichelberger, J. Hughes, and S. Lyon. 2013. Spatio-temporal indexing in non-relational distributed databases. In Proceedings of the 2013 IEEE International Conference on Big Data. 291--299.
[16] Irene Gargantini. 1982. An effective way to represent quadtrees. Commun. ACM 25, 12 (Dec. 1982), 905--910.
[17] GeoMesa. 2017. GeoMesa Spark: Aggregating and Visualizing Data. Retrieved May 17, 2018 from http://www.geomesa.org/documentation/tutorials/shallow-join.html.
[18] Lars George. 2011. HBase: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
[19] Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD’84). ACM, New York, NY, 47--57.
[20] Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. 2017. The STARK framework for spatio-temporal data analytics on Spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and the Web (BTW’17).
[21] Gisli R. Hjaltason and Hanan Samet. 2002. Speeding up construction of PMR quadtree-based spatial indexes. VLDB J. 11, 2 (Oct. 2002), 109--137.
[22] Erik G. Hoel and Hanan Samet. 1994. Data-parallel spatial join algorithms. In Proceedings of the 1994 International Conference on Parallel Processing (ICPP’94), Vol. 3. IEEE Computer Society, Los Alamitos, CA, 227--234.
[23] Edwin H. Jacox and Hanan Samet. 2007. Spatial join techniques. ACM Trans. Database Syst. 32, 1, Article 7 (Mar. 2007).
[24] M. Kornacker and J. Erickson. 2012. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. Retrieved April 11, 2018 from http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real.
[25] Kalev Leetaru and Philip A. Schrodt. 2013. GDELT: Global data on events, language, and tone, 1979--2012. In Proceedings of the International Studies Association Annual Conference (2013).
[26] Nikos Mamoulis and Dimitris Papadias. 2001. Multiway spatial joins. ACM Trans. Database Syst. 26, 4 (Dec. 2001), 424--475.
[27] Randal C. Nelson and Hanan Samet. 1986. A consistent hierarchical representation for vector data. SIGGRAPH Comput. Graph. 20, 4 (Aug. 1986), 197--206.
[28] Gustavo Niemeyer. 2008. Geohash. Retrieved June 6, 2018 from https://en.wikipedia.org/wiki/Geohash.
[29] J. Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 38--71.
[30] OpenStreetMap. 2004. OpenStreetMap. Retrieved May 17, 2018 from https://openstreetmap.org.
[31] Jack A. Orenstein. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4 (1982), 150--157.
[32] GDELT Project. 2014. The GDELT Project. Retrieved May 3, 2018 from https://www.gdeltproject.org/.
[33] Mansour Raad. 2013. BigData Spatial Joins. Retrieved April 11, 2018 from http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html.
[34] Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos. 1987. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87). Morgan Kaufmann Publishers Inc., San Francisco, CA, 507--518. http://dl.acm.org/citation.cfm?id=645914.671636
[35] R. Sriharsha. 2015. Magellan: Geospatial Analytics on Spark. Retrieved May 1, 2018 from https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark.
[36] Mingjie Tang, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, and Walid G. Aref. 2016. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow. 9, 13 (Sep. 2016), 1565--1568.
[37] New York City Taxi and Limousine Commission. 2016. TLC Trip Record Data. Retrieved May 3, 2018 from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
[38] Patrick Valduriez and Georges Gardarin. 1984. Join and semijoin algorithms for a multiprocessor database machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133--161.
[39] Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
[40] Randall T. Whitman, Michael B. Park, Sarah M. Ambrose, and Erik G. Hoel. 2014. Spatial indexing and analytics on Hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, New York, NY, 73--82.
[41] Randall T. Whitman, Michael B. Park, Bryan G. Marsh, and Erik G. Hoel. 2017. Spatio-temporal join on Apache Spark. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 20, 10 pages.
[42] Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York, NY, 1071--1085.
[43] Simin You, Jianting Zhang, and Le Gruenwald. 2015. Large-scale spatial join query processing in Cloud. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (2015), 34--41.
[44] Jia Yu, Jinxuan Wu, and Mohamed Sarwat. 2015. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’15). ACM, New York, NY, Article 70, 4 pages.
[45] Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association, Berkeley, CA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113
[46] Renyi Liu, Feng Zhang, Zhenhong Du, Jingwei Zhou, and Xinyue Ye. 2016. A new design of high-performance large-scale GIS computing at a finer spatial granularity: A case study of spatial join with Spark for sustainability. Sustainability (2071-1050) 8, 9 (2016).
[47] S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. 2009. SJMR: Parallelizing spatial join with mapreduce on clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’09). IEEE, 1--8.
[48] Yunqin Zhong, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang, and Guihai Chen. 2012. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW’12). IEEE Computer Society, Los Alamitos, CA, 2085--2094.

Index Terms

Distributed Spatial and Spatio-Temporal Join on Apache Spark
1. Computing methodologies
  1. Distributed computing methodologies
    1. Distributed algorithms
2. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
        Join algorithms
  2. Information systems applications
    1. Spatial-temporal systems
      1. Geographic information systems


Published in
ACM Transactions on Spatial Algorithms and Systems Volume 5, Issue 1
Special Issue on SIGSPATIAL 2017
March 2019
146 pages
ISSN:2374-0353
EISSN:2374-0361
DOI:10.1145/3336122
Editor:
Walid G. Aref
Purdue University, USA
Issue’s Table of Contents
Copyright © 2019 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 27 June 2019
- Revised: 1 March 2019
- Accepted: 1 March 2019
- Received: 1 June 2018
Author Tags
HDFS
Hadoop
Spark
Spatial join
distributed processing
geospatial and spatiotemporal databases
spatio-temporal join
Qualifiers
- research-article
- Research
- Refereed
