Abstract
The need to process extremely large volumes of spatial data effectively has led many organizations to employ distributed processing frameworks. Apache Spark is one such open-source framework that is enjoying widespread adoption. Within this data space, it is important to note that most observational data (i.e., data collected by sensors, either moving or stationary) has a temporal component or timestamp. To perform advanced analytics and gain insights, the temporal component becomes as important as the spatial and attribute components. In this article, we detail several variants of a spatial join operation that address spatial, temporal, and attribute-based joins. Our spatial join technique differs from other approaches in that it combines spatial, temporal, and attribute predicates in the join operator. In addition, our spatio-temporal join algorithm and implementation differ from others in that they run in a commercial off-the-shelf (COTS) application. The users of this functionality are assumed to be GIS analysts with little if any knowledge of the implementation details of spatio-temporal joins or distributed processing. They are comfortable using simple tools that do not provide the ability to tweak the configuration of the algorithm or processing environment. The spatio-temporal join algorithm behind the tool must always succeed, regardless of the characteristics of the input data (e.g., the data can be highly irregularly distributed, contain large numbers of coincident points, or be extremely large). These factors combine to place additional requirements on the algorithm that are not commonly found in the traditional research environment. Our spatio-temporal join algorithm was shipped as part of the GeoAnalytics Server [12], part of the ArcGIS Enterprise platform from version 10.5 onward.
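To make the idea of a join operator that combines spatial, temporal, and attribute predicates concrete, the following is a minimal single-machine sketch. It is not the algorithm described in the article and does not use Spark: it is a naive nested loop, and the `Observation` type, its field names, the distance and time thresholds, and the sample data are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from math import hypot


@dataclass(frozen=True)
class Observation:
    """A timestamped point with one attribute (hypothetical type and field names)."""
    x: float        # planar x coordinate, in metres
    y: float        # planar y coordinate, in metres
    t: float        # timestamp, in seconds
    category: str   # attribute compared by the attribute predicate


def spatio_temporal_join(left, right, max_dist, max_dt):
    """Return the pairs (a, b) that satisfy all three predicates at once.

    Naive nested loop, O(len(left) * len(right)). A distributed version
    would partition the data first, which this sketch does not attempt.
    """
    matches = []
    for a in left:
        for b in right:
            near_in_space = hypot(a.x - b.x, a.y - b.y) <= max_dist
            near_in_time = abs(a.t - b.t) <= max_dt
            same_attribute = a.category == b.category
            if near_in_space and near_in_time and same_attribute:
                matches.append((a, b))
    return matches


# Made-up sample data: two "sensor" records joined against four "event" records.
sensors = [
    Observation(0.0, 0.0, 100.0, "taxi"),
    Observation(50.0, 0.0, 100.0, "bus"),
]
events = [
    Observation(3.0, 4.0, 130.0, "taxi"),    # 5 m and 30 s from the taxi sensor
    Observation(3.0, 4.0, 900.0, "taxi"),    # close in space, too far in time
    Observation(500.0, 0.0, 110.0, "taxi"),  # close in time, too far in space
    Observation(50.0, 1.0, 105.0, "taxi"),   # near the bus sensor, attribute differs
]

pairs = spatio_temporal_join(sensors, events, max_dist=10.0, max_dt=60.0)
for a, b in pairs:
    print(a, "<->", b)
```

Evaluating the three predicates together means that a pair is kept only if it passes every test, so a record that is close in space but far in time, or that differs in the attribute, never reaches the output.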
References
- David J. Abel, Beng Chin Ooi, Kian-Lee Tan, Robert Power, and Jeffrey X. Yu. 1995. Spatial join strategies in distributed spatial DBMS. In Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD’95). Springer-Verlag, London, UK, 348--367. http://dl.acm.org/citation.cfm?id=647224.718929.
- Accumulo. 2012. Apache Accumulo. Retrieved May 17, 2018 from https://accumulo.apache.org.
- Hoang Vo, Ablimit Aji, and Fusheng Wang. 2014. SATO: A spatial data partitioning framework for scalable query processing. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, 545--548.
- Ablimit Aji, Fusheng Wang, Hoang Vo, Rubao Lee, Qiaoling Liu, Xiaodong Zhang, and Joel Saltz. 2013. Hadoop GIS: A high performance spatial data warehousing system over mapreduce. Proc. VLDB Endow. 6, 11 (Aug. 2013), 1009--1020.
- Furqan Baig, Mudit Mehrotra, Hoang Vo, Fusheng Wang, Joel Saltz, and Tahsin Kurc. 2015. SparkGIS: Efficient comparison and evaluation of algorithm results in tissue image analysis studies. In Proceedings of the VLDB Workshop on Biomedical Data Management and Graph Online Querying (DMAH’15), Vol. 9579. Springer.
- Furqan Baig, Hoang Vo, Tahsin Kurc, Joel Saltz, and Fusheng Wang. 2017. SparkGIS: Resource aware efficient in-memory spatial query processing. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 28, 10 pages.
- Thomas Brinkhoff, Hans-Peter Kriegel, and Bernhard Seeger. 1996. Parallel processing of spatial joins using R-trees. In Proceedings of the 12th International Conference on Data Engineering (ICDE’96). IEEE Computer Society, Los Alamitos, CA, 258--265. http://dl.acm.org/citation.cfm?id=645481.655583.
- J. P. Dittrich and Bernhard Seeger. 2000. Data redundancy and duplicate detection in spatial join processing. In Proceedings of the 16th International Conference on Data Engineering (ICDE’00). IEEE Computer Society, Los Alamitos, CA, 535. http://dl.acm.org/citation.cfm?id=846219.847395.
- Zhenhong Du, Xianwei Zhao, Xinyue Ye, Jingwei Zhou, Feng Zhang, and Renyi Liu. 2017. An effective high-performance multiway spatial join algorithm with Spark. ISPRS Int. J. Geo-Inf. 6, 4 (Mar. 2017), 96.
- Ahmed Eldawy and Mohamed F. Mokbel. 2015. SpatialHadoop: A MapReduce framework for spatial data. In Proceedings of the 2015 IEEE 31st International Conference on Data Engineering (ICDE’15), Vol. 2015-May. IEEE Computer Society, 1352--1363.
- Esri. 2013. GIS Tools for Hadoop. Retrieved April 11, 2018 from https://github.com/Esri/gis-tools-for-hadoop.
- Esri. 2016. ArcGIS GeoAnalytics Server. Retrieved April 11, 2018 from http://server.arcgis.com/en/server/latest/get-started/windows/what-is-arcgis-geoanalytics-server-.htm.
- Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu. 1996. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD’96). AAAI Press, 226--231. http://dl.acm.org/citation.cfm?id=3001460.3001507.
- R. A. Finkel and J. L. Bentley. 1974. Quad trees a data structure for retrieval on composite keys. Acta Inf. 4, 1 (01 Mar 1974), 1--9.
- A. Fox, C. Eichelberger, J. Hughes, and S. Lyon. 2013. Spatio-temporal indexing in non-relational distributed databases. In Proceedings of the 2013 IEEE International Conference on Big Data. 291--299.
- Irene Gargantini. 1982. An effective way to represent quadtrees. Commun. ACM 25, 12 (Dec. 1982), 905--910.
- GeoMesa. 2017. GeoMesa Spark: Aggregating and Visualizing Data. Retrieved May 17, 2018 from http://www.geomesa.org/documentation/tutorials/shallow-join.html.
- Lars George. 2011. HBase: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
- Antonin Guttman. 1984. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD’84). ACM, New York, NY, 47--57.
- Stefan Hagedorn, Philipp Götze, and Kai-Uwe Sattler. 2017. The STARK framework for spatio-temporal data analytics on Spark. In Proceedings of the 17th Conference on Database Systems for Business, Technology, and the Web (BTW’17).
- Gisli R. Hjaltason and Hanan Samet. 2002. Speeding up construction of PMR quadtree-based spatial indexes. VLDB J. 11, 2 (Oct. 2002), 109--137.
- Erik G. Hoel and Hanan Samet. 1994. Data-parallel spatial join algorithms. In Proceedings of the 1994 International Conference on Parallel Processing (ICPP’94), Vol. 3. IEEE Computer Society, Los Alamitos, CA, 227--234.
- Edwin H. Jacox and Hanan Samet. 2007. Spatial join techniques. ACM Trans. Database Syst. 32, 1, Article 7 (Mar. 2007).
- M. Kornacker and J. Erickson. 2012. Cloudera Impala: Real Time Queries in Apache Hadoop, For Real. Retrieved April 11, 2018 from http://blog.cloudera.com/blog/2012/10/cloudera-impala-real-time-queries-in-apache-hadoop-for-real.
- Kalev Leetaru and Philip A. Schrodt. 2013. GDELT: Global data on events, language, and tone, 1979--2012. In Proceedings of the International Studies Association Annual Conference (2013).
- Nikos Mamoulis and Dimitris Papadias. 2001. Multiway spatial joins. ACM Trans. Database Syst. 26, 4 (Dec. 2001), 424--475.
- Randal C. Nelson and Hanan Samet. 1986. A consistent hierarchical representation for vector data. SIGGRAPH Comput. Graph. 20, 4 (Aug. 1986), 197--206.
- Gustavo Niemeyer. 2008. Geohash. Retrieved June 6, 2018 from https://en.wikipedia.org/wiki/Geohash.
- J. Nievergelt, Hans Hinterberger, and Kenneth C. Sevcik. 1984. The grid file: An adaptable, symmetric multikey file structure. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 38--71.
- OpenStreetMap. 2004. OpenStreetMap. Retrieved May 17, 2018 from https://openstreetmap.org.
- Jack A. Orenstein. 1982. Multidimensional tries used for associative searching. Inf. Process. Lett. 14, 4 (1982), 150--157.
- GDELT Project. 2014. The GDELT Project. Retrieved May 3, 2018 from https://www.gdeltproject.org/.
- Mansour Raad. 2013. BigData Spatial Joins. Retrieved April 11, 2018 from http://thunderheadxpler.blogspot.com/2013/10/bigdata-spatial-joins.html.
- Timos K. Sellis, Nick Roussopoulos, and Christos Faloutsos. 1987. The R+-tree: A dynamic index for multi-dimensional objects. In Proceedings of the 13th International Conference on Very Large Data Bases (VLDB’87). Morgan Kaufmann Publishers Inc., San Francisco, CA, 507--518. http://dl.acm.org/citation.cfm?id=645914.671636.
- R. Sriharsha. 2015. Magellan: Geospatial Analytics on Spark. Retrieved May 1, 2018 from https://hortonworks.com/blog/magellan-geospatial-analytics-in-spark.
- Mingjie Tang, Yongyang Yu, Qutaibah M. Malluhi, Mourad Ouzzani, and Walid G. Aref. 2016. LocationSpark: A distributed in-memory data management system for big spatial data. Proc. VLDB Endow. 9, 13 (Sep. 2016), 1565--1568.
- New York City Taxi and Limousine Commission. 2016. TLC Trip Record Data. Retrieved May 3, 2018 from http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml.
- Patrick Valduriez and Georges Gardarin. 1984. Join and semijoin algorithms for a multiprocessor database machine. ACM Trans. Database Syst. 9, 1 (Mar. 1984), 133--161.
- Tom White. 2009. Hadoop: The Definitive Guide (1st ed.). O’Reilly Media, Inc.
- Randall T. Whitman, Michael B. Park, Sarah M. Ambrose, and Erik G. Hoel. 2014. Spatial indexing and analytics on Hadoop. In Proceedings of the 22nd ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’14). ACM, New York, NY, 73--82.
- Randall T. Whitman, Michael B. Park, Bryan G. Marsh, and Erik G. Hoel. 2017. Spatio-temporal join on Apache Spark. In Proceedings of the 25th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’17). ACM, New York, NY, Article 20, 10 pages.
- Dong Xie, Feifei Li, Bin Yao, Gefei Li, Liang Zhou, and Minyi Guo. 2016. Simba: Efficient in-memory spatial analytics. In Proceedings of the 2016 International Conference on Management of Data (SIGMOD’16). ACM, New York, NY, 1071--1085.
- Simin You, Jianting Zhang, and Le Gruenwald. 2015. Large-scale spatial join query processing in Cloud. In Proceedings of the 2015 31st IEEE International Conference on Data Engineering Workshops (2015), 34--41.
- Jia Yu, Jinxuan Wu, and Mohamed Sarwat. 2015. GeoSpark: A cluster computing framework for processing large-scale spatial data. In Proceedings of the 23rd SIGSPATIAL International Conference on Advances in Geographic Information Systems (SIGSPATIAL’15). ACM, New York, NY, Article 70, 4 pages.
- Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2010. Spark: Cluster computing with working sets. In Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing (HotCloud’10). USENIX Association, Berkeley, CA, 10--10. http://dl.acm.org/citation.cfm?id=1863103.1863113.
- Renyi Liu, Feng Zhang, Zhenhong Du, Jingwei Zhou, and Xinyue Ye. 2016. A new design of high-performance large-scale GIS computing at a finer spatial granularity: A case study of spatial join with spark for sustainability. Sustainability (2071-1050) 8, 9 (2016).
- S. Zhang, J. Han, Z. Liu, K. Wang, and Z. Xu. 2009. SJMR: Parallelizing spatial join with mapreduce on clusters. In Proceedings of the IEEE International Conference on Cluster Computing (CLUSTER’09). IEEE, 1--8.
- Yunqin Zhong, Jizhong Han, Tieying Zhang, Zhenhua Li, Jinyun Fang, and Guihai Chen. 2012. Towards parallel spatial query processing for big spatial data. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW’12). IEEE Computer Society, Los Alamitos, CA, 2085--2094.
Index Terms
- Distributed Spatial and Spatio-Temporal Join on Apache Spark