GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial data

Yu, Jia; Sarwat, Mohamed

doi:10.1007/s00778-020-00645-2

Regular Paper
Published: 07 January 2021

Volume 30, pages 237–258, (2021)

The VLDB Journal

Abstract

In the last decade, geospatial data extracted from GPS traces and satellite images has become ubiquitous. GeoVisual analytics (GeoViz for short) is the science of analytical reasoning assisted by geospatial map interfaces. GeoViz involves two phases: (1) spatial data processing, which loads spatial data and executes spatial queries to return the set of spatial objects to be visualized; and (2) map visualization, which applies a map visualization effect, e.g., a heat map, to the spatial objects produced in the first phase. Existing GeoViz system architectures decouple these two phases and thus lose the opportunity to co-optimize the data processing and map visualization phases in the same cluster. To remedy this, the paper presents GeoSparkViz, a full-fledged system that allows the user to load, process, integrate and execute GeoViz tasks on spatial data at scale. GeoSparkViz extends a state-of-the-art distributed data management system to provide native support for general geospatial map visualization. The system encapsulates the main steps of the map visualization process, e.g., pixelizing spatial objects, aggregating pixels, and rendering map tiles, into a set of massively parallelized map building operators. This allows the system to co-optimize the spatial query operators and map building operators side by side. GeoSparkViz is also equipped with a GeoViz-aware spatial partitioning operator that achieves load balancing for GeoViz workloads among all nodes in the cluster. Experiments based on an implementation in Spark show that GeoSparkViz achieves up to an order of magnitude less data-to-visualization time than its counterparts when running visual analytics tasks over large-scale spatial data extracted from the NYC taxi dataset and OpenStreetMap.

References

Aji, A., Wang, F., Vo, H., Lee, R., Liu, Q., Zhang, X., Saltz, J.H.: Hadoop-GIS: a high performance spatial data warehousing system over MapReduce. Proc. VLDB Endow. PVLDB 6(11), 1009–1020 (2013)
Armbrust, M., Xin, R.S., Lian, C., Huai, Y., Liu, D., Bradley, J.K., Meng, X., Kaftan, T., Franklin, M.J., Ghodsi, A., Zaharia, M.: Spark SQL: relational data processing in spark. In: Sellis, T.K., Davidson, S.B., Ives, Z.G. (eds.) Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 1383–1394. ACM (2015)
Baig, F., Mehrotra, M., Vo, H., Wang, F., Saltz, J.H., Kurç, T.M.: SparkGIS: efficient comparison and evaluation of algorithm results in tissue image analysis studies. In: Workshop on Biomedical Data Management and Graph Online Querying—VLDB, pp. 134–146 (2015)
Battle, L., Stonebraker, M., Chang, R.: Dynamic reduction of query result sets for interactive visualizaton. In: Proceedings of International Conference on Big Data, BigData, pp. 1–8 (2013)
Crotty, A., Galakatos, A., Zgraggen, E., Binnig, C., Kraska, T.: Vizdom: interactive analytics through pen and touch. Proc. VLDB Endow. PVLDB 8(12), 2024–2027 (2015)
de Lara Pahins, C.A., Stephens, S.A., Scheidegger, C., Comba, J.L.D.: Hashedcubes: simple, low memory, real-time visual exploration of big data. IEEE Trans. Vis. Comput. Graph. TVCG 23(1), 671–680 (2017)
Eldawy, A., Alarabi, L., Mokbel, M.F.: Spatial partitioning techniques in spatial hadoop. Proc. VLDB Endow. PVLDB 8(12), 1602–1605 (2015)
Eldawy, A., Mokbel, M.F.: A demonstration of spatialhadoop: an efficient mapreduce framework for spatial data. Proc. VLDB Endow. PVLDB 6(12), 1230–1233 (2013)
Eldawy, A., Mokbel, M.F., Alharthi, S., Alzaidy, A., Tarek, K., Ghani, S.: Shahed: a mapreduce-based system for querying and visualizing spatio-temporal satellite data. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1585–1596. IEEE (2015)
Eldawy, A., Mokbel, M.F., Jonathan, C.: Hadoopviz: a mapreduce framework for extensible visualization of big spatial data. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 601–612. IEEE (2016)
Earthdata Cloud Evolution. http://www.naturalearthdata.com/downloads/
Guo, T., Feng, K., Cong, G., Bao, Z.: Efficient selection of geospatial data on maps for interactive and visualized exploration. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 567–582. ACM (2018)
Apache Hadoop. http://hadoop.apache.org/
Hughes, J.N., Annex, A., Eichelberger, C.N., Fox, A., Hulbert, A., Ronquest, M.: Geomesa: a distributed architecture for spatio-temporal fusion. In: SPIE Defense+Security, pp. 94730F–94730F. International Society for Optics and Photonics (2015)
Kefaloukos, P.K., Salles, M.A.V., Zachariasen, M.: Declarative cartography: in-database map generalization of geospatial datasets. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1024–1035. IEEE (2014)
Kini, A., Emanuele, R.: Geotrellis: adding geospatial capabilities to spark. Spark Summit (2014)
Lins, L., Klosowski, J.T., Scheidegger, C.: Nanocubes for real-time exploration of spatiotemporal datasets. IEEE Trans. Vis. Comput. Graph. TVCG 19(12), 2456–2465 (2013)
Liu, Z., Jiang, B., Heer, J.: immens: real-time visual querying of big data. In: Computer Graphics Forum, vol. 32, pp. 421–430. Wiley Online Library (2013)
Apache Livy. https://livy.apache.org/
Lu, J., Güting, R.H.: Parallel secondo: boosting database engines with hadoop. In: International Conference on Parallel and Distributed Systems, ICPADS, pp. 738–743. IEEE (2012)
Mahdian, M., Schrijvers, O., Vassilvitskii, S.: Algorithmic cartography: placing points of interest and ads on maps. In: Proceedings of the International Conference on Knowledge Discovery and Data Mining, SIGKDD, pp. 755–764 (2015)
Mostak, T.: An overview of MAPD (massively parallel database). White paper, Massachusetts Institute of Technology (2013)
OpenStreetMap. Map Zoom Level. http://wiki.openstreetmap.org/wiki/Zoom_levels
Park, Y., Cafarella, M.J., Mozafari, B.: Visualization-aware sampling for very large databases. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 755–766. IEEE (2016)
Rahman, S., Aliakbarpour, M., Kong, H., Blais, E., Karahalios, K., Parameswaran, A.G., Rubinfeld, R.: I’ve seen “enough”: incrementally improving visualizations to support rapid decision making. Proc. VLDB Endow. PVLDB 10(11), 1262–1273 (2017)
Sarma, A.D., Lee, H., Gonzalez, H., Madhavan, J., Halevy, A.Y.: Efficient spatial sampling of large geographical tables. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 193–204 (2012)
Satyanarayan, A., Moritz, D., Wongsuphasawat, K., Heer, J.: Vega-lite: a grammar of interactive graphics. IEEE Trans. Vis. Comput. Graph. TVCG 23(1), 341–350 (2017)
Satyanarayan, A., Russell, R., Hoffswell, J., Heer, J.: Reactive vega: a streaming dataflow architecture for declarative interactive visualization. IEEE Trans. Vis. Comput. Graph. TVCG 22(1), 659–668 (2016)
Scarsella, A., Stofega, W.: Worldwide Smartphone Forecast 2020–2024. Technical report, International Data Corporation (IDC) (2020). https://www.idc.com/getdoc.jsp?containerId=US46135620
Apache Spark. http://spark.apache.org/
Su, S., An, M., Perry, V., Jia, J., Kim, T., Chen, T., Li, C.: Visually analyzing A billion tweets: an application for collaborative visual analytics on large high-resolution display. In: Proceedings of International Conference on Big Data, BigData, pp. 3597–3606 (2018)
Tang, M., Yu, Y., Malluhi, Q.M., Ouzzani, M., Aref, W.G.: LocationSpark: a distributed in-memory data management system for big spatial data. Proc. VLDB Endow. PVLDB 9(13), 1565–1568 (2016)
Wang, L., Christensen, R., Li, F., Yi, K.: Spatial online sampling and aggregation. Proc. VLDB Endow. PVLDB 9(3), 84–95 (2015)
Weibel, R., Dutton, G.: Generalising spatial data and dealing with multiple representations. Geograph. Inf. Syst. 1, 125–155 (1999)
Wu, E., Battle, L., Madden, S.R.: The case for data visualization management systems. Proc. VLDB Endow. PVLDB 7(10), 903–906 (2014)
Xie, D., Li, F., Yao, B., Li, G., Zhou, L., Guo, M.: Simba: efficient in-memory spatial analytics. In: Proceedings of the ACM International Conference on Management of Data, SIGMOD, pp. 1071–1085. ACM (2016)
Yu, J., Sarwat, M.: Indexing the pickup and drop-off locations of NYC taxi trips in postgresql—lessons from the road. In: Proceedings of the International Symposium on Advances in Spatial and Temporal Databases, SSTD, pp. 145–162 (2017)
Yu, J., Sarwat, M.: Turbocharging geospatial visualization dashboards via a materialized sampling cube approach. In: Proceedings of the International Conference on Data Engineering, ICDE, pp. 1165–1176. IEEE (2020)
Yu, J., Zhang, Z., Sarwat, M.: Geosparkviz: a scalable geospatial data visualization framework in the apache spark ecosystem. In: Proceedings of the International Conference on Scientific and Statistical Database Management, SSDBM, pp. 15:1–15:12 (2018)
Yu, J., Zhang, Z., Sarwat, M.: Spatial data management in apache spark: the GeoSpark perspective and beyond. GeoInformatica 23(1), 37–78 (2019)

Author information

Authors and Affiliations

School of Electrical Engineering and Computer Science, Washington State University, 355 NE Spokane Street, Pullman, WA, 99163, USA
Jia Yu
School of Computing, Informatics, and Decision Systems Engineering, Arizona State University, 699 S Mill Avenue, Tempe, AZ, 85281, USA
Mohamed Sarwat

Corresponding author

Correspondence to Jia Yu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional GeoViz SQL specification

This section gives an additional specification of GeoViz SQL that complements the content in Sect. 4. It also gives more examples that demonstrate how to assemble map effects.

1.1 Type and function specification

GeoViz SQL allows declarative SQL-like queries over structured RDDs. Each RDD has a schema which consists of a number of attributes. Each attribute has a type in Spark.

1.1.1 Types

GeoSparkViz adds two new object types, Pixel and Image, so that Spark can understand and manipulate map data. In addition, GeoSpark itself adds a new type to Spark, called Geometry, to represent geospatial data.

Geometry [40] This is a generic data type which internally represents a variety of spatial objects, such as points, line strings, and polygons. It has several fields such as coordinates.

Pixel This type extends the Geometry type to represent pixels, so spatial query operators can process it directly. It is used by several map building operators: Pixelize, Pixel aggregate and Render. Besides the original fields in Geometry, it has two additional fields: (1) resolution and (2) tile ID. A pixel can be considered a point object.

Image This type is a serializable wrapper around the Java BufferedImage class and holds the actual map tile data. It adds serialization functions to BufferedImage. Each map tile in GeoSparkViz is an Image object.
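
The relationship between Geometry and Pixel can be illustrated with a small sketch. It is written in Python rather than the Scala/Java used by GeoSpark, and it is not the GeoSparkViz implementation: the class layout, the field representations (a tuple for the resolution, a "column-row" string for the tile ID) and the helper in_range are all assumptions made for illustration. The point it shows is that a predicate written against Geometry accepts Pixel objects unchanged.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Geometry:
    # Simplified stand-in: a single (x, y) coordinate. The real Geometry
    # type also covers line strings and polygons.
    coordinates: Tuple[float, float]


@dataclass(frozen=True)
class Pixel(Geometry):
    # The two extra fields named in the text; how they are represented
    # here is an assumption.
    resolution: Tuple[int, int] = (256, 256)
    tile_id: str = "0-0"


def in_range(geom: Geometry, box: Tuple[float, float, float, float]) -> bool:
    """Hypothetical range predicate written against Geometry only."""
    min_x, min_y, max_x, max_y = box
    x, y = geom.coordinates
    return min_x <= x <= max_x and min_y <= y <= max_y


pixels = [Pixel((10, 20)), Pixel((300, 40), tile_id="1-0")]
# The spatial predicate works on pixels because Pixel is a Geometry.
visible = [p for p in pixels if in_range(p, (0, 0, 255, 255))]
```

Here visible contains only the first pixel, since the second lies outside the 256 x 256 query window.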

1.1.2 Functions

ST_TileId Each pixel in GeoSparkViz has several internal attributes. This function exposes one of them, the tile ID, which is used to partition the pixels properly.

Input The function takes as input a pixel attribute.
Output It returns the tile ID of this pixel. The ID is a string type object.
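
To make the role of the tile ID concrete, the following Python sketch derives a string ID from a pixel's position and groups pixels by it. It is an illustration under stated assumptions rather than the behavior of ST_TileId itself: the "column-row" ID format, the square tile grid, and the function names tile_id and partition_by_tile are hypothetical.

```python
from collections import defaultdict


def tile_id(x: int, y: int, resolution: tuple, tiles_per_axis: int) -> str:
    """Return an assumed 'column-row' tile ID for the pixel at (x, y)."""
    width, height = resolution
    tile_w = width // tiles_per_axis   # tile width in pixels
    tile_h = height // tiles_per_axis  # tile height in pixels
    return f"{x // tile_w}-{y // tile_h}"


def partition_by_tile(pixels, resolution, tiles_per_axis):
    """Group pixel coordinates by tile ID, as a partitioner might."""
    groups = defaultdict(list)
    for x, y in pixels:
        groups[tile_id(x, y, resolution, tiles_per_axis)].append((x, y))
    return dict(groups)


# A 1024 x 1024 map image cut into a 4 x 4 grid of 256 x 256 tiles.
groups = partition_by_tile(
    [(0, 0), (300, 700), (310, 720), (1023, 1023)], (1024, 1024), 4
)
```

Pixels that share a tile ID end up in the same group, so each tile can later be rendered from one group without consulting the others.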

ST_EncodeImage This function returns the base64 string representation of an image. It is intended for client-server environments. For example, some clients, such as Apache Zeppelin, can directly display base64 images.

Input The function takes as input an image attribute.
Output It returns a base64 string of the image.
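
The encoding step can be sketched with Python's standard base64 module. This is not the ST_EncodeImage implementation; the helper names encode_image and as_data_uri are hypothetical, and the input is just the 8-byte PNG signature used as placeholder bytes instead of a rendered map tile.

```python
import base64


def encode_image(image_bytes: bytes) -> str:
    """Return the base64 string for already-encoded image bytes."""
    return base64.b64encode(image_bytes).decode("ascii")


def as_data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Wrap the base64 string in a data URI that an HTML client can display."""
    return f"data:{mime};base64,{encode_image(image_bytes)}"


# Placeholder input: the PNG file signature, not a complete image.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
encoded = encode_image(PNG_SIGNATURE)
uri = as_data_uri(PNG_SIGNATURE)
```

A client that receives such a string can decode it back to the original bytes or embed the data URI directly in an image element.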

1.2 Additional GeoViz query examples

In this section, we provide more examples about how to assemble GeoViz queries. Another example, scatter plot of taxi trip pickup points, can be found in Sect. 4.2.

Spatial dataset We use the NYC taxi trip dataset mentioned in Fig. 5 as the running example in this section. The dataset is loaded into a structured Spatial RDD.

Heat map of taxi trip pickup points This shows a heat map of the distribution of taxi trip pickup points. The color is proportional to the density of pickup points. The max weight is 100, which means that a place with more than 100 pickups is shown in red. The initial weight in the Pixelize operator is 1 and the aggregation strategy is count().

Single-image map of taxi trip pickup points This shows a heat map of the distribution of taxi trip pickup points in a single map image that does not have any tiles. The other parameters are the same as in the previous example. This is similar to the queries shown in Fig. 4.

Heat map of trip fare This shows a heat map of trip fare: the more trips picked up from a place cost, the redder that place is. The max weight is 30, which means that a place whose trips cost more than 30 dollars is shown in red. The initial weight in the Pixelize operator is the “trip fare” attribute and the aggregation strategy is avg().
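
The difference between the count() and avg() heat maps can be sketched in a few lines of Python. This is an illustration of the weighting logic only, not GeoViz SQL or the GeoSparkViz operators: the functions pixelize, aggregate and colorize, the linear blue-to-red color ramp, and the toy coordinates and fares are all assumptions; only the initial weights, the two aggregation strategies and the max weight of 30 come from the text.

```python
from collections import defaultdict


def pixelize(lon, lat, bounds, resolution):
    """Map a point to the (x, y) pixel that contains it."""
    min_lon, min_lat, max_lon, max_lat = bounds
    width, height = resolution
    x = min(int((lon - min_lon) / (max_lon - min_lon) * width), width - 1)
    y = min(int((max_lat - lat) / (max_lat - min_lat) * height), height - 1)
    return x, y


def aggregate(weighted_pixels, strategy):
    """Combine the weights of all objects that fall on the same pixel."""
    buckets = defaultdict(list)
    for pixel, weight in weighted_pixels:
        buckets[pixel].append(weight)
    if strategy == "count":
        return {p: len(ws) for p, ws in buckets.items()}
    if strategy == "avg":
        return {p: sum(ws) / len(ws) for p, ws in buckets.items()}
    raise ValueError(f"unknown strategy: {strategy}")


def colorize(weight, max_weight):
    """Assumed linear ramp from blue to red, saturating at max_weight."""
    t = min(weight / max_weight, 1.0)
    return (round(255 * t), 0, round(255 * (1 - t)))


# Toy trips as (longitude, latitude, fare) in an invented 10 x 10 area.
trips = [(1.5, 8.5, 12.0), (1.7, 8.2, 48.0), (6.5, 3.5, 20.0)]
bounds, resolution = (0, 0, 10, 10), (10, 10)

# Pickup density: initial weight 1, aggregation strategy count().
density = aggregate(
    [(pixelize(lon, lat, bounds, resolution), 1) for lon, lat, _ in trips],
    "count",
)
# Trip fare: initial weight is the fare, aggregation strategy avg().
fare = aggregate(
    [(pixelize(lon, lat, bounds, resolution), f) for lon, lat, f in trips],
    "avg",
)
fare_colors = {p: colorize(w, 30) for p, w in fare.items()}
```

With a max weight of 30, the pixel whose two trips average 30 dollars is drawn fully red, while the pixel with a single 20-dollar trip gets an intermediate color.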

About this article

Cite this article

Yu, J., Sarwat, M. GeoSparkViz: a cluster computing system for visualizing massive-scale geospatial data. The VLDB Journal 30, 237–258 (2021). https://doi.org/10.1007/s00778-020-00645-2

Received: 10 December 2019
Revised: 24 June 2020
Accepted: 27 September 2020
Published: 07 January 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s00778-020-00645-2
