research-article

A parallel spatial data analysis infrastructure for the cloud

Authors:
Suprio Ray (University of Toronto),
Bogdan Simion (University of Toronto),
Angela Demke Brown (University of Toronto),
Ryan Johnson (University of Toronto)

SIGSPATIAL'13: Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, November 2013, Pages 284–293, https://doi.org/10.1145/2525314.2525347

Published: 05 November 2013

ABSTRACT

Spatial data analysis applications are emerging from a wide range of domains such as building information management, environmental assessments and medical imaging. Time-consuming computational geometry algorithms make these applications slow, even for medium-sized datasets. At the same time, there is a rapid expansion in available processing cores, through multicore machines and Cloud computing. The confluence of these trends demands effective parallelization of spatial query processing. Unfortunately, traditional parallel spatial databases are ill-equipped to deal with the performance heterogeneity that is common in the Cloud.

We introduce Niharika, a parallel spatial data analysis infrastructure that exploits all available cores in a heterogeneous cluster. Niharika first uses a declustering technique that creates balanced spatial partitions. Then, Niharika adapts to performance heterogeneity and processing skew in the spatial dataset using dynamic load-balancing. We evaluate Niharika with three load-balancing algorithms and two different spatial datasets (both from TIGER) using Amazon EC2 instances. Niharika adapts to the performance heterogeneity in the EC2 nodes, thereby achieving excellent speedups (e.g., 63.6X using 64 cores on 16 4-core EC2 nodes, in the best case) and outperforming an approach that does not adapt.
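
To illustrate why adapting to performance heterogeneity matters, here is a minimal sketch that compares a fixed up-front assignment of partitions with a pull-based scheme in which an idle worker takes the next partition. It is not one of the three load-balancing algorithms evaluated in the paper, which the abstract does not describe; the function names, partition costs, and worker speeds are all hypothetical.

```python
import heapq

def static_makespan(costs, speeds):
    """Assign partitions round-robin up front; return when the slowest worker finishes."""
    loads = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        loads[i % len(speeds)] += cost
    return max(load / speed for load, speed in zip(loads, speeds))

def dynamic_makespan(costs, speeds):
    """Hand each partition to whichever worker becomes idle first (pull-based)."""
    # Min-heap of (time at which the worker becomes idle, worker id).
    idle = [(0.0, w) for w in range(len(speeds))]
    heapq.heapify(idle)
    finish = 0.0
    for cost in costs:
        t, w = heapq.heappop(idle)
        done = t + cost / speeds[w]
        finish = max(finish, done)
        heapq.heappush(idle, (done, w))
    return finish

# Hypothetical workload: 12 equal-cost partitions, two fast and two slow workers.
costs = [1.0] * 12
speeds = [1.0, 1.0, 0.5, 0.5]

print(static_makespan(costs, speeds))   # 6.0: slow workers hold up the query
print(dynamic_makespan(costs, speeds))  # 4.0: fast workers absorb extra partitions
```

In this toy setting the pull-based scheme reaches the ideal completion time (total work 12 divided by total speed 3), while the static assignment is limited by the slow workers even though the partitions themselves are perfectly balanced.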

Index Terms

1. Human-centered computing
  1. Visualization
    1. Visualization application domains
      1. Geographic visualization
2. Information systems
  1. Information systems applications
    1. Spatial-temporal systems

Published in
SIGSPATIAL'13: Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
November 2013
598 pages
ISBN: 9781450325219
DOI: 10.1145/2525314
General Chairs: Craig Knoblock (University of Southern California), Markus Schneider (University of Florida)
Program Chairs: Peer Kröger (Ludwig-Maximilians-Universität München, Germany), John Krumm (Microsoft Research), Peter Widmayer (ETH Zürich, Switzerland)
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud
load balancing
performance heterogeneity
spatial join
Qualifiers
- research-article
Acceptance Rates
Overall Acceptance Rate: 220 of 1,116 submissions, 20%

Article Metrics
- Total Citations: 18
- Total Downloads: 335
- Downloads (Last 12 months): 15
- Downloads (Last 6 weeks): 0