ABSTRACT
Spatial data analysis applications are emerging from a wide range of domains such as building information management, environmental assessments and medical imaging. Time-consuming computational geometry algorithms make these applications slow, even for medium-sized datasets. At the same time, there is a rapid expansion in available processing cores, through multicore machines and Cloud computing. The confluence of these trends demands effective parallelization of spatial query processing. Unfortunately, traditional parallel spatial databases are ill-equipped to deal with the performance heterogeneity that is common in the Cloud.
We introduce Niharika, a parallel spatial data analysis infrastructure that exploits all available cores in a heterogeneous cluster. Niharika first uses a declustering technique that creates balanced spatial partitions. Then, Niharika adapts to performance heterogeneity and processing skew in the spatial dataset using dynamic load-balancing. We evaluate Niharika with three load-balancing algorithms and two different spatial datasets (both from TIGER) using Amazon EC2 instances. Niharika adapts to the performance heterogeneity in the EC2 nodes, thereby achieving excellent speedups (e.g., 63.6X using 64 cores on 16 4-core EC2 nodes, in the best case) and outperforming an approach that does not adapt.
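The benefit of adapting to performance heterogeneity can be illustrated with a toy simulation. This is a minimal sketch, not Niharika's implementation: the abstract does not specify its three load-balancing algorithms, so the pull-based scheduler, the function names `static_makespan` and `dynamic_makespan`, and the partition costs and worker speeds below are all assumptions chosen for illustration.

```python
import heapq


def static_makespan(costs, speeds):
    """Assign partitions round-robin up front; return when the last worker finishes."""
    loads = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        worker = i % len(speeds)
        loads[worker] += cost / speeds[worker]
    return max(loads)


def dynamic_makespan(costs, speeds):
    """Each worker pulls the next partition as soon as it becomes idle."""
    # Min-heap of (time the worker becomes free, worker id).
    free_at = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(free_at)
    finish = 0.0
    for cost in costs:
        now, worker = heapq.heappop(free_at)
        done = now + cost / speeds[worker]
        finish = max(finish, done)
        heapq.heappush(free_at, (done, worker))
    return finish


# Hypothetical workload: 16 equal-cost partitions (as balanced declustering
# would aim for) on four workers, one of which runs at half speed.
costs = [1.0] * 16
speeds = [1.0, 1.0, 1.0, 0.5]

print("static :", static_makespan(costs, speeds))
print("dynamic:", dynamic_makespan(costs, speeds))
```

With these assumed inputs, the static assignment is held back by the slow worker, which needs twice as long for its equal share, while the pull-based variant lets the faster workers absorb more partitions and finishes sooner.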
Index Terms
- A parallel spatial data analysis infrastructure for the cloud