DOI: 10.1145/2525314.2525347
research-article

A parallel spatial data analysis infrastructure for the cloud

Published: 05 November 2013

ABSTRACT

Spatial data analysis applications are emerging from a wide range of domains such as building information management, environmental assessments and medical imaging. Time-consuming computational geometry algorithms make these applications slow, even for medium-sized datasets. At the same time, there is a rapid expansion in available processing cores, through multicore machines and Cloud computing. The confluence of these trends demands effective parallelization of spatial query processing. Unfortunately, traditional parallel spatial databases are ill-equipped to deal with the performance heterogeneity that is common in the Cloud.

We introduce Niharika, a parallel spatial data analysis infrastructure that exploits all available cores in a heterogeneous cluster. Niharika first uses a declustering technique that creates balanced spatial partitions. Then, Niharika adapts to performance heterogeneity and processing skew in the spatial dataset using dynamic load-balancing. We evaluate Niharika with three load-balancing algorithms and two different spatial datasets (both from TIGER) using Amazon EC2 instances. Niharika adapts to the performance heterogeneity in the EC2 nodes, thereby achieving excellent speedups (e.g., 63.6X using 64 cores on 16 4-core EC2 nodes, in the best case) and outperforming an approach that does not adapt.
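To illustrate why adapting to performance heterogeneity matters, the following is a minimal simulation sketch, not Niharika's actual algorithm. It compares a static up-front assignment of partitions with a pull-based scheme in which an idle worker takes the next partition. The function names, the partition costs, and the worker speeds are all hypothetical values chosen for illustration; only the 63.6X on 64 cores figure comes from the abstract.

```python
import heapq


def static_makespan(costs, speeds):
    """Assign partitions round-robin up front (no adaptation).

    costs  : work units per partition (hypothetical values)
    speeds : work units per second for each worker (hypothetical values)
    Returns the time at which the slowest worker finishes.
    """
    load = [0.0] * len(speeds)
    for i, cost in enumerate(costs):
        load[i % len(speeds)] += cost
    return max(work / speed for work, speed in zip(load, speeds))


def dynamic_makespan(costs, speeds):
    """Pull-based scheduling: the earliest idle worker takes the next partition."""
    # Min-heap of (time at which the worker becomes idle, worker id).
    idle = [(0.0, worker) for worker in range(len(speeds))]
    heapq.heapify(idle)
    finish = 0.0
    for cost in costs:
        start, worker = heapq.heappop(idle)
        done = start + cost / speeds[worker]
        finish = max(finish, done)
        heapq.heappush(idle, (done, worker))
    return finish


def parallel_efficiency(speedup, cores):
    """Fraction of ideal linear speedup that was achieved."""
    return speedup / cores


if __name__ == "__main__":
    costs = [1.0] * 16             # 16 equal-cost partitions (assumed)
    speeds = [1.0, 1.0, 1.0, 0.5]  # one worker runs at half speed (assumed)
    print(static_makespan(costs, speeds))
    print(dynamic_makespan(costs, speeds))
    print(parallel_efficiency(63.6, 64))
```

With these assumed inputs the static assignment finishes at 8.0 time units, because the slow worker holds a quarter of the work, while the pull-based scheme finishes at 5.0. The last line shows that the reported best case of 63.6X on 64 cores corresponds to a parallel efficiency of roughly 99.4%.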


      • Published in

        SIGSPATIAL'13: Proceedings of the 21st ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems
        November 2013
        598 pages
        ISBN: 9781450325219
        DOI: 10.1145/2525314

        Copyright © 2013 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        Overall Acceptance Rate: 220 of 1,116 submissions, 20%
