skip to main content
10.1145/1376616.1376720acmconferencesArticle/Chapter ViewAbstractPublication PagesmodConference Proceedingsconference-collections
research-article

Handling data skew in parallel joins in shared-nothing systems

Published:09 June 2008Publication History

ABSTRACT

Parallel processing continues to be important in large data warehouses. The processing requirements continue to expand in multiple dimensions. These include greater volumes, increasing number of concurrent users, more complex queries, and more applications which define complex logical, semantic, and physical data models. Shared nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take into account data skew. Data skew occurs naturally in many applications. A query processing skewed data not only slows down its response time, but generates hot nodes, which become a bottleneck throttling the overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly speeds up query elapsed time in the presence of data skew. Our experience shows that eliminating system bottlenecks caused by data skew improves the throughput of the whole system which is important in parallel data warehouses that often run high concurrency workloads.

References

  1. TPC Benchmark H (decision support) standard specification http://www.tpc.org.Google ScholarGoogle Scholar
  2. K. Alsabti and S. Ranka. Skew-insensitive parallel algorithms for relational join. In HIPC, page 367, 1998. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. M. Bamha and G. Hains. Frequency-adaptive join for shared nothing machines. Progress in computer research, pages 227--241, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143--154, 1979.Google ScholarGoogle ScholarCross RefCross Ref
  5. H.M. Dewan, M. A. Hernández, K. W. Mok, and S. J. Stolfo. Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew. In PDIS, pages 40--49, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85--98, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, 1992. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. FrankǎOlken and DoronǎRotem. Random sampling from databases: a survey. Statistics and Computing, 5(1):25--42, 1995.Google ScholarGoogle ScholarCross RefCross Ref
  9. R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416--429, 1969.Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. L. Harada and M. Kitsuregawa. Dynamic join product skew handling for hash-joins in shared-nothing database systems. In DASFAA, pages 246--255, 1995. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, pages 525--535, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. E. G. C. Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM J. Comput., 7(1):1--17, 1978.Google ScholarGoogle ScholarCross RefCross Ref
  13. M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: A new, robust, parallel hash join method for data skew in the super database computer (sdc). In VLDB, pages 210--221, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. M. S. Lakshmi and P. S. Yu. Effectiveness of parallel joins. IEEE Transactions on Knowledge and Data Engineering, 2(4):410--424, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In SIGMOD Conference, pages 119--128, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4--9, 1986.Google ScholarGoogle Scholar
  17. C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In VLDB, pages 537--548, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. J. L.Wolf, D. M. Dias, and P. S. Yu. A parallel sort merge join algorithm for managing data skew. IEEE Trans. Parallel Distrib. Syst., 4(1):70--86, 1993. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In ICDE, pages 200--209, 1991. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. New algorithms for parallelizing relational database joins in the presence of data skew. IEEE Trans. Knowl. Data Eng., 6(6):990--997, 1994. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. X. Zhou and M. E. Orlowska. Handling data skew in parallel hash join computation using two-phase scheduling. In IEEE 1st International Conference on Algorithm and Architecture for Parallel Processing, pages 527--536 vol.2, 1995.Google ScholarGoogle ScholarCross RefCross Ref

Index Terms

  1. Handling data skew in parallel joins in shared-nothing systems

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
      June 2008
      1396 pages
      ISBN:9781605581026
      DOI:10.1145/1376616

      Copyright © 2008 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 9 June 2008

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate785of4,003submissions,20%

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader