research-article

Handling data skew in parallel joins in shared-nothing systems

Authors:
Yu Xu, Teradata, San Diego, CA, USA
Pekka Kostamaa, Teradata, San Diego, CA, USA
Xin Zhou, Teradata, San Diego, CA, USA
Liang Chen, UCSD, San Diego, CA, USA

SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data, June 2008, Pages 1043–1052. https://doi.org/10.1145/1376616.1376720

Published: 09 June 2008

ABSTRACT

Parallel processing remains essential in large data warehouses, where processing requirements keep growing along several dimensions: larger data volumes, more concurrent users, more complex queries, and more applications that define complex logical, semantic, and physical data models. Shared-nothing parallel database management systems [16] can scale up "horizontally" by adding more nodes. Most parallel algorithms, however, do not take data skew into account, even though skew occurs naturally in many applications. A query over skewed data not only responds more slowly itself but also creates hot nodes, which become bottlenecks that throttle overall system performance. Motivated by real business problems, we propose a new join geography called PRPD (Partial Redistribution & Partial Duplication) to improve the performance and scalability of parallel joins in the presence of data skew in a shared-nothing system. Our experimental results show that PRPD significantly reduces query elapsed time in the presence of data skew. Our experience shows that eliminating the system bottlenecks caused by data skew improves the throughput of the whole system, which is important in parallel data warehouses that often run high-concurrency workloads.
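
The hot-node effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' implementation: the abstract only names PRPD, so the rule used here (rows of R with skewed join keys stay on their current node, the matching rows of S are duplicated to every node, and all other rows are hash-redistributed) is one assumed reading of "Partial Redistribution & Partial Duplication". The cluster size, the `node_of` partitioning function, the toy data, and the assumption that the skewed keys are known in advance are all hypothetical.

```python
from collections import Counter

NUM_NODES = 4          # hypothetical cluster size
SKEWED = {1000}        # assumed to be known in advance (e.g. from statistics)


def node_of(key):
    """Hypothetical partitioning function mapping a join key to a node."""
    return key % NUM_NODES


def hash_redistribute(r_by_node):
    """Conventional plan: every R row is sent to node_of(key)."""
    out = [[] for _ in range(NUM_NODES)]
    for keys in r_by_node:
        for k in keys:
            out[node_of(k)].append(k)
    return out


def prpd_style(r_by_node, s_by_node, skewed):
    """Assumed reading of PRPD: skewed R rows stay local,
    matching S rows are copied to every node, the rest is redistributed."""
    r_out = [[] for _ in range(NUM_NODES)]
    s_out = [[] for _ in range(NUM_NODES)]
    for node, keys in enumerate(r_by_node):
        for k in keys:
            r_out[node if k in skewed else node_of(k)].append(k)
    for keys in s_by_node:
        for k in keys:
            targets = range(NUM_NODES) if k in skewed else [node_of(k)]
            for n in targets:
                s_out[n].append(k)
    return r_out, s_out


def join_count(r_keys, s_keys):
    """Number of (r, s) pairs with equal join keys."""
    s_counts = Counter(s_keys)
    return sum(s_counts[k] for k in r_keys)


# Toy data: each node starts with 68 R rows of the skewed key plus keys 1..32.
r_by_node = [[1000] * 68 + list(range(1, 33)) for _ in range(NUM_NODES)]
s_keys = [1000] + list(range(1, 33))            # one S row per key
s_by_node = [s_keys[n::NUM_NODES] for n in range(NUM_NODES)]

hash_load = [len(p) for p in hash_redistribute(r_by_node)]
r_out, s_out = prpd_style(r_by_node, s_by_node, SKEWED)
prpd_load = [len(p) for p in r_out]
local_joins = sum(join_count(r, s) for r, s in zip(r_out, s_out))
global_join = join_count([k for p in r_by_node for k in p], s_keys)

print(hash_load)                    # [304, 32, 32, 32]
print(prpd_load)                    # [100, 100, 100, 100]
print(local_joins == global_join)   # True
```

With plain hash redistribution, node 0 receives every row carrying the skewed key and becomes the hot node. Under the assumed PRPD-style rule the R rows stay evenly spread, and the per-node joins still produce the same number of result rows as a single global join, at the cost of copying one S row to every node.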

References

[1] TPC Benchmark H (decision support) standard specification. http://www.tpc.org.
[2] K. Alsabti and S. Ranka. Skew-insensitive parallel algorithms for relational join. In HIPC, page 367, 1998.
[3] M. Bamha and G. Hains. Frequency-adaptive join for shared nothing machines. Progress in computer research, pages 227–241, 2001.
[4] J. L. Carter and M. N. Wegman. Universal classes of hash functions. Journal of Computer and System Sciences, 18:143–154, 1979.
[5] H. M. Dewan, M. A. Hernández, K. W. Mok, and S. J. Stolfo. Predictive dynamic load balancing of parallel hash-joins over heterogeneous processors in the presence of data skew. In PDIS, pages 40–49, 1994.
[6] D. DeWitt and J. Gray. Parallel database systems: the future of high performance database systems. Commun. ACM, 35(6):85–98, 1992.
[7] D. J. DeWitt, J. F. Naughton, D. A. Schneider, and S. Seshadri. Practical skew handling in parallel joins. In VLDB, 1992.
[8] Frank Olken and Doron Rotem. Random sampling from databases: a survey. Statistics and Computing, 5(1):25–42, 1995.
[9] R. L. Graham. Bounds on multiprocessing timing anomalies. SIAM Journal on Applied Mathematics, 17(2):416–429, 1969.
[10] L. Harada and M. Kitsuregawa. Dynamic join product skew handling for hash-joins in shared-nothing database systems. In DASFAA, pages 246–255, 1995.
[11] K. A. Hua and C. Lee. Handling data skew in multiprocessor database computers using partition tuning. In VLDB, pages 525–535, 1991.
[12] E. G. Coffman Jr., M. R. Garey, and D. S. Johnson. An application of bin-packing to multiprocessor scheduling. SIAM J. Comput., 7(1):1–17, 1978.
[13] M. Kitsuregawa and Y. Ogawa. Bucket spreading parallel hash: A new, robust, parallel hash join method for data skew in the super database computer (SDC). In VLDB, pages 210–221, 1990.
[14] M. S. Lakshmi and P. S. Yu. Effectiveness of parallel joins. IEEE Transactions on Knowledge and Data Engineering, 2(4):410–424, 1990.
[15] A. Shatdal and J. F. Naughton. Using shared virtual memory for parallel join processing. In SIGMOD Conference, pages 119–128, 1993.
[16] M. Stonebraker. The case for shared nothing. IEEE Database Eng. Bull., 9(1):4–9, 1986.
[17] C. B. Walton, A. G. Dale, and R. M. Jenevein. A taxonomy and performance model of data skew effects in parallel joins. In VLDB, pages 537–548, 1991.
[18] J. L. Wolf, D. M. Dias, and P. S. Yu. A parallel sort merge join algorithm for managing data skew. IEEE Trans. Parallel Distrib. Syst., 4(1):70–86, 1993.
[19] J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. An effective algorithm for parallelizing hash joins in the presence of data skew. In ICDE, pages 200–209, 1991.
[20] J. L. Wolf, D. M. Dias, P. S. Yu, and J. Turek. New algorithms for parallelizing relational database joins in the presence of data skew. IEEE Trans. Knowl. Data Eng., 6(6):990–997, 1994.
[21] X. Zhou and M. E. Orlowska. Handling data skew in parallel hash join computation using two-phase scheduling. In IEEE 1st International Conference on Algorithm and Architecture for Parallel Processing, pages 527–536, vol. 2, 1995.

Index Terms

Handling data skew in parallel joins in shared-nothing systems
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Parallel and distributed DBMSs

Published in
SIGMOD '08: Proceedings of the 2008 ACM SIGMOD international conference on Management of data
June 2008
1396 pages
ISBN: 9781605581026
DOI: 10.1145/1376616
General Chairs:
Laks V. S. Lakshmanan, University of British Columbia, Canada
Raymond T. Ng, University of British Columbia, Canada
Dennis Shasha, New York University, USA
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
data skew
parallel joins
shared nothing
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%