Improving Online Aggregation Performance for Skewed Data Distribution

Wang, Yuxiang; Luo, Junzhou; Song, Aibo; Jin, Jiahui; Dong, Fang

doi:10.1007/978-3-642-29038-1_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7238)

Included in the following conference series:

International Conference on Database Systems for Advanced Applications

Abstract

Online aggregation is a commonly used technique for answering aggregation queries quickly with progressively refined approximate answers, each within an estimated confidence interval. However, we observe that low selectivity and an inappropriate sample proportion significantly degrade online aggregation performance when the data distribution is skewed. To overcome this problem, we propose a Partition-based Online Aggregation System called POAS. In POAS, the side effect of low selectivity is reduced by efficiently pruning unneeded data through partition and shuffle strategies, and an appropriate sample proportion is approximated as closely as possible by drawing samples (tuples) from relevant partitions with a dynamic sample size. Moreover, POAS applies statistical approaches to calculate estimates from relevant partitions. We have implemented POAS and conducted an extensive experimental study on the TPC-H benchmark with skewed data distribution. Our results demonstrate the efficiency and effectiveness of POAS.
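To make the idea of a refined approximate answer with a confidence interval concrete, the following is a minimal sketch of generic online aggregation for an AVG query. It is not the POAS algorithm: it assumes a single in-memory column, a uniform random scan order, and a normal-approximation (CLT) interval with no finite-population correction. The function name `online_avg` and its parameters are hypothetical and chosen only for illustration.

```python
import math
import random


def online_avg(values, batch_size=100, z=1.96, seed=0):
    """Yield (samples_seen, running_mean, ci_half_width) after each batch.

    Illustrative only: a generic online-aggregation loop, not POAS.
    z=1.96 corresponds to an approximate 95% confidence level.
    """
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # random scan order, so each prefix is a uniform sample

    n, mean, m2 = 0, 0.0, 0.0  # Welford accumulators for mean and variance
    for i, x in enumerate(order, start=1):
        n += 1
        delta = x - mean
        mean += delta / n
        m2 += delta * (x - mean)
        if i % batch_size == 0 or i == len(order):
            var = m2 / (n - 1) if n > 1 else 0.0
            # CLT-based half-width; ignores finite-population correction,
            # so it stays positive even after the full scan.
            yield n, mean, z * math.sqrt(var / n)


if __name__ == "__main__":
    # Assumed skewed column: many small values, a few very large ones.
    data = [1] * 900 + [1000] * 100
    for seen, est, half in online_avg(data):
        print(f"{seen:5d} tuples: AVG ~ {est:.2f} +/- {half:.2f}")
```

With skewed data such as the assumed column above, early estimates depend heavily on how many of the rare large values have been sampled so far, which is the kind of sensitivity to sample proportion the abstract describes.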

Author information

Authors and Affiliations

School of Computer Science and Engineering, Southeast University, Nanjing, P.R. China
Yuxiang Wang, Junzhou Luo, Aibo Song, Jiahui Jin & Fang Dong

Editor information

Editors and Affiliations

School of Computer Science and Engineering, Seoul National University, Gwanak-ro, Gwanak-gu, 151747, Seoul, South Korea
Sang-goo Lee
Computer School, Wuhan University, Luo-jia-shan, Wuchang, 430081, Wuhan, Hubei Province, China
Zhiyong Peng
School of Information Technology and Electrical Engineering, University of Queensland, QLD 4072, Brisbane, Australia
Xiaofang Zhou
Department of Computer Science, Kangwon National University, 192-1, Hyoja2-Dong, Chuncheon, 200701, Kangwon, South Korea
Yang-Sae Moon
Institute for Computer Science and Business Information, University of Duisburg-Essen, Schützenbahn 70, 45117, Essen, Germany
Rainer Unland
School of Information and Communication Engineering, Chungbuk National University, 52 Naesudong-ro, Heungdeok-gu, Cheongju, 4072, Chungbuk, South Korea
Jaesoo Yoo

About this paper

Cite this paper

Wang, Y., Luo, J., Song, A., Jin, J., Dong, F. (2012). Improving Online Aggregation Performance for Skewed Data Distribution. In: Lee, Sg., Peng, Z., Zhou, X., Moon, YS., Unland, R., Yoo, J. (eds) Database Systems for Advanced Applications. DASFAA 2012. Lecture Notes in Computer Science, vol 7238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29038-1_4

DOI: https://doi.org/10.1007/978-3-642-29038-1_4
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-29037-4
Online ISBN: 978-3-642-29038-1
eBook Packages: Computer Science, Computer Science (R0)
