Improving Online Aggregation Performance for Skewed Data Distribution

  • Conference paper
Database Systems for Advanced Applications (DASFAA 2012)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 7238)

Abstract

Online aggregation is a commonly used technique for answering aggregation queries quickly with progressively refined approximate answers, each within an estimated confidence interval. However, we observe that low selectivity and an inappropriate sample proportion significantly degrade online aggregation performance when the data distribution is skewed. To overcome this problem, we propose a Partition-based Online Aggregation System called POAS. In POAS, partition and shuffle strategies reduce the side effect of low selectivity by efficiently pruning unneeded data, and an appropriate sample proportion is approached as closely as possible by drawing samples (tuples) from the relevant partitions with a dynamic sample size. Moreover, POAS applies statistical approaches to calculate estimates from the relevant partitions. We have implemented POAS and conducted an extensive experimental study on the TPC-H benchmark with skewed data distribution. Our results demonstrate the efficiency and effectiveness of POAS.
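
The general idea of estimating an aggregate from per-partition samples can be illustrated with a textbook stratified estimator. The sketch below is not the POAS algorithm, whose details the abstract does not give; it assumes each partition is an in-memory list, that a single hypothetical `fraction` parameter sets the sample size per partition, and that a normal-approximation interval is acceptable. The function name `stratified_sum_estimate` and all parameters are illustrative only.

```python
import math
import random
from statistics import NormalDist


def stratified_sum_estimate(partitions, predicate, value, fraction,
                            confidence=0.95, seed=0):
    """Estimate SUM(value(t)) over tuples t satisfying predicate.

    Each partition is treated as one stratum (an assumption of this sketch).
    Returns (estimate, half_width) of a normal-approximation interval.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    estimate = 0.0
    variance = 0.0
    for part in partitions:
        size = len(part)
        if size == 0:
            continue
        # Sample at least two tuples where possible so a variance exists.
        n = min(size, max(2, math.ceil(fraction * size)))
        sample = rng.sample(part, n)
        # Tuples failing the predicate contribute zero to the sum.
        ys = [value(t) if predicate(t) else 0.0 for t in sample]
        mean = sum(ys) / n
        s2 = sum((y - mean) ** 2 for y in ys) / (n - 1) if n > 1 else 0.0
        # Scale the sample mean up to the partition size.
        estimate += size * mean
        # Finite population correction: a fully scanned partition adds no error.
        variance += size * size * (1 - n / size) * s2 / n
    return estimate, z * math.sqrt(variance)
```

With `fraction=1.0` every partition is scanned in full, so the estimate equals the exact sum and the interval collapses to zero width. Smaller fractions trade accuracy for speed, and low selectivity shows up as many zero contributions in `ys`, which is the situation the abstract identifies as problematic for skewed data.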


Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, Y., Luo, J., Song, A., Jin, J., Dong, F. (2012). Improving Online Aggregation Performance for Skewed Data Distribution. In: Lee, Sg., Peng, Z., Zhou, X., Moon, YS., Unland, R., Yoo, J. (eds) Database Systems for Advanced Applications. DASFAA 2012. Lecture Notes in Computer Science, vol 7238. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-29038-1_4

  • DOI: https://doi.org/10.1007/978-3-642-29038-1_4

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-29037-4

  • Online ISBN: 978-3-642-29038-1

  • eBook Packages: Computer Science, Computer Science (R0)
