An Efficient Block Sampling Strategy for Online Aggregation in the Cloud

Ci, Xiang; Meng, Xiaofeng

doi:10.1007/978-3-319-21042-1_29

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9098)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

With the development of social networks, the mobile Internet, and related technologies, an increasing amount of data is being generated, exceeding the processing ability of traditional data management tools. In many real-life applications, users can accept approximate answers accompanied by accuracy guarantees. One of the most commonly used approaches is online aggregation, which answers aggregation queries against random samples and refines the result as more samples are received. In the era of big data, more and more data analysis applications are migrating to the cloud, so online aggregation in the cloud has attracted increasing attention. When dealing with group-by queries, the number of tuples can differ greatly from group to group. As a result, online aggregation based on uniform random sampling can yield poor accuracy for groups with very few tuples. Moreover, data in the cloud are usually organized into blocks, and this organization makes sampling more complex. In this paper, we propose an efficient block sampling strategy that accurately reflects the importance of different blocks for answering group-by queries. We implement our method in a cloud online aggregation system called COLA, and the experimental results demonstrate that it achieves higher accuracy.
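
To make the setting in the abstract concrete, the following is a minimal sketch of generic online aggregation for a group-by AVG query under uniform random sampling. It is not the block sampling strategy proposed in the paper and not COLA code. The names `RunningStat` and `online_group_avg`, the `(group, value)` row layout, the batch size, and the normal-approximation interval with z = 1.96 are all assumptions chosen for illustration.

```python
import math
import random
from collections import defaultdict


class RunningStat:
    """Welford's online mean and variance for one group (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def add(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self, z=1.96):
        # Normal-approximation half-width of the confidence interval.
        # Undefined (reported as infinite) until two samples have been seen.
        if self.n < 2:
            return math.inf
        sample_var = self.m2 / (self.n - 1)
        return z * math.sqrt(sample_var / self.n)


def online_group_avg(rows, batch_size, seed=0):
    """Yield (samples_seen, {group: (estimate, half_width)}) after each batch.

    rows is an iterable of (group, value) pairs. Shuffling the rows stands in
    for drawing tuples uniformly at random without replacement.
    """
    rng = random.Random(seed)
    order = list(rows)
    rng.shuffle(order)
    stats = defaultdict(RunningStat)
    for i, (group, value) in enumerate(order, start=1):
        stats[group].add(value)
        if i % batch_size == 0 or i == len(order):
            yield i, {g: (s.mean, s.half_width()) for g, s in stats.items()}


if __name__ == "__main__":
    # Skewed group sizes: "big" has 10,000 tuples, "small" has only 20.
    data_rng = random.Random(1)
    rows = [("big", data_rng.gauss(100, 15)) for _ in range(10_000)]
    rows += [("small", data_rng.gauss(50, 15)) for _ in range(20)]
    for seen, snapshot in online_group_avg(rows, batch_size=2_000):
        print(seen, {g: (round(m, 2), round(h, 2)) for g, (m, h) in snapshot.items()})
```

Run on skewed data like this, the small group typically contributes only a handful of tuples to the early batches, so its interval stays wide while the large group's interval narrows quickly. This is the accuracy gap under uniform sampling that the abstract describes.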

Author information

Authors and Affiliations

School of Information, Renmin University of China, Beijing, China
Xiang Ci & Xiaofeng Meng

Corresponding author

Correspondence to Xiaofeng Meng.

Editor information

Editors and Affiliations

Google, CA, USA
Xin Luna Dong
Postdoc Apartments (Hong Lou) 4-1-4, Shandong University, Li Cheng, Jinan, China
Xiaohui Yu
Tsinghua University, Beijing, China
Jian Li
Northeastern University, Boston, USA
Yizhou Sun

About this paper

Cite this paper

Ci, X., Meng, X. (2015). An Efficient Block Sampling Strategy for Online Aggregation in the Cloud. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_29

DOI: https://doi.org/10.1007/978-3-319-21042-1_29
Published: 06 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer Science, Computer Science (R0)
