Abstract
With the development of social networks, the mobile Internet, and similar platforms, an ever-increasing amount of data is being generated, exceeding the processing capacity of traditional data management tools. In many real-life applications, users can accept approximate answers accompanied by accuracy guarantees. One of the most commonly used approaches is online aggregation, which answers aggregation queries from random samples and refines the result as more samples are received. In the era of big data, more and more data analysis applications are being migrated to the cloud, so online aggregation in the cloud has attracted growing attention. For group-by queries, the number of tuples can differ greatly from group to group. As a result, online aggregation based on uniform random sampling can yield poor accuracy for groups with very few tuples. Moreover, data in the cloud are usually organized into blocks, and this organization makes sampling more complex. In this paper, we propose an efficient block sampling strategy that accurately reflects the importance of different blocks for answering group-by queries. We implement our methods in a cloud online aggregation system called COLA, and the experimental results demonstrate that our method obtains results with higher accuracy.
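To make the small-group problem concrete, the following is a minimal sketch of online aggregation for a group-by AVG query under uniform tuple-level sampling. It is not the block sampling strategy proposed in the paper and does not use COLA; the function names, the synthetic data, and the normal-approximation confidence interval (without a finite-population correction) are assumptions chosen for illustration only.

```python
import math
import random
from collections import defaultdict

Z_95 = 1.96  # normal quantile for an approximate 95% confidence interval


def online_group_avg(rows, batch_size=1000, seed=0):
    """Yield (n_seen, {group: (mean, half_width, count)}) after each batch.

    rows is a list of (group_key, value) pairs. Uniform sampling without
    replacement is simulated by shuffling a copy and reading it in order.
    """
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)

    count = defaultdict(int)
    total = defaultdict(float)
    total_sq = defaultdict(float)

    for i, (group, value) in enumerate(shuffled, start=1):
        count[group] += 1
        total[group] += value
        total_sq[group] += value * value
        if i % batch_size == 0 or i == len(shuffled):
            estimates = {}
            for key, n in count.items():
                mean = total[key] / n
                if n > 1:
                    # Sample variance from running sums, clamped at zero
                    # to guard against floating-point round-off.
                    var = max((total_sq[key] - n * mean * mean) / (n - 1), 0.0)
                    half = Z_95 * math.sqrt(var / n)
                else:
                    half = float("inf")  # one sample gives no error bound
                estimates[key] = (mean, half, n)
            yield i, estimates


def make_skewed_rows(seed=1):
    """Hypothetical data set: one large group and one small group."""
    rng = random.Random(seed)
    rows = [("big", rng.gauss(100.0, 10.0)) for _ in range(99_000)]
    rows += [("small", rng.gauss(50.0, 10.0)) for _ in range(1_000)]
    return rows


if __name__ == "__main__":
    for n_seen, est in online_group_avg(make_skewed_rows(), batch_size=20_000):
        summary = {k: (round(m, 2), round(h, 2), n) for k, (m, h, n) in est.items()}
        print(n_seen, summary)
```

Because each tuple is equally likely to be sampled, the small group receives only about 1% of the samples in this synthetic setting, so its confidence interval shrinks far more slowly than that of the large group. This is the imbalance that motivates weighting blocks by their importance to each group.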
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Ci, X., Meng, X. (2015). An Efficient Block Sampling Strategy for Online Aggregation in the Cloud. In: Dong, X., Yu, X., Li, J., Sun, Y. (eds) Web-Age Information Management. WAIM 2015. Lecture Notes in Computer Science, vol 9098. Springer, Cham. https://doi.org/10.1007/978-3-319-21042-1_29
DOI: https://doi.org/10.1007/978-3-319-21042-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21041-4
Online ISBN: 978-3-319-21042-1
eBook Packages: Computer Science (R0)