OATS: online aggregation with two-level sharing strategy in cloud

Wang, Yuxiang; Luo, Junzhou; Song, Aibo; Dong, Fang

doi:10.1007/s10619-014-7141-2

Published: 30 January 2014

Volume 32, pages 467–505, (2014)

Distributed and Parallel Databases

Abstract

Online aggregation (OLA) is an attractive sampling-based technique that answers aggregation queries with an approximate estimate of the final result, together with a confidence interval that becomes tighter over time. It has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and to save money by killing the computation early once sufficient accuracy has been obtained. However, one serious limitation restricts the performance of OLA: the lack of sharing when multiple OLA queries are processed. In the original MapReduce paradigm, each query is processed independently without considering potential sharing opportunities, which leads to two major unnecessary execution costs: (1) a large redundant I/O cost, and (2) a duplicated statistical computation cost. To eliminate these additional costs and improve overall performance, we present online aggregation with two-level sharing strategy in cloud (OATS), based on the MapReduce framework, to effectively support online aggregation for large-scale concurrent query processing over skewed data distributions. In the first-level sharing, we propose a sample buffer management mechanism that shares sampling opportunities among multiple OLA queries to reduce the redundant I/O cost. In the second-level sharing, we propose a heuristic algorithm (with good scalability for large inputs) for the statistical computation that shares partial statistics calculations to decrease the number of final aggregation operations, thereby reducing the statistical computation cost. Based on this two-level sharing strategy, we have implemented OATS in Hadoop and conducted an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OATS.
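The general idea behind the abstract can be illustrated with a minimal sketch. It is not the OATS implementation: it assumes a single machine, a shuffled in-memory list standing in for block sampling, and a textbook large-sample (CLT) confidence interval for SUM. The names `RunningEstimate` and `shared_online_aggregation` are hypothetical. The sketch shows only that each sampled tuple is read once and reused by every concurrent query, and that each query reports an estimate whose interval narrows as more samples arrive.

```python
import math
import random


class RunningEstimate:
    """Running mean and variance (Welford's method) for one query."""

    def __init__(self, predicate):
        self.predicate = predicate  # hypothetical per-query filter
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the running mean

    def update(self, value):
        # A tuple that fails the predicate contributes 0 to the SUM.
        x = value if self.predicate(value) else 0.0
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def sum_estimate(self, table_size, z=1.96):
        """Return (estimate, half_width) for SUM at roughly 95% confidence.

        Assumption: plain CLT interval, no finite-population correction.
        """
        if self.n < 2:
            return table_size * self.mean, float("inf")
        variance = self.m2 / (self.n - 1)
        half_width = z * table_size * math.sqrt(variance / self.n)
        return table_size * self.mean, half_width


def shared_online_aggregation(values, queries, batch_size, seed=0):
    """Feed one random sample stream to all queries; yield estimates per batch."""
    rng = random.Random(seed)
    order = list(values)
    rng.shuffle(order)  # a random scan order stands in for block sampling
    for i, value in enumerate(order, start=1):
        for q in queries:  # each sampled tuple is read once, used by all queries
            q.update(value)
        if i % batch_size == 0 or i == len(order):
            yield i, [q.sum_estimate(len(order)) for q in queries]
```

A caller would iterate over the generator and stop as soon as every half-width falls below its accuracy target, which is the early-termination behaviour the abstract describes.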

Notes

Aggregation queries are among the most common query types for data analysis; they often scan a large number of tuples to generate summary and statistical results.
Southeast University Cloud Platform, which supports data processing applications for the whole university.
To the best of our knowledge, OATS is the first work to study shared OLA with a two-level sharing strategy over MapReduce, and the greedy algorithm proposed in [31] is the most recent strong solution to the second-level sharing problem. Therefore, we select this greedy algorithm as the candidate against which to compare the performance of our SLSA algorithm.
In the case of right \(\rightarrow \) left, \(w_{in}\) indicates the weight assigned to the right \(sg\). Otherwise, it indicates the weight assigned to the left \(sg\).
The expired partial statistics are defined as the statistics that have been reused by all \(M_x^i\in sg\).
We do not show a comparison with COLA in this paper for two reasons: (1) several details of our COLA implementation may differ from the original COLA because [2] provides limited information, which may affect the comparison result to some extent, and (2) our Hadoop platform is deployed on virtual clusters rather than the physical clusters mentioned in [2], which introduces differences between our implementation and the original COLA.

References

1. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: Biennial Conference on Innovative Data Systems Research (CIDR), pp. 261–272, 2011
2. Shi, Y., Meng, X., Wang, F., Gan, Y.: You can stop early with cola: online processing of aggregate queries in the cloud. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1223–1232, 2012
3. Hadoop: The Apache Software Foundation. http://hadoop.apache.org. Accessed Nov 2012
4. Borkar, V., Carey, M., Grover, R., Onose, N., Vernica, R.: Hyracks: a flexible and extensible foundation for data-intensive computing. In: Proceedings of IEEE International Conference on Data Engineering (ICDE), pp. 1151–1162, 2011
5. Kolodziej, J., Khan, S.U.: Data scheduling in data grids and data centers: a short taxonomy of problems and intelligent resolution techniques. Trans. Comput. Collect. Intell. X. 7777, 103–119 (2013)
6. Zaharia, M., Borthakur, D., Sarma, J.S., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys), pp. 265–278, 2010
7. Jin, H., Yang, X., Sun, X., et al.: Adapt: availability-aware mapreduce data placement for non-dedicated distributed computing. In: Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS), pp. 516–525, 2012
8. Grover, R., Carey, M.J.: Extending map-reduce for efficient predicate-based sampling. In: Proceedings of the 28th International Conference on Data Engineering (ICDE), pp. 486–497, 2012
9. Moseley, B., Dasgupta, A., Kumar, R., et al.: On scheduling in map-reduce and flow-shops. In: Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 289–298, 2011
10. Tao, Y., Lin, W., Xiao, X.: Minimal mapreduce algorithms. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 529–540, 2013
11. Elteir, M., Lin, H., Feng, W.: Enhancing mapreduce via asynchronous data processing. In: Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), pp. 397–405, 2010
12. Wu, S., Ooi, B.C., Tan, K.-L.: Continuous sampling for online aggregation over multiple queries. In: Proceedings of ACM International Conference on Management of Data (SIGMOD), pp. 651–662, 2010
13. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aggregation queries. In: Proceedings of IEEE International Conference on Data Engineering (ICDE), pp. 534–542, 2001
14. Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)
15. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. ACM SIGMOD Rec. 26(2), 171–182 (1997)
16. Haas, P.J.: Large-sample and deterministic confidence intervals for online aggregation. In: Proceedings of the 9th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 51–62, 1997
17. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. ACM SIGMOD Rec. 28(2), 287–298 (1999)
18. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 252–262, 2002
19. Wang, Y., Luo, J., Song, A., Jin, J., Dong, F.: Improving online aggregation performance for skewed data distribution. In: Proceedings of 17th International Conference on Database Systems for Advanced Applications (DASFAA), pp. 18–32, 2012
20. Wu, S., Jiang, S., Ooi, B.C., Tan, K.-L.: Distributed online aggregations. Proc. VLDB Endow. 2(1), 443–454 (2009)
21. Pansare, N., Borkar, V., Jermaine, C., Condie, T.: Online aggregation for large mapreduce jobs. Proc. VLDB Endow. 4(11), 1135–1145 (2011)
22. Böse, J.-H., Andrzejak, A., Högqvist, M.: Beyond online aggregation: parallel and incremental data mining with online map-reduce. In: Proceedings of the ACM Workshop on Massive Data Analytics on the Cloud (MDAC), pp. 3–8, 2010
23. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., et al.: Online aggregation and continuous query support in mapreduce. In: Proceedings of ACM International Conference on Management of Data (SIGMOD), pp. 1115–1118, 2010
24. Wang, Y., Luo, J., Song, A., Dong, F.: Partition-based online aggregation with shared sampling in cloud. J. Comput. Sci. Technol. 28(6), 989–1011 (2013)
25. Qin, C., Rusu, F.: Parallel online aggregation in action. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 46–49, 2013
26. Gan, Y., Meng, X., Shi, Y.: Processing online aggregation on skewed data in mapreduce. In: Proceedings of the 5th International Workshop on Cloud Data Management (CloudDB), pp. 3–10, 2013
27. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), pp. 29–43, 2003
28. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: Cohadoop: flexible data placement and its exploitation in hadoop. Proc. VLDB Endow. 4(9), 575–585 (2011)
29. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: MRshare: sharing across multiple queries in mapreduce. Proc. VLDB Endow. 3(1), 494–505 (2010)
30. Krishnamurthy, S., Wu, C., Franklin, M.: On-the-fly sharing for streamed aggregation. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 623–634, 2006
31. Guirguis, S., Sharaf, M.A., Chrysanthis, P.K., Labrinidis, A.: Optimized processing of multiple aggregate continuous queries. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1515–1524, 2011
32. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)
33. Rosen, K.H.: Elementary Number Theory and Its Applications. Addison-Wesley Publishing, Reading (1988)
34. Chaudhuri, S., Narasayya, V.: Program for tpc-d data generation with skew. ftp://ftp.research.microsoft.com/pub/user/viveknar/tpcdskew. Accessed July 2012

Acknowledgments

This work is supported by the National Key Basic Research Program of China under Grant No. 2010CB328104, the National Natural Science Foundation of China under Grants No. 61320106007, No. 61070161, No. 61003257, No. 61202449, and No. 61272054, the China National Key Technology R&D Program under Grants No. 2010BAI88B03 and No. 2011BAK21B02, the China Specialized Research Fund for the Doctoral Program of Higher Education under Grant No. 20110092130002, the China National Science and Technology Major Project under Grant No. 2010ZX01044-001-001, the Jiangsu Provincial Natural Science Foundation of China under Grant No. BK2008030, the Jiangsu research prospective joint research project under Grants No. BY2012202 and No. BY2013073-01, the Jiangsu Provincial Key Laboratory of Network and Information Security under Grant No. BM2003201, the Key Laboratory of Computer Network and Information Integration of the Ministry of Education of China under Grant No. 93K-9, and the Shanghai Key Laboratory of Scalable Computing and Systems (2010DS680095).

Author information

Authors and Affiliations

Southeast University, Nanjing, China
Yuxiang Wang, Junzhou Luo, Aibo Song & Fang Dong

Corresponding author

Correspondence to Yuxiang Wang.

Additional information

Communicated by Shrideep Pallickara.

About this article

Cite this article

Wang, Y., Luo, J., Song, A. et al. OATS: online aggregation with two-level sharing strategy in cloud. Distrib Parallel Databases 32, 467–505 (2014). https://doi.org/10.1007/s10619-014-7141-2

Issue Date: December 2014
