OATS: online aggregation with two-level sharing strategy in cloud

Distributed and Parallel Databases

Abstract

Online aggregation (OLA) is an attractive sampling-based technique that answers aggregation queries with an approximate estimate of the final result, with a confidence interval that becomes tighter over time. It has been built into MapReduce-based cloud systems for big data analytics, allowing users to monitor query progress and save money by killing the computation early once sufficient accuracy has been obtained. However, the performance of OLA is seriously limited by the lack of sharing when multiple OLA queries are processed. In the original MapReduce paradigm, each query is processed independently without considering potential sharing opportunities, which leads to two major unnecessary execution costs: (1) a large redundant I/O cost, and (2) a duplicated statistical computation cost. To eliminate these costs and improve overall performance, we present online aggregation with a two-level sharing strategy in the cloud (OATS), built on the MapReduce framework, to effectively support online aggregation for large-scale concurrent query processing over skewed data distributions. In the first-level sharing, we propose a sample buffer management mechanism that shares sampling opportunities among multiple OLA queries to reduce the redundant I/O cost. In the second-level sharing, we propose a heuristic algorithm (which scales well to large inputs) for the statistical computation that shares partial statistics calculation to decrease the number of final aggregation operations, thereby reducing the statistical computation cost. Based on this two-level sharing strategy, we have implemented OATS in Hadoop and conducted an extensive experimental study on the TPC-H benchmark with skewed data distributions. Our results demonstrate the efficiency and effectiveness of OATS.
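
The general idea of online aggregation with shared sampling can be illustrated with a minimal Python sketch. It is not the OATS implementation, whose details are not given in this abstract. It only shows a running AVG estimate whose normal-approximation confidence interval narrows as more tuples are sampled, with each sampled tuple read once and offered to every concurrent query. The names `OnlineAvg` and `run_shared`, the 95% confidence level, the batch size, and the toy data are all assumptions made for illustration. The finite-population correction is ignored, so the interval does not collapse to zero even after a full pass.

```python
import math
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Z_95 = 1.96  # normal quantile for a two-sided 95% interval (assumed level)


@dataclass
class OnlineAvg:
    """Hypothetical running AVG over tuples matching a predicate (Welford's method)."""
    predicate: Callable[[float], bool]
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the running mean

    def update(self, x: float) -> None:
        if not self.predicate(x):
            return
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def half_width(self) -> float:
        """Half-width of the CLT-based interval; infinite until two tuples are seen."""
        if self.n < 2:
            return math.inf
        variance = self.m2 / (self.n - 1)
        return Z_95 * math.sqrt(variance / self.n)


def run_shared(data: Sequence[float], queries: List[OnlineAvg],
               batch_size: int, seed: int = 0) -> List[List[Tuple[float, float]]]:
    """Scan a random permutation once and feed every sampled tuple to all queries.

    Returns, per batch, the (estimate, half-width) pair of each query.
    """
    rng = random.Random(seed)
    order = list(data)
    rng.shuffle(order)  # random order stands in for sampling without replacement
    history = []
    for start in range(0, len(order), batch_size):
        for x in order[start:start + batch_size]:
            for q in queries:  # one read serves every concurrent query
                q.update(x)
        history.append([(q.mean, q.half_width()) for q in queries])
    return history


if __name__ == "__main__":
    values = [float(i) for i in range(1000)]
    qs = [OnlineAvg(lambda x: True), OnlineAvg(lambda x: x >= 500)]
    for step, snapshot in enumerate(run_shared(values, qs, batch_size=100), 1):
        print(step, [(round(m, 1), round(h, 1)) for m, h in snapshot])
```

A user watching the printed snapshots could stop the scan as soon as the half-width falls below an acceptable error, which is the early-termination benefit the abstract describes.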

Notes

  1. Aggregation queries are among the most common query types for data analysis; they often scan a large number of tuples to generate summary and statistical results.

  2. Southeast University Cloud Platform, which supports data processing applications of the whole university.

  3. To the best of our knowledge, OATS is the first work to study shared OLA with a two-level sharing strategy over MapReduce, and the greedy algorithm proposed in [31] is the most recent effective solution to the second-level sharing problem. Therefore, we select this greedy algorithm as the candidate to compare against our SLSA algorithm.

  4. In the case of right \(\rightarrow \) left, \(w_{in}\) indicates the weight assigned to the right \(sg\). Otherwise, it indicates the weight assigned to the left \(sg\).

  5. The expired partial statistics are defined as the statistics that have been reused by all \(M_x^i\in sg\).

  6. We do not show a comparison with COLA in this paper for two reasons: (1) several implementation details of our COLA prototype may differ from the original COLA because [2] provides limited information, which may affect the comparison result to some extent, and (2) the Hadoop platform of our prototype is deployed on virtual clusters rather than the physical clusters mentioned in [2], which leads to differences between our prototype and the original COLA.

References

  1. Herodotou, H., Lim, H., Luo, G., Borisov, N., Dong, L., Cetin, F.B., Babu, S.: Starfish: a self-tuning system for big data analytics. In: Biennial Conference on Innovative Data Systems Research (CIDR), pp. 261–272, 2011

  2. Shi, Y., Meng, X., Wang, F., Gan, Y.: You can stop early with cola: online processing of aggregate queries in the cloud. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1223–1232, 2012

  3. Hadoop: The apache software foundation. http://hadoop.apache.org. Accessed Nov 2012

  4. Borkar, V., Carey, M., Grover, R., Onose, N., Vernica, R.: Hyracks: A flexible and extensible foundation for data-intensive computing. In: Proceedings of IEEE International Conference on Data Engineering (ICDE), pp. 1151–1162, 2011

  5. Kolodziej, J., Khan, S.U.: Data scheduling in data grids and data centers: a short taxonomy of problems and intelligent resolution techniques. Trans. Comput. Collect. Intell. X. 7777, 103–119 (2013)

  6. Zaharia, M., Borthakur, D., Sarma, J.S., et al.: Delay scheduling: a simple technique for achieving locality and fairness in cluster scheduling. In: Proceedings of the 5th European Conference on Computer Systems (EuroSys), pp. 265–278, 2010

  7. Jin, H., Yang, X., Sun, X., et al.: Adapt: Availability-aware mapreduce data placement for non-dedicated distributed computing. In: Proceedings of the 32nd International Conference on Distributed Computing Systems (ICDCS), pp. 516–525, 2012

  8. Grover, R., Carey, M.J.: Extending map-reduce for efficient predicate-based sampling. In: Proceedings of the 28th International Conference on Data Engineering (ICDE), pp. 486–497, 2012

  9. Moseley, B., Dasgupta, A., Kumar, R., et al.: On scheduling in map-reduce and flow-shops. In: Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pp. 289–298, 2011

  10. Tao, Y., Lin, W., Xiao, X.: Minimal mapreduce algorithms. In: Proceedings of the International Conference on Management of Data (SIGMOD), pp. 529–540, 2013

  11. Elteir, M., Lin, H., Feng, W.: Enhancing mapreduce via asynchronous data processing. In: Proceedings of the International Conference on Parallel and Distributed Systems (ICPADS), pp. 397–405 (2010)

  12. Wu, S., Ooi, B.C., Tan, K.-L.: Continuous sampling for online aggregation over multiple queries. In: Proceedings of ACM International Conference on Management of Data (SIGMOD), pp. 651–662, 2010

  13. Chaudhuri, S., Das, G., Datar, M., Motwani, R., Narasayya, V.: Overcoming limitations of sampling for aggregation queries. In: Proceedings of IEEE International Conference on Data Engineering (ICDE), pp. 534–542, 2001

  14. Laptev, N., Zeng, K., Zaniolo, C.: Early accurate results for advanced analytics on mapreduce. Proc. VLDB Endow. 5(10), 1028–1039 (2012)

  15. Hellerstein, J.M., Haas, P.J., Wang, H.J.: Online aggregation. ACM SIGMOD Rec. 26(2), 171–182 (1997)

  16. Haas, P.J.: Large-sample and deterministic confidence intervals for online aggregation. In: Proceedings of the 9th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 51–62, 1997

  17. Haas, P.J., Hellerstein, J.M.: Ripple joins for online aggregation. ACM SIGMOD Rec. 28(2), 287–298 (1999)

  18. Luo, G., Ellmann, C.J., Haas, P.J., Naughton, J.F.: A scalable hash ripple join algorithm. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 252–262, 2002

  19. Wang, Y., Luo, J., Song, A., Jin, J., Dong, F.: Improving online aggregation performance for skewed data distribution. In: Proceedings of the 17th International Conference on Database Systems for Advanced Applications (DASFAA), pp. 18–32 (2012)

  20. Wu, S., Jiang, S., Ooi, B.C., Tan, K.-L.: Distributed online aggregations. Proc. VLDB Endow. 2(1), 443–454 (2009)

  21. Pansare, N., Borkar, V., Jermaine, C., Condie, T.: Online aggregation for large mapreduce jobs. Proc. VLDB Endow. 4(11), 1135–1145 (2011)

  22. Böse, J.-H., Andrzejak, A., Högqvist, M.: Beyond online aggregation: parallel and incremental data mining with online map-reduce. In: Proceedings of the ACM Workshop on Massive Data Analytics on the Cloud (MDAC), pp. 3–8, 2010

  23. Condie, T., Conway, N., Alvaro, P., Hellerstein, J.M., et al.: Online aggregation and continuous query support in mapreduce. In: Proceedings of ACM International Conference on Management of Data (SIGMOD), pp. 1115–1118, 2010

  24. Wang, Y., Luo, J., Song, A., Fang, D.: Partition-based online aggregation with shared sampling in cloud. J. Comput. Sci. Technol. 28(6), 989–1011 (2013)

  25. Qin, C., Rusu, F.: Parallel online aggregation in action. In: Proceedings of the 25th International Conference on Scientific and Statistical Database Management (SSDBM), pp. 46–49, 2013

  26. Gan, Y., Meng, X., Shi, Y.: Processing online aggregation on skewed data in mapreduce. In: Proceedings of the 5th International Workshop on Cloud Data Management (CloudDB), pp. 3–10, 2013

  27. Ghemawat, S., Gobioff, H., Leung, S.T.: The google file system. In: Proceedings of the 19th ACM Symposium on Operating Systems Principles (SOSP), pp. 29–43, 2003

  28. Eltabakh, M.Y., Tian, Y., Özcan, F., Gemulla, R., Krettek, A., McPherson, J.: Cohadoop: flexible data placement and its exploitation in hadoop. Proc. VLDB Endow. 4(9), 575–585 (2011)

  29. Nykiel, T., Potamias, M., Mishra, C., Kollios, G., Koudas, N.: MRshare: sharing across multiple queries in mapreduce. Proc. VLDB Endow. 3(1), 494–505 (2010)

  30. Krishnamurthy, S., Wu, C., Franklin, M.: On-the-fly sharing for streamed aggregation. In: Proceedings of the ACM International Conference on Management of Data (SIGMOD), pp. 623–634, 2006

  31. Guirguis, S., Sharaf, M.A., Chrysanthis, P.K., Labrinidis, A.: Optimized processing of multiple aggregate continuous queries. In: Proceedings of the 20th ACM International Conference on Information and Knowledge Management (CIKM), pp. 1515–1524, 2011

  32. Garey, M.R., Johnson, D.S.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman, New York (1979)

  33. Rosen, K.H.: Elementary Number Theory and Its Applications. Addison-Wesley, Reading (1988)

  34. Chaudhuri, S., Narasayya, V.: Program for tpc-d data generation with skew. ftp://ftp.research.microsoft.com/pub/user/viveknar/tpcdskew. Accessed July 2012

Acknowledgments

This work is supported by National Key Basic Research Program of China under Grants No. 2010CB328104, National Natural Science Foundation of China under Grants No. 61320106007, No. 61070161, No. 61003257, No. 61202449, No. 61272054, China National Key Technology R&D Program under Grants No. 2010BAI88B03 and No. 2011BAK21B02, China Specialized Research Fund for the Doctoral Program of Higher Education under Grants No. 20110092130002, China National Science and Technology Major Project under Grants No. 2010ZX01044-001-001, Jiangsu Provincial Natural Science Foundation of China under Grants No. BK2008030, Jiangsu research prospective joint research project under Grants No. BY2012202, No. BY2013073-01, Jiangsu Provincial Key Laboratory of Network and Information Security under Grants No. BM2003201, Key Laboratory of Computer Network and Information Integration of Ministry of Education of China under Grants No. 93K-9, and Shanghai Key Laboratory of Scalable Computing and Systems (2010DS680095).

Author information

Corresponding author

Correspondence to Yuxiang Wang.

Additional information

Communicated by Shrideep Pallickara.

About this article

Cite this article

Wang, Y., Luo, J., Song, A. et al. OATS: online aggregation with two-level sharing strategy in cloud. Distrib Parallel Databases 32, 467–505 (2014). https://doi.org/10.1007/s10619-014-7141-2

  • DOI: https://doi.org/10.1007/s10619-014-7141-2
