Abstract
Recently, there has been significant interest in running real-time analytical queries on systems that can handle both “big data” and “fast data”. In this paper, we propose an approximate query answering approach, called ROSE, which manages big and fast data streams and supports complex analytical queries against them. To achieve this goal, we start with an analysis of existing query processing techniques in big data systems to understand the requirements of building a distributed analytic sketch. We then propose a sampling-based sketch that can extract multi-faceted samples from asynchronous data streams, and augment its usability with accuracy-lossless distributed sketch construction operations, such as splitting, merging, and union. Experimental results on real-world data sets indicate that, compared with the state-of-the-art approximate answering engine BlinkDB, our techniques obtain more accurate estimates and improve system throughput by a factor of 2. Compared with the distributed in-memory computing system Spark, our system improves query response time by two orders of magnitude.
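To make the idea of an accuracy-lossless merge concrete, the following is a minimal sketch of a mergeable sample. It is not the ROSE sketch, whose construction the abstract does not describe. It swaps in a generic bottom-k sample as a stand-in, and the class `BottomKSample`, the helper `priority`, and the `rec-…` keys are hypothetical names chosen for illustration. It assumes that record keys are unique across streams and that all sketches share the same `k`.

```python
import hashlib
import heapq


def priority(key: str) -> float:
    """Map a record key to a deterministic pseudo-random priority in [0, 1)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64


class BottomKSample:
    """Illustrative bottom-k sample: keeps the k records with the smallest priority.

    Assumption (not from the paper): keys are unique across all streams.
    """

    def __init__(self, k: int):
        self.k = k
        # Min-heap on negated priority, so the root is the worst retained record.
        self._heap = []

    def add(self, key: str, value: float) -> None:
        entry = (-priority(key), key, value)
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The new record has a smaller priority than the worst retained one.
            heapq.heapreplace(self._heap, entry)

    def items(self):
        """Return retained records as (priority, key, value), smallest priority first."""
        return sorted((-neg, key, value) for neg, key, value in self._heap)

    def merge(self, other: "BottomKSample") -> "BottomKSample":
        """Combine two sketches built with the same k over disjoint streams."""
        merged = BottomKSample(self.k)
        for _, key, value in self._heap + other._heap:
            merged.add(key, value)
        return merged


# Usage: two partial sketches merge into the sketch of the combined stream.
left, right, combined = BottomKSample(50), BottomKSample(50), BottomKSample(50)
for i in range(1000):
    key = f"rec-{i}"
    (left if i % 2 == 0 else right).add(key, float(i))
    combined.add(key, float(i))
merged = left.merge(right)
```

Because the k smallest priorities of the combined stream must each be among the k smallest of whichever partial stream holds them, merging the partial sketches yields the same sample as a single pass over all the data. This is one simple sense in which a distributed merge can lose no accuracy.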
Acknowledgment
The authors would like to thank the anonymous reviewers for their comments and suggestions which have helped to improve the quality of this paper. This work was supported by the National Key Research and Development Program of China (2016YFB0801305).
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Wu, G. et al. (2017). Supporting Real-Time Analytic Queries in Big and Fast Data Environments. In: Candan, S., Chen, L., Pedersen, T., Chang, L., Hua, W. (eds) Database Systems for Advanced Applications. DASFAA 2017. Lecture Notes in Computer Science, vol 10178. Springer, Cham. https://doi.org/10.1007/978-3-319-55699-4_29
DOI: https://doi.org/10.1007/978-3-319-55699-4_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-55698-7
Online ISBN: 978-3-319-55699-4
eBook Packages: Computer Science (R0)