Abstract
Current work on stream processing focuses on approximation techniques that compute approximate answers to simple queries by restricting attention to a fixed or sliding window over the most recent tuples of an input stream and by using condensed synopses to summarize the state. It is widely believed that, without approximation techniques, most interesting queries would be blocking (i.e., they would have to wait for the end of the stream to release their results) or unbounded (i.e., their memory requirements would grow proportionally to the stream size, which may be infinite). The goal of this paper is to automatically convert nested-relational queries to incremental stream processing programs. In contrast to most current stream processing systems, which calculate approximate answers, our system derives incremental programs that return accurate results. It does so by retaining state for the lifetime of the query evaluation and by using incremental evaluation techniques to return, at each time interval, an accurate snapshot answer that depends on the current state and the data in the current fixed window. Our methods can handle most forms of declarative queries on nested data collections, including arbitrarily nested queries, group-by with aggregation, and equi-joins. We report on a prototype system implementation and show some preliminary results on evaluating queries on a small computer cluster running Spark.
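The idea of combining a retained state with the data in the current window can be illustrated with a minimal sketch of a group-by query with an average aggregate. This is not the paper's translation scheme or its Spark implementation; it is plain Python under stated assumptions: the names `merge_batch` and `snapshot` are hypothetical, each batch stands in for one fixed window, and the state keeps one (sum, count) pair per key.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

# Hypothetical state layout: key -> (running sum, running count).
State = Dict[str, Tuple[float, int]]


def merge_batch(state: State, batch: Iterable[Tuple[str, float]]) -> State:
    """Fold one batch of (key, value) tuples into the retained state."""
    # Aggregate the new batch on its own first (the "delta").
    delta = defaultdict(lambda: (0.0, 0))
    for key, value in batch:
        s, c = delta[key]
        delta[key] = (s + value, c + 1)
    # Merge the delta into a copy of the previous state.
    # Tuples from earlier batches are never rescanned.
    merged = dict(state)
    for key, (s, c) in delta.items():
        old_s, old_c = merged.get(key, (0.0, 0))
        merged[key] = (old_s + s, old_c + c)
    return merged


def snapshot(state: State) -> Dict[str, float]:
    """Exact per-key average over every tuple seen so far."""
    return {key: s / c for key, (s, c) in state.items()}


# Three batches, each standing in for one fixed window of the stream.
batches = [
    [("a", 1.0), ("b", 4.0)],
    [("a", 3.0)],
    [("b", 2.0), ("c", 5.0)],
]

state: State = {}
answers = []
for batch in batches:
    state = merge_batch(state, batch)
    answers.append(snapshot(state))
```

In this sketch each snapshot equals the answer a batch query would give over all tuples received so far, while the state grows with the number of distinct keys rather than with the length of the stream.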
Keywords
- Return Accurate Results
- Incremental Evaluation Technique
- Distributed Stream Processing Engine (DSPE)
- GroupBy
- Streaming Data Sources
These keywords were added by machine and not by the authors.
Acknowledgments
This work is supported in part by the National Science Foundation under the grant CCF-1117369. Our performance evaluations were performed at the Chameleon cloud computing infrastructure, www.chameleoncloud.org, supported by NSF.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Fegaras, L. (2016). Incremental Stream Processing of Nested-Relational Queries. In: Hartmann, S., Ma, H. (eds) Database and Expert Systems Applications. DEXA 2016. Lecture Notes in Computer Science, vol 9827. Springer, Cham. https://doi.org/10.1007/978-3-319-44403-1_19
DOI: https://doi.org/10.1007/978-3-319-44403-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-44402-4
Online ISBN: 978-3-319-44403-1
eBook Packages: Computer Science, Computer Science (R0)