Incremental Sliding Window Analytics

  • Living reference work entry
  • In: Encyclopedia of Big Data Technologies

Abstract

Sliding window computations are widely used for large-scale data analysis, particularly in live systems where new data arrives continuously. These computations consume significant computational resources because they typically recompute over the full window of data every time the window slides. In this chapter, we propose techniques for improving the scalability of sliding window computations by performing them incrementally. In our approach, when new data is added at the end of the window or old data is dropped from its beginning, the output is updated automatically and efficiently by reusing the results of previously run sub-computations. The key idea is to organize the sub-computations as a shallow (logarithmic-depth) balanced tree and to perform incremental updates by propagating changes through this tree. This approach is motivated and inspired by advances in self-adjusting computation, which enables automatic and efficient incremental computation. We present a Hadoop-based implementation that also provides a dataflow query processing interface, and we evaluate it with a variety of applications and real-world case studies. Our results show significant performance improvements for large-scale sliding window computations without any modifications to the existing data analysis code.
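
The tree-based idea described in the abstract can be illustrated with a small single-machine sketch. This is not the Hadoop-based implementation the entry refers to. It is a minimal example under stated assumptions: the class name `WindowTree`, its methods, and the fixed window capacity are all hypothetical, and the combine function is assumed to be associative and commutative (for example a sum, a maximum, or a merge of counters). Leaves hold the per-item results, inner nodes cache the combined result of their children, and a change to one leaf recomputes only the nodes on the path to the root.

```python
from typing import Callable, Generic, List, Optional, TypeVar

T = TypeVar("T")


class WindowTree(Generic[T]):
    """Balanced binary tree of cached partial aggregates over a window.

    Hypothetical illustration: assumes `combine` is associative and
    commutative, because leaf slots are reused in circular order.
    """

    def __init__(self, capacity: int, combine: Callable[[T, T], T]) -> None:
        if capacity < 1:
            raise ValueError("capacity must be positive")
        size = 1
        while size < capacity:      # round leaf count up to a power of two
            size *= 2
        self.size = size
        self.capacity = capacity
        self.combine = combine
        # nodes[1] is the root; the leaf for slot s is nodes[size + s]
        self.nodes: List[Optional[T]] = [None] * (2 * size)
        self.head = 0               # slot of the oldest item in the window
        self.count = 0              # number of items currently in the window
        self.combine_calls = 0      # instrumentation for the demo only

    def _merge(self, a: Optional[T], b: Optional[T]) -> Optional[T]:
        # None marks an empty subtree and acts as the identity element.
        if a is None:
            return b
        if b is None:
            return a
        self.combine_calls += 1
        return self.combine(a, b)

    def _set_leaf(self, slot: int, value: Optional[T]) -> None:
        # Change propagation: update one leaf, then refresh its ancestors.
        i = self.size + slot
        self.nodes[i] = value
        i //= 2
        while i >= 1:
            self.nodes[i] = self._merge(self.nodes[2 * i], self.nodes[2 * i + 1])
            i //= 2

    def append(self, value: T) -> None:
        """Add a new item at the end of the window."""
        if self.count == self.capacity:
            raise OverflowError("window is full; drop the oldest item first")
        slot = (self.head + self.count) % self.capacity
        self._set_leaf(slot, value)
        self.count += 1

    def drop_oldest(self) -> None:
        """Remove the item at the beginning of the window."""
        if self.count == 0:
            raise IndexError("window is empty")
        self._set_leaf(self.head, None)
        self.head = (self.head + 1) % self.capacity
        self.count -= 1

    def result(self) -> Optional[T]:
        """Return the aggregate over the whole window (None if empty)."""
        return self.nodes[1]
```

In this sketch, sliding a full window by one item is a `drop_oldest()` followed by an `append(x)`. Each call touches one leaf and its ancestors, so a window of 1024 items needs at most 20 calls to `combine` per slide, whereas recombining the whole window from scratch would need 1023.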

Author information

Corresponding author

Correspondence to Pramod Bhatotia.

Copyright information

© 2018 Springer International Publishing AG

About this entry

Cite this entry

Bhatotia, P., Acar, U.A., Junqueira, F.P., Rodrigues, R. (2018). Incremental Sliding Window Analytics. In: Sakr, S., Zomaya, A. (eds) Encyclopedia of Big Data Technologies. Springer, Cham. https://doi.org/10.1007/978-3-319-63962-8_156-1

  • DOI: https://doi.org/10.1007/978-3-319-63962-8_156-1

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-63962-8

  • Online ISBN: 978-3-319-63962-8

  • eBook Packages: Springer Reference Mathematics, Reference Module Computer Science and Engineering
