ABSTRACT
Processing large data streams is now a major topic in data management. The data involved can be truly massive, and the required analyses complex. In a stream of sequential events such as stock feeds, sensor readings, or IP traffic measurements, data tuples pertaining to recent events are typically more important than older ones. This can be formalized via time-decay functions, which assign a weight to each data item based on its age. Decay functions such as sliding windows and exponential decay have been studied under the assumption of well-ordered arrivals, i.e., data arrives in non-decreasing order of timestamps. However, data quality issues are prevalent in massive streams (due to network asynchrony, delays, and similar causes), and correct arrival order is not guaranteed.
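As a minimal illustration of how a time-decay function turns an item's age into a weight, the sketch below writes three common decay functions in Python. The function names and the parameters `window`, `lam`, and `alpha` are assumptions made for this example, as are the exact forms chosen (in particular the inclusive window boundary and the `age + 1` shift in the polynomial case); they are not taken from the paper.

```python
import math


def sliding_window_weight(age, window):
    """Weight 1 for items whose age is within [0, window], 0 otherwise (assumed inclusive)."""
    return 1.0 if 0 <= age <= window else 0.0


def exponential_weight(age, lam):
    """Weight exp(-lam * age); lam > 0 is an assumed decay-rate parameter."""
    return math.exp(-lam * age)


def polynomial_weight(age, alpha):
    """Weight (age + 1)^(-alpha); the +1 shift avoids division by zero at age 0."""
    return (age + 1.0) ** (-alpha)
```

In each case the weight depends only on the age `now - timestamp`, not on the position at which the item arrived in the stream.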
We focus on the computation of decayed aggregates such as range queries, quantiles, and heavy hitters on out-of-order streams, where elements do not necessarily arrive in increasing order of timestamps. Existing techniques such as Exponential Histograms and Waves are unable to handle out-of-order streams. We give the first deterministic algorithms for approximating these aggregates under popular decay functions such as sliding window and polynomial decay. We study the overhead of allowing out-of-order arrivals when compared to well-ordered arrivals, both analytically and experimentally. Our experiments confirm that these algorithms can be applied in practice, and compare the relative performance of different approaches for handling out-of-order arrivals.
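To make the problem statement concrete, the following sketch computes a decayed count and decayed heavy hitters exactly by keeping every tuple. It is only a naive baseline under stated assumptions, not the small-space deterministic algorithms the paper describes: the names `decayed_count`, `decayed_heavy_hitters`, `window_weight`, and the threshold `phi` are hypothetical, and the stream is assumed to be a list of `(timestamp, value)` pairs that may arrive in any order.

```python
from collections import defaultdict


def window_weight(window):
    """Return an (assumed) sliding-window decay function of age."""
    return lambda age: 1.0 if 0 <= age <= window else 0.0


def decayed_count(stream, now, weight):
    """Exact decayed count: sum of weight(now - ts) over all tuples, in any arrival order."""
    return sum(weight(now - ts) for ts, _ in stream)


def decayed_heavy_hitters(stream, now, weight, phi):
    """Values whose decayed weight is at least a phi fraction of the total decayed weight."""
    totals = defaultdict(float)
    for ts, value in stream:
        totals[value] += weight(now - ts)
    total = sum(totals.values())
    if total <= 0:
        return set()
    return {v for v, w in totals.items() if w >= phi * total}


# Timestamps deliberately out of order.
stream = [(10, "a"), (7, "b"), (9, "a"), (3, "b"), (8, "c")]
w = window_weight(3)
count = decayed_count(stream, now=10, weight=w)
hitters = decayed_heavy_hitters(stream, now=10, weight=w, phi=0.5)
```

Because each weight is derived from the timestamp rather than the arrival position, this baseline gives the same answer for any arrival order; its cost is space linear in the stream, which is the overhead that summary structures aim to avoid.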
Index Terms
- Time-decaying aggregates in out-of-order streams