ABSTRACT
This paper introduces the problem of random sampling from time-based sliding windows over weighted streaming data and presents a priority random sampling (PRS) algorithm for it. The algorithm extends the classic reservoir-sampling algorithm and the weighted random sampling (WRS) algorithm with a reservoir to handle the expiration of data items from a time-based sliding window, and it avoids the drawbacks of both. In the new algorithm, each data item in the time-based sliding window is assigned a key that trades off its weight against its arrival time. The algorithm works even when the number of data items in the sliding window varies dynamically over time. Experiments show that the PRS algorithm is somewhat superior to the WRS algorithm.
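The abstract does not state the PRS key formula, so the following is not the paper's algorithm. It is a minimal sketch of the general idea under stated assumptions: each item gets the key u^(1/w) used in weighted random sampling with a reservoir, and expiration is handled by keeping only those items that could still rank among the k largest keys before they leave the window. The class name, method names, and parameters are hypothetical, and timestamps are assumed to arrive in non-decreasing order.

```python
import random
from collections import deque


class SlidingWindowWeightedSampler:
    """Illustrative sketch only; not the PRS algorithm from the paper."""

    def __init__(self, k, window, rng=None):
        self.k = k                      # sample size
        self.window = window            # window length, in time units
        self.rng = rng or random.Random()
        # Entries are [timestamp, key, item, n_later_items_with_larger_key],
        # stored in arrival order.
        self.candidates = deque()

    def _expire(self, now):
        # An item is in the window while timestamp > now - window.
        while self.candidates and self.candidates[0][0] <= now - self.window:
            self.candidates.popleft()

    def add(self, item, weight, timestamp):
        if weight <= 0:
            raise ValueError("weight must be positive")
        self._expire(timestamp)
        u = 1.0 - self.rng.random()     # uniform in (0, 1]
        key = u ** (1.0 / weight)       # larger weight -> key closer to 1
        kept = deque()
        for entry in self.candidates:
            if entry[1] < key:
                entry[3] += 1           # one more later item outranks it
            # An item outranked by k later items can never re-enter the
            # top k, because those later items expire after it does.
            if entry[3] < self.k:
                kept.append(entry)
        kept.append([timestamp, key, item, 0])
        self.candidates = kept

    def sample(self, now):
        # Return up to k items: those with the largest keys in the window.
        self._expire(now)
        top = sorted(self.candidates, key=lambda e: e[1], reverse=True)
        return [e[2] for e in top[: self.k]]


if __name__ == "__main__":
    sampler = SlidingWindowWeightedSampler(k=2, window=5)
    for t in range(10):
        sampler.add(item=f"item-{t}", weight=1.0 + t, timestamp=t)
    print(sampler.sample(now=9))
```

Because the number of stored candidates depends on the keys rather than on a fixed window count, this sketch tolerates a window whose population changes over time. It does not fold arrival time into the key itself, which is the part of PRS that the abstract describes but does not specify.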
Index Terms
- A priority random sampling algorithm for time-based sliding windows over weighted streaming data