ABSTRACT
Reservoir sampling is a statistical sampling technique developed almost 40 years ago to enable analysis of what was then large-scale data while using limited computer memory. We present an overview of frequently used reservoir sampling techniques and discuss how they can be used for learning from data streams. While they are not ideal for every scenario, they can easily be modified for many purposes, and they also prove surprisingly useful in modern data analysis approaches.
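As an illustration of the basic idea, the following is a minimal sketch of the simplest uniform variant, commonly known as Algorithm R. It is not taken from the paper itself; the function name `reservoir_sample` and its parameters are assumptions chosen for this example. It keeps a fixed-size sample of `k` items from a stream of unknown length, using memory proportional to `k` only.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    stream: Iterable[T], k: int, rng: Optional[random.Random] = None
) -> List[T]:
    """Return a uniform random sample of up to k items from a stream.

    Hypothetical helper for illustration: one pass, O(k) memory.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item number i+1 is kept with probability k / (i + 1).
            # randint is inclusive on both ends, so j is uniform on 0..i.
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 5 items from a stream of 10,000 integers.
sample = reservoir_sample(range(10_000), 5, random.Random(42))
```

Each item seen so far ends up in the reservoir with equal probability, which is why the sample stays uniform however long the stream turns out to be. Biased or windowed variants change the replacement rule but keep the same fixed-memory structure.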