Abstract
The problem of sampling from data streams has attracted significant interest in the last decade. Whichever sampling criterion is considered (uniform sample, maximally diverse sample, etc.), the challenges stem from the relatively small amount of memory available in the face of unbounded streams. In this work we consider an extension of this problem, motivated by recent improvements in sensing technologies and robotics. In some situations it is possible not only to digitally sense some aspects of the world, but also to physically capture a tangible part of it. Currently deployed examples include devices that capture water or air samples, and devices that capture individual insects or fish. Such devices create a twist on the stream sampling problem because, in most cases, the decision to take a physical sample is irrevocable. We show how to generalize diversification sampling strategies to the irrevocable-choice setting, demonstrating our ideas on several real-world domains.
Additional information
Responsible editor: Thomas Gärtner, Mirco Nanni, Andrea Passerini and Celine Robardet.
About this article
Cite this article
Zhu, Y., Keogh, E. Irrevocable-choice algorithms for sampling from a stream. Data Min Knowl Disc 30, 998–1023 (2016). https://doi.org/10.1007/s10618-016-0472-z
DOI: https://doi.org/10.1007/s10618-016-0472-z