Sliding windows over uncertain data streams

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Uncertain data streams can contain tuples with both value uncertainty and existential uncertainty. A tuple has value uncertainty when it can assume multiple possible values. A tuple is existentially uncertain when the probabilities of its possible values sum to less than 1. Existential uncertainty can arise, for example, when relational operators are applied to streams with value uncertainty. Several prior works have focused on querying and mining data streams with both value and existential uncertainty. However, none of them has studied in depth the implications of existential uncertainty for sliding window processing, even though it arises naturally when processing uncertain data. In this work, we study the challenges arising from existential uncertainty, specifically the management of count-based sliding windows, which are a basic building block of stream processing applications. We extend the semantics of sliding windows to define the novel concept of uncertain sliding windows, and we provide both exact and approximate algorithms for managing windows under existential uncertainty. We also show how current state-of-the-art techniques for answering similarity join queries can be easily adapted for use with uncertain sliding windows. We evaluate the proposed techniques under a variety of configurations using real data. The results show that the algorithms used to maintain uncertain sliding windows operate efficiently while providing a high-quality approximation in query answering. In addition, we show that sort-based similarity join algorithms can outperform index-based techniques (on 17 real datasets) when the number of possible values per tuple is low, as in many real-world applications.
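The two kinds of uncertainty defined in the abstract can be made concrete with a minimal sketch. It is not the algorithm proposed in the article: the names `UncertainTuple`, `existence_probability` and `existing_count_distribution` are hypothetical, and the sketch assumes that tuples exist independently of one another. It illustrates why a count-based window is harder to maintain under existential uncertainty: the number of tuples that actually exist among the last n arrivals is itself a random variable (a Poisson binomial distribution under the independence assumption), so the window boundary is no longer fixed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class UncertainTuple:
    # Hypothetical representation: maps each possible value to its probability.
    values: Dict[float, float]

    def existence_probability(self) -> float:
        # The tuple exists with probability equal to the total mass of its values.
        return sum(self.values.values())

    def is_existentially_uncertain(self, eps: float = 1e-12) -> bool:
        # Existential uncertainty: the probabilities sum to less than 1.
        return self.existence_probability() < 1.0 - eps


def existing_count_distribution(tuples: List[UncertainTuple]) -> List[float]:
    """Return dist with dist[k] = P(exactly k of the tuples exist).

    Assumes independent existence (an assumption of this sketch).
    Runs in O(n^2) time by folding in one tuple at a time.
    """
    dist = [1.0]  # with no tuples, zero exist with probability 1
    for t in tuples:
        p = t.existence_probability()
        nxt = [0.0] * (len(dist) + 1)
        for k, mass in enumerate(dist):
            nxt[k] += mass * (1.0 - p)   # this tuple does not exist
            nxt[k + 1] += mass * p       # this tuple exists
        dist = nxt
    return dist


# Three arrivals: the first has only value uncertainty, the other two
# are also existentially uncertain.
stream = [
    UncertainTuple({1.0: 0.5, 2.0: 0.5}),
    UncertainTuple({3.0: 0.4}),
    UncertainTuple({4.0: 0.3, 5.0: 0.2}),
]
dist = existing_count_distribution(stream)
```

In this example, asking for "the last two tuples" has no single answer: depending on which tuples exist, the last three arrivals may contain one, two, or three existing tuples.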

Notes

  1. Each dimension can be considered as an attribute.

  2. The uniform distribution over \([0, x]\) has a fixed standard deviation that is only dependent on \(x\). To vary the standard deviation, we adapt the value of \(x\) (for \(\sigma =0.25\), \(x\approx 0.87\)).
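
As a worked check of the figure in note 2, using the standard result that a uniform distribution over \([0, x]\) has variance \(x^2/12\):

\[
\sigma = \frac{x}{\sqrt{12}} \quad\Longrightarrow\quad x = \sigma\sqrt{12} = 0.25 \times 3.4641 \approx 0.87
\]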

Author information

Corresponding author

Correspondence to Michele Dallachiesa.

About this article

Cite this article

Dallachiesa, M., Jacques-Silva, G., Gedik, B. et al. Sliding windows over uncertain data streams. Knowl Inf Syst 45, 159–190 (2015). https://doi.org/10.1007/s10115-014-0804-5

  • DOI: https://doi.org/10.1007/s10115-014-0804-5
