
Sequential pattern mining in databases with temporal uncertainty

  • Regular Paper
  • Knowledge and Information Systems

Abstract

Temporally uncertain data exist widely in many real-world applications. Temporal uncertainty can arise for various reasons, such as conflicting or missing event timestamps, network latency, granularity mismatch, synchronization problems, device precision limitations, and data aggregation. In this paper, we propose an efficient algorithm to mine sequential patterns from data with temporal uncertainty. We propose an uncertainty model in which timestamps are modeled as random variables and then design a new approach to manage temporal uncertainty. We integrate it into the pattern-growth sequential pattern mining algorithm to discover probabilistic frequent sequential patterns. Extensive experiments on both synthetic and real datasets show that the proposed algorithm is both efficient and scalable.




Correspondence to Jiaqi Ge.

Appendix 1: Deriving the geographic approach


Suppose \(X \sim U(x^-, x^+)\) and \(Y \sim U(y^-, y^+)\) are two independent, uniformly distributed uncertain timestamps. Then the joint density of X and Y is:

$$\begin{aligned} f(x,y)= {\left\{ \begin{array}{ll} \frac{1}{(x^+-x^-)(y^+-y^-)}&{}\quad \text { if }\;\; x \in [x^-, x^+], y \in [y^-, y^+] \\ 0 &{}\quad \text {otherwise} \end{array}\right. } \end{aligned}$$
(28)

First of all, we define two functions \(V_x(x)\) and \(V_y(y)\) which restrict the possible values of x and y.

$$\begin{aligned} V_x(x)= & {} {\left\{ \begin{array}{ll} x^-,&{}\quad x< x^-\\ x,&{}\quad x^- \le x \le x^+\\ x^+,&{}\quad x > x^+ \end{array}\right. } \end{aligned}$$
(29)
$$\begin{aligned} V_y(y)= & {} {\left\{ \begin{array}{ll} y^-,&{}\quad y< y^-\\ y,&{}\quad y^- \le y \le y^+\\ y^+,&{}\quad y > y^+ \end{array}\right. } \end{aligned}$$
(30)
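As a minimal sketch, \(V_x\) and \(V_y\) in Eqs. (29) and (30) can be read as one clamping operation applied to different intervals. The function name `clamp` and the sample interval below are illustrative assumptions and do not come from the paper.

```python
def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to [lower, upper], as V_x and V_y do in Eqs. (29)-(30)."""
    return max(lower, min(value, upper))


# Hypothetical interval for X ~ U(0, 2), so V_x(t) = clamp(t, 0.0, 2.0).
x_lo, x_hi = 0.0, 2.0
print(clamp(-1.0, x_lo, x_hi))  # below the interval: mapped to x_lo
print(clamp(1.5, x_lo, x_hi))   # inside the interval: unchanged
print(clamp(3.0, x_lo, x_hi))   # above the interval: mapped to x_hi
```

\(V_y\) is the same function called with \(y^-\) and \(y^+\) as the bounds.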

Given the minimal gap constraint \(g_l=l\) and maximal constraint \(g_h=h\) (\(l < h\)), we have:

$$\begin{aligned} Y \le X+h&\implies y \le x^++h \\ Y \ge X+l&\implies y \ge x^-+l \end{aligned}$$

Since \(y \in [y^-, y^+]\), we also have \(V_y(x^-+l) \le y \le V_y(x^++h)\). Let \(P(\left\langle XY\right\rangle )\) be the probability that X and Y satisfy the gap constraints. Then \(P(\left\langle XY\right\rangle )\) can be computed by:

$$\begin{aligned} P(\left\langle XY\right\rangle )&=\int \int _{l \le Y-X \le h}{f(x,y)}{\mathrm {d}}x{\mathrm {d}}y\nonumber \\&= \int _{V_y(x^-+l)}^{V_y(x^++h)} {\mathrm {d}}y\int _{V_x(y-h)}^{V_x(y-l)} \frac{1}{(x^+-x^-)(y^+-y^-)}{\mathrm {d}}x\nonumber \\&= \frac{1}{S}\int _{V_y(x^-+l)}^{V_y(x^++h)}\left[ V_x(y-l) - V_x(y-h)\right] {\mathrm {d}}y \end{aligned}$$
(31)

where \(S={(x^+-x^-)(y^+-y^-)}\) is a constant. Let \(f(y)=V_x(y-l) - V_x(y-h)\) denote the integrand. Notice that if \(V_x(y-h)=x^+\), then \(V_x(y-l)=x^+\) as well because \(y-l > y-h\), so \(f(y)=0\); similarly, \(V_x(y-l)=x^-\) implies \(V_x(y-h)=x^-\), which again gives \(f(y)=0\). Such values of y contribute nothing to the integral in Eq. (31). Therefore, we only need to consider the remaining four cases:

$$\begin{aligned} V_x(y-l)&=x^+, V_x(y-h)=y-h&\implies&\max (x^++l, x^-+h) \le y \le x^++h \\ V_x(y-l)&=x^+, V_x(y-h)=x^-&\implies&x^++l< y< x^-+h\\ V_x(y-l)&=y-l, V_x(y-h)=y-h&\implies&x^-+h< y < x^++l\\ V_x(y-l)&=y-l, V_x(y-h)=x^-&\implies&x^-+l \le y \le \min (x^++l,x^-+h) \end{aligned}$$

Since we also have \(y^-\le y \le y^+\), the value of f(y) depends on different ranges of y:

$$\begin{aligned} f(y)={\left\{ \begin{array}{ll} x^++h-y&{} \quad \max \big (V_y(x^++l),V_y(x^-+h)\big ) \le y \le V_y(x^++h)\\ x^+-x^-&{}\quad V_y(x^++l)<y< V_y(x^-+h)\\ h-l&{}\quad V_y(x^-+h)<y < V_y(x^++l)\\ y-l-x^-&{}\quad V_y(x^-+l) \le y \le \min \big (V_y(x^++l),V_y(x^-+h)\big )\\ \end{array}\right. } \end{aligned}$$
(32)
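The piecewise form in Eq. (32) can be checked against the clamped difference \(V_x(y-l) - V_x(y-h)\) from Eq. (31). The sketch below does this for one hypothetical setting, \(X \sim U(0,2)\) with \(l=1\) and \(h=2\); the function names and the numbers are assumptions chosen for illustration.

```python
def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to [lower, upper] (the role of V_x in Eq. (29))."""
    return max(lower, min(value, upper))


def chord_length(y: float, x_lo: float, x_hi: float, l: float, h: float) -> float:
    """f(y) = V_x(y - l) - V_x(y - h): length of the x-range that satisfies
    l <= y - x <= h for a fixed y."""
    return clamp(y - l, x_lo, x_hi) - clamp(y - h, x_lo, x_hi)


# Hypothetical setting: X ~ U(0, 2), l = 1, h = 2.
# Here x_lo + h = 2 < x_hi + l = 3, so the middle regime of Eq. (32) is h - l.
x_lo, x_hi, l, h = 0.0, 2.0, 1.0, 2.0
rising = chord_length(1.5, x_lo, x_hi, l, h)   # regime y - l - x_lo
flat = chord_length(2.5, x_lo, x_hi, l, h)     # regime h - l
falling = chord_length(3.5, x_lo, x_hi, l, h)  # regime x_hi + h - y
print(rising, flat, falling)
```

Each value matches the corresponding branch of Eq. (32) evaluated by hand for the same y.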

Here we first set:

$$\begin{aligned} a_1&=V_y(x^-+l),&a_2&=V_y(x^-+h)\\ a_3&=V_y(x^++l),&a_4&=V_y( x^++h) \end{aligned}$$

Then the range \([a_1, a_4]\) is divided into disjoint sub-partitions by two points \(a_2, a_3\). In order to sort the values of \(a_1,a_2,a_3 \text { and }a_4\), we set:

$$\begin{aligned} b_1&=a_1,&b_2&=\min (a_2,a_3)\\ b_3&=\max (a_2,a_3),&b_4&=a_4 \end{aligned}$$

so that we have \(b_1 \le b_2 \le b_3 \le b_4\). According to the law of total probability, \(P(\left\langle XY\right\rangle )\) can be computed by Eq. (33), since \([b_1,b_4]=[b_1,b_2]\cup [b_2,b_3]\cup [b_3,b_4]\).

$$\begin{aligned} P(\left\langle XY\right\rangle )&= \sum _{k=1}^3{P(\left\langle XY\right\rangle |y \in [b_k,b_{k+1}])*P(y \in [b_k, b_{k+1}])}\nonumber \\&=\sum _{k=1}^3{P(\left\langle XY_k\right\rangle )*P(Y=Y_k)} \end{aligned}$$
(33)

where \(Y_k \sim U(b_k,b_{k+1})\) is a uniform random variable restricted to the k-th sub-range, and \(P(Y=Y_k)\) denotes the probability that Y falls in \([b_k,b_{k+1}]\), which can be computed as:

$$\begin{aligned} P(Y=Y_k)= \int _{b_k}^{b_{k+1}}{\frac{1}{y^+-y^-}{\mathrm {d}}y} = \frac{b_{k+1} - b_k}{y^+-y^-} \end{aligned}$$
(34)
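The partition step and Eq. (34) can be sketched as follows. The helper names `breakpoints` and `range_weights` and the sample intervals are illustrative assumptions; the code only mirrors the definitions of \(a_1,\dots ,a_4\), \(b_1,\dots ,b_4\) and \(P(Y=Y_k)\) given above.

```python
from typing import List


def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to [lower, upper] (the role of V_y in Eq. (30))."""
    return max(lower, min(value, upper))


def breakpoints(x_lo: float, x_hi: float, y_lo: float, y_hi: float,
                l: float, h: float) -> List[float]:
    """Return the sorted breakpoints [b1, b2, b3, b4] built from a1..a4."""
    a1 = clamp(x_lo + l, y_lo, y_hi)
    a2 = clamp(x_lo + h, y_lo, y_hi)
    a3 = clamp(x_hi + l, y_lo, y_hi)
    a4 = clamp(x_hi + h, y_lo, y_hi)
    return [a1, min(a2, a3), max(a2, a3), a4]


def range_weights(b: List[float], y_lo: float, y_hi: float) -> List[float]:
    """P(Y = Y_k) from Eq. (34): the share of Y's interval in [b_k, b_{k+1}]."""
    return [(b[k + 1] - b[k]) / (y_hi - y_lo) for k in range(3)]


# Hypothetical setting: X ~ U(0, 2), Y ~ U(1, 4), l = 1, h = 2.
b = breakpoints(0.0, 2.0, 1.0, 4.0, 1.0, 2.0)
w = range_weights(b, 1.0, 4.0)
print(b)  # the four breakpoints in ascending order
print(w)  # one weight per sub-range
```

The weights sum to \(P(y \in [b_1,b_4])\), which equals 1 only when \([b_1,b_4]\) covers all of \([y^-,y^+]\), as it does in this example.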

Next, we prove the correctness of Eq. (18) in three cases.

  • If \(Y_k \sim U[b_1,b_2]\), then \(f(y)=y-l-x^-\). According to Eq. (31), with the normalizing constant \(S_k=(x^+-x^-)(b_{k+1}-b_k)\) for \(Y_k\), we can compute \(P(\left\langle XY_k\right\rangle )\) as:

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{b_1}^{b_2}{y-l-x^-} {\mathrm {d}}y\nonumber \\&=\frac{1}{2S_k}(y-l-x^-)^2|_{b_1}^{b_2} =\frac{b_1+b_2-2l-2x^-}{2(x^+-x^-)} \end{aligned}$$
    (35)

    Meanwhile, when \(y \in [b_1,b_2]\), we know that \(y-l \le x^+\) and \(y-h \le x^-\). According to Eq. (19), we can compute \(L_1=b_1-l-x^-\) and \(L_2=b_2-l-x^-\). By substituting the values of \(L_1\) and \(L_2\) into Eq. (18), we can compute \(P(\left\langle XY_k\right\rangle )\) by the geographic approach in Eq. (36), which is consistent with Eq. (35).

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=\frac{b_1+b_2-2l-2x^-}{2(x^+-x^-)} \end{aligned}$$
    (36)
  • If \(Y_k \sim U[b_2,b_3]\), we consider two sub-cases.

    (a) if \(V_y(x^++l) < V_y(x^-+h)\). Referring to Eq. (31), we can compute \(P(\left\langle XY_k\right\rangle )\) as follows:

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{V_y(x^++l) }^{V_y(x^-+h)}{x^+-x^-} {\mathrm {d}}y\nonumber \\&=\frac{(x^+-x^-)(V_y(x^-+h)-V_y(x^++l))}{S_k}=1 \end{aligned}$$
    (37)

    Meanwhile, when \(V_y(x^++l) \le y \le V_y(x^-+h)\), we have \(L_2=x^+-x^-\) and \(L_3=x^+-x^-\) by Eq. (19). Then, \(P(\left\langle XY_k\right\rangle )\) can be computed by the geographic function in Eq. (38), which is consistent with Eq. (37).

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=1 \end{aligned}$$
    (38)

    (b) if \(V_y(x^-+h) < V_y(x^++l)\). We first have:

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{V_y(x^-+h) }^{V_y(x^++l)}{(h-l)} {\mathrm {d}}y\nonumber \\&=\frac{(h-l)(V_y(x^++l)-V_y(x^-+h))}{S_k}=\frac{h-l}{x^+-x^-} \end{aligned}$$
    (39)

    Meanwhile, \(L_2=h-l\) and \(L_3=h-l\), since now we have \(V_y(x^-+h) \le y \le V_y(x^++l)\). And the output of the geographic function in Eq. (40) is consistent with that of Eq. (39).

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=\frac{h-l}{(x^+-x^-)} \end{aligned}$$
    (40)
  • If \(Y_k \sim U[b_3,b_4]\), then referring to Eq. (31), we can compute \(P(\left\langle XY_k\right\rangle )\) as follows:

    $$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{b_3}^{b_4}{(x^++h-y)} {\mathrm {d}}y=\frac{1}{2S_k}(x^++h-y)^2|_{b_4}^{b_3}\nonumber \\&=\frac{2x^++2h-b_3-b_4}{2(x^+-x^-)} \end{aligned}$$
    (41)

    Meanwhile, since \(\max (V_y(x^++l), V_y(x^-+h)) \le y \le V_y(x^++h)\), we have \(L_3=x^+-b_3+h\) and \(L_4=x^+-b_4+h\), and then we have Eq. (42), which is consistent with Eq. (41).

    $$\begin{aligned} P(\left\langle XY_k\right\rangle ) =\frac{L_{k+1}+L_k}{2*(x^+-x^-)}=\frac{2x^++2h-b_3-b_4}{2(x^+-x^-)} \end{aligned}$$
    (42)

Therefore, the correctness of our geographic approach in Sect. 5.1 is proved.
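The whole derivation can be sketched end to end: compute the breakpoints, evaluate the chord lengths \(L_k\) at each breakpoint, and combine the trapezoid terms of Eqs. (36), (38), (40) and (42) with the weights of Eq. (34). Eqs. (18) and (19) are not reproduced in this appendix, so the sketch assumes \(L_k = V_x(b_k-l) - V_x(b_k-h)\), as inferred from the three cases above. All function names are hypothetical, both intervals are assumed to have nonzero width, and the brute-force check is added only for comparison.

```python
def clamp(value: float, lower: float, upper: float) -> float:
    """Restrict value to [lower, upper] (V_x and V_y in Eqs. (29)-(30))."""
    return max(lower, min(value, upper))


def gap_probability(x_lo: float, x_hi: float, y_lo: float, y_hi: float,
                    l: float, h: float) -> float:
    """P(l <= Y - X <= h) for independent X ~ U(x_lo, x_hi), Y ~ U(y_lo, y_hi)."""
    def chord(y: float) -> float:
        # Assumed L_k: length of the admissible x-range at a given y.
        return clamp(y - l, x_lo, x_hi) - clamp(y - h, x_lo, x_hi)

    a1 = clamp(x_lo + l, y_lo, y_hi)
    a2 = clamp(x_lo + h, y_lo, y_hi)
    a3 = clamp(x_hi + l, y_lo, y_hi)
    a4 = clamp(x_hi + h, y_lo, y_hi)
    b = [a1, min(a2, a3), max(a2, a3), a4]

    total = 0.0
    for k in range(3):
        # Trapezoid term (L_k + L_{k+1}) / (2 (x^+ - x^-)).
        p_piece = (chord(b[k]) + chord(b[k + 1])) / (2.0 * (x_hi - x_lo))
        # Weight P(Y = Y_k) from Eq. (34).
        p_range = (b[k + 1] - b[k]) / (y_hi - y_lo)
        total += p_piece * p_range
    return total


def grid_estimate(x_lo: float, x_hi: float, y_lo: float, y_hi: float,
                  l: float, h: float, n: int = 400) -> float:
    """Brute-force estimate of Eq. (31): share of grid-cell midpoints that
    satisfy the gap constraints. Used here only as a sanity check."""
    hits = 0
    for i in range(n):
        x = x_lo + (i + 0.5) * (x_hi - x_lo) / n
        for j in range(n):
            y = y_lo + (j + 0.5) * (y_hi - y_lo) / n
            if l <= y - x <= h:
                hits += 1
    return hits / (n * n)


# Hypothetical setting: X ~ U(0, 2), Y ~ U(1, 4), l = 1, h = 2.
exact = gap_probability(0.0, 2.0, 1.0, 4.0, 1.0, 2.0)
approx = grid_estimate(0.0, 2.0, 1.0, 4.0, 1.0, 2.0)
print(exact, approx)  # both should be close to 1/3
```

Because \(f(y)\) is linear on each sub-range, the trapezoid terms are exact, whereas the grid estimate only converges as `n` grows.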


Cite this article

Ge, J., Xia, Y., Wang, J. et al. Sequential pattern mining in databases with temporal uncertainty. Knowl Inf Syst 51, 821–850 (2017). https://doi.org/10.1007/s10115-016-0977-1


  • DOI: https://doi.org/10.1007/s10115-016-0977-1
