Sequential pattern mining in databases with temporal uncertainty

Ge, Jiaqi; Xia, Yuni; Wang, Jian; Nadungodage, Chandima Hewa; Prabhakar, Sunil

doi:10.1007/s10115-016-0977-1

Regular Paper
Published: 30 July 2016

Volume 51, pages 821–850, (2017)

Knowledge and Information Systems

Abstract

Temporally uncertain data are common in real-world applications. Temporal uncertainty can arise from many causes, such as conflicting or missing event timestamps, network latency, granularity mismatch, synchronization problems, device precision limitations, and data aggregation. In this paper, we propose an efficient algorithm to mine sequential patterns from data with temporal uncertainty. We propose an uncertainty model in which timestamps are modeled as random variables and then design a new approach to manage temporal uncertainty. We integrate it into the pattern-growth sequential pattern mining algorithm to discover probabilistic frequent sequential patterns. Extensive experiments on both synthetic and real datasets show that the proposed algorithm is both efficient and scalable.

Author information

Authors and Affiliations

Department of Computer and Information Science, Indiana University Purdue University Indianapolis, Indianapolis, IN, USA
Jiaqi Ge, Yuni Xia & Chandima Hewa Nadungodage
School of Electronic Science and Engineering, Nanjing University, Nanjing, China
Jian Wang
Department of Computer Science, Purdue University, West Lafayette, IN, USA
Sunil Prabhakar
Expedia Inc., Chicago, IL, 60661, USA
Jiaqi Ge

Corresponding author

Correspondence to Jiaqi Ge.

Appendix 1: Deriving the geographic approach

Suppose $X \sim U(x^-, x^+)$ and $Y \sim U(y^-, y^+)$ are two independent, uniformly distributed uncertain timestamps. Then the joint 2-D density of $X$ and $Y$ is:

$$\begin{aligned} f(x,y)= {\left\{ \begin{array}{ll} \frac{1}{(x^+-x^-)(y^+-y^-)}&{}\quad \text { if }\;\; x \in [x^-, x^+], y \in [y^-, y^+] \\ 0 &{}\quad \text {otherwise} \end{array}\right. } \end{aligned}$$

(28)

First of all, we define two functions $V_x(x)$ and $V_y(y)$ which restrict the possible values of x and y.

$$\begin{aligned} V_x(x)= & {} {\left\{ \begin{array}{ll} x^-,&{}\quad x< x^-\\ x,&{}\quad x^- \le x \le x^+\\ x^+,&{}\quad x > x^+ \end{array}\right. } \end{aligned}$$

(29)

$$\begin{aligned} V_y(y)= & {} {\left\{ \begin{array}{ll} y^-,&{}\quad y< y^-\\ y,&{}\quad y^- \le y \le y^+\\ y^+,&{}\quad y > y^+ \end{array}\right. } \end{aligned}$$

(30)
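As an illustration, $V_x$ and $V_y$ in Eqs. (29) and (30) are both instances of one clamping operation. The sketch below is not taken from the paper; the function name `clamp` and the interval bounds are hypothetical and chosen only for the example.

```python
def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to [low, high], as V_x and V_y do in Eqs. (29)-(30)."""
    return max(low, min(value, high))


# Hypothetical interval X ~ U(2, 5), used only for illustration.
x_lo, x_hi = 2.0, 5.0
below = clamp(1.0, x_lo, x_hi)   # left of the interval: mapped to x_lo
inside = clamp(3.5, x_lo, x_hi)  # inside the interval: returned unchanged
above = clamp(9.0, x_lo, x_hi)   # right of the interval: mapped to x_hi
print(below, inside, above)
```

With this helper, $V_x(x)$ corresponds to `clamp(x, x_lo, x_hi)` and $V_y(y)$ to `clamp(y, y_lo, y_hi)`.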

Given the minimal gap constraint $g_l=l$ and maximal constraint $g_h=h$ ($l < h$), we have:

$$\begin{aligned} Y \le X+h&\implies y \le x^++h \\ Y \ge X+l&\implies y \ge x^-+l \end{aligned}$$

Since $y \in [y^-, y^+]$, we also have $V_y(x^-+l) \le y \le V_y(x^++h)$. Let $P(\left\langle XY\right\rangle )$ be the probability that X and Y satisfy the gap constraints. Then $P(\left\langle XY\right\rangle )$ can be computed as:

$$\begin{aligned} P(\left\langle XY\right\rangle )&=\int \int _{l \le y-x \le h}{f(x,y)}{\mathrm {d}}x{\mathrm {d}}y\nonumber \\&= \int _{V_y(x^-+l)}^{V_y(x^++h)} {\mathrm {d}}y\int _{V_x(y-h)}^{V_x(y-l)} \frac{1}{(x^+-x^-)(y^+-y^-)}{\mathrm {d}}x\nonumber \\&= \frac{1}{S}\int _{V_y(x^-+l)}^{V_y(x^++h)}\left[ V_x(y-l) - V_x(y-h)\right] {\mathrm {d}}y \end{aligned}$$

(31)

where $S={(x^+-x^-)(y^+-y^-)}$ is a constant. Let $f(y)=V_x(y-l) - V_x(y-h)$ denote the integrand in Eq. (31). Notice that $V_x(y-h)=x^+$ implies $V_x(y-l)=x^+$ because $y-l > y-h$, so $f(y)=0$ for such $y$; similarly, $V_x(y-l)=x^-$ implies $V_x(y-h)=x^-$, which again gives $f(y)=0$. These values of $y$ contribute nothing to $P(\left\langle XY\right\rangle )$. Therefore, we only need to consider the remaining four cases:

$$\begin{aligned} V_x(y-l)&=x^+, V_x(y-h)=y-h&\implies&\max (x^++l, x^-+h) \le y \le x^++h \\ V_x(y-l)&=x^+, V_x(y-h)=x^-&\implies&x^++l< y< x^-+h\\ V_x(y-l)&=y-l, V_x(y-h)=y-h&\implies&x^-+h< y < x^++l\\ V_x(y-l)&=y-l, V_x(y-h)=x^-&\implies&x^-+l \le y \le \min (x^++l,x^-+h) \end{aligned}$$

Since we also have $y^-\le y \le y^+$, the value of f(y) depends on different ranges of y:

$$\begin{aligned} f(y)={\left\{ \begin{array}{ll} x^++h-y&{} \quad \max \big (V_y(x^++l),V_y(x^-+h)\big ) \le y \le V_y(x^++h)\\ x^+-x^-&{}\quad V_y(x^++l)<y< V_y(x^-+h)\\ h-l&{}\quad V_y(x^-+h)<y < V_y(x^++l)\\ y-l-x^-&{}\quad V_y(x^-+l) \le y \le \min \big (V_y(x^++l),V_y(x^-+h)\big )\\ \end{array}\right. } \end{aligned}$$

(32)
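The integral in Eq. (31), whose integrand is $V_x(y-l) - V_x(y-h)$, can be sanity-checked numerically. The following is a minimal sketch using a midpoint rule; the function names, the step count, and the example intervals are assumptions made for illustration and do not come from the paper.

```python
def clamp(value, low, high):
    """V_x / V_y from Eqs. (29)-(30): restrict value to [low, high]."""
    return max(low, min(value, high))


def gap_probability_numeric(x_lo, x_hi, y_lo, y_hi, l, h, steps=20_000):
    """Midpoint-rule approximation of Eq. (31) for uniform X and Y."""
    area = (x_hi - x_lo) * (y_hi - y_lo)   # S in Eq. (31)
    lower = clamp(x_lo + l, y_lo, y_hi)    # V_y(x^- + l)
    upper = clamp(x_hi + h, y_lo, y_hi)    # V_y(x^+ + h)
    if upper <= lower:
        return 0.0                         # empty integration range
    dy = (upper - lower) / steps
    total = 0.0
    for i in range(steps):
        y = lower + (i + 0.5) * dy
        # Integrand of Eq. (31): V_x(y - l) - V_x(y - h)
        total += clamp(y - l, x_lo, x_hi) - clamp(y - h, x_lo, x_hi)
    return total * dy / area


# Hypothetical example: X, Y ~ U(0, 1) with gap constraints l = 0, h = 1,
# so the event reduces to Y >= X.
p = gap_probability_numeric(0.0, 1.0, 0.0, 1.0, 0.0, 1.0)
print(round(p, 6))
```

For this example the event is simply $Y \ge X$, so the result should be close to one half. When the two intervals are too far apart to satisfy the constraints, the integration range collapses and the function returns zero.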

Here we first set:

$$\begin{aligned} a_1&=V_y(x^-+l),&a_2&=V_y(x^-+h)\\ a_3&=V_y(x^++l),&a_4&=V_y( x^++h) \end{aligned}$$

Then the range $[a_1, a_4]$ is divided into disjoint sub-partitions by two points $a_2, a_3$. In order to sort the values of $a_1,a_2,a_3 \text { and }a_4$, we set:

$$\begin{aligned} b_1&=a_1,&b_2&=\min (a_2,a_3)\\ b_3&=\max (a_2,a_3),&b_4&=a_4 \end{aligned}$$

so that we have $b_1 \le b_2 \le b_3 \le b_4$. According to the law of total probability, $P(\left\langle XY\right\rangle )$ can be computed by Eq. (33), since $[b_1,b_4]=[b_1,b_2]\cup [b_2,b_3]\cup [b_3,b_4]$.

$$\begin{aligned} P(\left\langle XY\right\rangle )&= \sum _{k=1}^3{P(\left\langle XY\right\rangle |y \in [b_k,b_{k+1}])*P(y \in [b_k, b_{k+1}])}\nonumber \\&=\sum _{k=1}^3{P(\left\langle XY_k\right\rangle )*P(Y=Y_k)} \end{aligned}$$

(33)

where $Y_k \sim U(b_k,b_{k+1})$ is a uniform random variable and $P(Y=Y_k)$ can be computed as:

$$\begin{aligned} P(Y=Y_k)= \int _{b_k}^{b_{k+1}}{\frac{1}{y^+-y^-}{\mathrm {d}}y} = \frac{b_{k+1} - b_k}{y^+-y^-} \end{aligned}$$

(34)
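The partition points $b_1,\dots,b_4$ and the weights of Eq. (34) can be sketched as follows. This is an illustrative sketch only: the helper names `breakpoints` and `interval_weights` and the example values are hypothetical.

```python
def clamp(value, low, high):
    """V_y from Eq. (30): restrict value to [low, high]."""
    return max(low, min(value, high))


def breakpoints(x_lo, x_hi, y_lo, y_hi, l, h):
    """Return [b_1, b_2, b_3, b_4], the sorted partition points of y."""
    a1 = clamp(x_lo + l, y_lo, y_hi)
    a2 = clamp(x_lo + h, y_lo, y_hi)
    a3 = clamp(x_hi + l, y_lo, y_hi)
    a4 = clamp(x_hi + h, y_lo, y_hi)
    # a1 is always the smallest and a4 the largest; only a2 and a3 need ordering.
    return [a1, min(a2, a3), max(a2, a3), a4]


def interval_weights(b, y_lo, y_hi):
    """P(Y = Y_k) for k = 1, 2, 3, following Eq. (34)."""
    return [(b[k + 1] - b[k]) / (y_hi - y_lo) for k in range(3)]


# Hypothetical example: X ~ U(0, 4), Y ~ U(1, 9), l = 1, h = 3.
b = breakpoints(0.0, 4.0, 1.0, 9.0, 1.0, 3.0)
w = interval_weights(b, 1.0, 9.0)
print(b, w)
```

The weights sum to $(b_4-b_1)/(y^+-y^-)$ rather than to 1, since values of $y$ outside $[b_1,b_4]$ cannot satisfy the gap constraints.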

Next, we prove the correctness of Eq. (18) in three cases.

If $Y_k \sim U[b_1,b_2]$, then $f(y)=y-l-x^-$. According to Eq. (31), with $S_k=(x^+-x^-)(b_{k+1}-b_k)$ taking the place of $S$ for the pair $(X, Y_k)$, we can compute $P(\left\langle XY_k\right\rangle )$ as:
$$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{b_1}^{b_2}{(y-l-x^-)} {\mathrm {d}}y\nonumber \\&=\frac{1}{2S_k}(y-l-x^-)^2|_{b_1}^{b_2} =\frac{b_1+b_2-2l-2x^-}{2(x^+-x^-)} \end{aligned}$$
(35)
Meanwhile, when $y \in [b_1,b_2]$, we know that $y-l \le x^+$ and $y-h \le x^-$. According to Eq. (19), we can compute $L_1=b_1-l-x^-$ and $L_2=b_2-l-x^-$. By substituting the values of $L_1$ and $L_2$ into Eq. (18), we can compute $P(\left\langle XY_k\right\rangle )$ by the geographic approach in Eq. (36), which is consistent with Eq. (35).
$$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=\frac{b_1+b_2-2l-2x^-}{2(x^+-x^-)} \end{aligned}$$
(36)
If $Y_k \sim U[b_2,b_3]$, we consider two sub-cases.

(a) If $V_y(x^++l) < V_y(x^-+h)$, then referring to Eq. (31), we can compute $P(\left\langle XY_k\right\rangle )$ as follows:
$$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{V_y(x^++l) }^{V_y(x^-+h)}{(x^+-x^-)} {\mathrm {d}}y\nonumber \\&=\frac{(x^+-x^-)(V_y(x^-+h)-V_y(x^++l))}{S_k}=1 \end{aligned}$$
(37)
Meanwhile, when $V_y(x^++l) \le y \le V_y(x^-+h)$, we have $L_2=x^+-x^-$ and $L_3=x^+-x^-$ by Eq. (19). Then, $P(\left\langle XY_k\right\rangle )$ can be computed by the geographic function in Eq. (38), which is consistent with Eq. (37).
$$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=1 \end{aligned}$$
(38)

(b) If $V_y(x^-+h) < V_y(x^++l)$, we first have:
$$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{V_y(x^-+h) }^{V_y(x^++l)}{(h-l)} {\mathrm {d}}y\nonumber \\&=\frac{(h-l)(V_y(x^++l)-V_y(x^-+h))}{S_k}=\frac{h-l}{x^+-x^-} \end{aligned}$$
(39)
Meanwhile, $L_2=h-l$ and $L_3=h-l$, since now we have $V_y(x^-+h) \le y \le V_y(x^++l)$. The output of the geographic function in Eq. (40) is therefore consistent with that of Eq. (39).
$$\begin{aligned} P(\left\langle XY_k\right\rangle )=\frac{L_{k+1}+L_k}{2(x^+-x^-)}=\frac{h-l}{(x^+-x^-)} \end{aligned}$$
(40)
If $Y_k \sim U[b_3,b_4]$, then referring to Eq. (31), we can compute $P(\left\langle XY_k\right\rangle )$ as follows:
$$\begin{aligned} P(\left\langle XY_k\right\rangle )&= \frac{1}{S_k} \int _{b_3}^{b_4}{(x^++h-y)} {\mathrm {d}}y=\frac{1}{2S_k}(x^++h-y)^2|_{b_4}^{b_3}\nonumber \\&=\frac{2x^++2h-b_3-b_4}{2(x^+-x^-)} \end{aligned}$$
(41)
Meanwhile, since $\max (V_y(x^++l), V_y(x^-+h)) \le y \le V_y(x^++h)$, we have $L_3=x^+-b_3+h$ and $L_4=x^+-b_4+h$, and then we have Eq. (42), which is consistent with Eq. (41).
$$\begin{aligned} P(\left\langle XY_k\right\rangle ) =\frac{L_{k+1}+L_k}{2(x^+-x^-)}=\frac{2x^++2h-b_3-b_4}{2(x^+-x^-)} \end{aligned}$$
(42)

Therefore, the correctness of our geographic approach in Sect. 5.1 is proved.
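The three cases above can be combined into a single closed-form computation. The sketch below is an illustration under stated assumptions, not the paper's implementation: it takes $L_k = V_x(b_k-l) - V_x(b_k-h)$, which matches the values of $L_1,\dots,L_4$ quoted in this appendix (Eq. (19) itself is not reproduced here), and the function names and example values are hypothetical.

```python
def clamp(value, low, high):
    """V_x / V_y from Eqs. (29)-(30): restrict value to [low, high]."""
    return max(low, min(value, high))


def gap_probability(x_lo, x_hi, y_lo, y_hi, l, h):
    """P(l <= Y - X <= h) for independent X ~ U(x_lo, x_hi), Y ~ U(y_lo, y_hi).

    Assumes non-degenerate intervals (x_hi > x_lo and y_hi > y_lo).
    """
    # Partition points a_1..a_4, sorted into b_1..b_4 as in the appendix.
    a1 = clamp(x_lo + l, y_lo, y_hi)
    a2 = clamp(x_lo + h, y_lo, y_hi)
    a3 = clamp(x_hi + l, y_lo, y_hi)
    a4 = clamp(x_hi + h, y_lo, y_hi)
    b = [a1, min(a2, a3), max(a2, a3), a4]

    # Assumed reading of L_k: the integrand of Eq. (31) evaluated at b_k.
    lengths = [clamp(y - l, x_lo, x_hi) - clamp(y - h, x_lo, x_hi) for y in b]

    total = 0.0
    for k in range(3):
        p_k = (lengths[k] + lengths[k + 1]) / (2 * (x_hi - x_lo))  # Eqs. (36)-(42)
        w_k = (b[k + 1] - b[k]) / (y_hi - y_lo)                    # Eq. (34)
        total += p_k * w_k                                         # Eq. (33)
    return total


# Hypothetical example: X ~ U(0, 4), Y ~ U(1, 9), l = 1, h = 3.
print(gap_probability(0.0, 4.0, 1.0, 9.0, 1.0, 3.0))
```

In this example every $x \in [0,4]$ admits a band of $y$ values of length $h-l=2$ that lies entirely inside $[1,9]$, so the feasible region has area 8 out of a total area of 32, and the function returns 0.25.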

About this article

Cite this article

Ge, J., Xia, Y., Wang, J. et al. Sequential pattern mining in databases with temporal uncertainty. Knowl Inf Syst 51, 821–850 (2017). https://doi.org/10.1007/s10115-016-0977-1

Received: 30 April 2015
Revised: 20 June 2016
Accepted: 26 July 2016
Published: 30 July 2016
Issue Date: June 2017
DOI: https://doi.org/10.1007/s10115-016-0977-1
