A new efficient approach for extracting the closed episodes for workload prediction in cloud

Amiri, Maryam; Mohammad-Khanli, Leyli; Mirandola, Raffaela

doi:10.1007/s00607-019-00734-3

Published: 13 June 2019

Computing, Volume 102, pages 141–200 (2020)

Maryam Amiri¹,
Leyli Mohammad-Khanli² &
Raffaela Mirandola³

Abstract

The prediction of the future workload of applications is an essential step guiding resource provisioning in cloud environments. In our previous works, we proposed two prediction models based on pattern mining. This paper builds on that experience and focuses on the time and space complexity of the prediction model. Specifically, it presents a general approach to improve the efficiency of the pattern mining engine, which in turn improves the efficiency of the predictors. The approach is composed of two steps: (1) to improve space complexity, redundant occurrences of patterns are defined and algorithms are suggested to identify and omit them; (2) to improve time complexity, a new data structure, called the closed pattern backward tree, is presented for mining closed patterns directly. The approach not only improves the efficiency of our predictors but can also be employed in different fields of pattern mining. The performance of the proposed approach is investigated on real and synthetic cloud workloads. The experimental results show that the proposed approach significantly improves the efficiency of the pattern mining engine in comparison to common methods for extracting closed patterns.

Notes

The proof of lemmas and theorems could be found in “Appendix A”.
These traces can be accessed at http://gwa.ewi.tudelft.nl/datasets/Bitbrains.
The proof of lemmas and theorems could be found in “Appendix A”.

References

Petcu D, Vázquez-Poletti JL (2012) European research activities in cloud computing. Cambridge Scholars Publishing, Cambridge
Amiri M, Mohammad-Khanli L, Mirandola R (2018) An online learning model based on episode mining for workload prediction in cloud. Future Gener Comput Syst 87:83
Amiri M, Mohammad-Khanli L (2017) Survey on prediction models of applications for resources provisioning in cloud. J Netw Comput Appl 82:93–113
Jiang Y, Perng C-S, Li T, Chang RN (2013) Cloud analytics for capacity planning and instant VM provisioning. IEEE Trans Netw Serv Manag 10(3):312–325
Cetinski K, Juric MB (2015) AME-WPC: advanced model for efficient workload prediction in the cloud. J Netw Comput Appl 55:191–201
Amiri M, Feizi-Derakhshi MR, Mohammad-Khanli L (2017) IDS fitted Q improvement using fuzzy approach for resource provisioning in cloud. J Intell Fuzzy Syst 32(1):229–240
Altevogt P, Denzel W, Kiss T (2016) Cloud modeling and simulation. Wiley-IEEE Press, London
Yang J, Liu C, Shang Y, Cheng B, Mao Z, Liu C, Niu L, Chen J (2014) A cost-aware auto-scaling approach using the workload prediction in service clouds. Inf Syst Front 16(1):7–18
Shi P, Wang H, Yin G, Fengshun L, Wang T (2012) Prediction-based federated management of multi-scale resources in cloud. Adv Inf Sci Serv Sci 4(6):324–334
Matsunaga A, Fortes JAB (2010) On the use of machine learning to predict the time and resources consumed by applications. In: Proceedings of the 2010 10th IEEE/ACM international conference on cluster, cloud and grid computing, Melbourne, Victoria, Australia, pp 495–504. IEEE Computer Society
Amiri M, Mohammad-Khanli L, Mirandola R (2018) A sequential pattern mining model for application workload prediction in cloud environment. J Netw Comput Appl 105:21–62
Achar A, Ibrahim A, Sastry PS (2013) Pattern-growth based frequent serial episode discovery. Data Knowl Eng 87:91–108
Yan X, Han J, Afshar R (2003) CloSpan: mining closed sequential patterns in large datasets. In: Proceedings of the 2003 SIAM international conference on data mining, San Francisco, CA, USA, pp 166–177
Fahed L, Brun A, Boyer A (2014) Episode rules mining algorithm for distant event prediction. Technical Report hal-01062542, HAL
Huang P, Liu CJ, Yang X, Xiao L, Chen J (2014) Wireless spectrum occupancy prediction based on partial periodic pattern mining. IEEE Trans Parallel Distrib Syst 25(7):1925–1934
Li K, Fu Y (2014) Prediction of human activity by discovering temporal sequence patterns. IEEE Trans Pattern Anal Mach Intell 36(8):1644–1657
Wright AP, Wright AT, McCoy AB, Sittig DF (2015) The use of sequential pattern mining to predict next prescribed medications. J Biomed Inf 53:73–80
Gan W, Lin JCW, Fournier-Viger P, Chao HC, Yu PS (2018) A survey of parallel sequential pattern mining. CoRR, arXiv:1805.10515
Dinh D-T, Le B, Fournier-Viger P, Huynh V-N (2018) An efficient algorithm for mining periodic high-utility sequential patterns. Appl Intell 48(12):4694–4714
Martin F, Méger N, Galichet S, Becourt N (2012) Forecasting failures in a data stream context application to vacuum pumping system prognosis. Trans Mach Learn Data Min 5(2):87–116
D’Andreagiovanni M, Baiardi F, Lipilini J, Ruggieri S, Tonelli F (2019) Sequential pattern mining for ict risk assessment and management. J Log Algebraic Methods Program 102:1–16
Van T, Yoshitaka A, Le B (2018) Mining web access patterns with super-pattern constraint. Appl Intell 48(11):3902–3914
Mannila H, Toivonen H, Verkamo AI (1997) Discovery of frequent episodes in event sequences. Data Min Knowl Discov 1(3):259–289
Rathore S, Goyal V (2015) Top-K high utility episode mining in complex event sequence. PhD thesis
Höppner F (2001) Discovery of temporal patterns. Learning rules about the qualitative behaviour of time series. In: Proceedings of the 5th European conference on principles of data mining and knowledge discovery, PKDD ’01. Springer, London, pp 192–203
Papapetrou P, Kollios G, Sclaroff S, Gunopulos D (Nov 2005) Discovering frequent arrangements of temporal intervals. In: Fifth IEEE international conference on data mining (ICDM’05), Houston, TX, USA. IEEE
Batal I, Cooper GF, Fradkin D, Harrison J Jr, Moerchen F, Hauskrecht M (2016) An efficient pattern mining approach for event detection in multivariate temporal data. Knowl Inf Syst 46(1):115–150
Winarko E, Roddick JF (2007) ARMADA: an algorithm for discovering richer relative temporal association rules from interval-based data. Data Knowl Eng 63(1):76–90 (Data Warehouse and Knowledge Discovery, DAWAK’05)
Papadopoulos S, Drosou A, Tzovaras D (2016) Fast frequent episode mining based on finite-state machines. In: Abdelrahman OH, Gelenbe E, Gorbil G, Lent R (eds) Information sciences and systems 2015. Springer International Publishing, Cham, pp 199–208
Lin M-Y, Lee S-Y (2002) Fast discovery of sequential patterns by memory indexing. Springer, Berlin, pp 150–160
Moskovitch R, Shahar Y (2009) Medical temporal-knowledge discovery via temporal abstraction. AMIA Annu Symp Proc 2009:452–456
Moskovitch R, Walsh C, Wang F, Hripcsak G, Tatonetti N (Nov 2015) Outcomes prediction via time intervals related patterns. In: 2015 IEEE international conference on data mining, pp 919–924
Sacchi L, Larizza C, Combi C, Bellazzi R (2007) Data mining with temporal abstractions: learning rules from time series. Data Min Knowl Discov 15(2):217–247
Allen JF (1984) Towards a general theory of action and time. Artif Intell 23(2):123–154
Patel D, Hsu W, Lee ML (2008) Mining relationships among interval-based events for classification. In: Proceedings of the 2008 ACM SIGMOD international conference on management of data, SIGMOD ’08. ACM, New York, NY, USA, pp 393–404
Batal I, Fradkin D, Harrison J, Moerchen F, Hauskrecht M (2012) Mining recent temporal patterns for event detection in multivariate time series data. In: Proceedings of the 18th ACM SIGKDD international conference on knowledge discovery and data mining, KDD ’12. ACM, Beijing, China, pp 280–288
Ghosh S, Li J, Cao L, Ramamohanarao K (2017) Septic shock prediction for ICU patients via coupled HMM walking on sequential contrast patterns. J Biomed Inf 66:19–31
Laxman S, Sastry P, Unnikrishnan K (2007) Discovering frequent generalized episodes when events persist for different durations. IEEE Trans Knowl Data Eng 19(9):1188–1201
Tatti N, Cule B (2010) Mining closed strict episodes. In: Proceedings of the 2010 IEEE international conference on data mining, ICDM ’10. IEEE Computer Society, Washington, DC, USA, pp 501–510
Wu S-Y, Chen Y-L (2007) Mining nonambiguous temporal patterns for interval-based events. IEEE Trans Knowl Data Eng 19(6):742–758
Laxman S, Sastry PS, Unnikrishnan KP (2005) Discovering frequent episodes and learning hidden markov models: a formal connection. IEEE Trans Knowl Data Eng 17(11):1505–1517
Hwang K, Bai X, Shi M, Li Y, Chen WG, Wu Y (2016) Cloud performance modeling and benchmark evaluation of elastic scaling strategies. IEEE Trans Parallel Distrib Syst 27(1):130–143
Tatti N, Cule B (2012) Mining closed strict episodes. Data Min Knowl Discov 25(1):34–66
Zaki MJ (2001) Spade: an efficient algorithm for mining frequent sequences. Mach Learn 42(1–2):31–60
Neapolitan RE, Neapolitan R, Naimipour K (2010) Foundations of algorithms. Jones & Bartlett Learning, Burlington
Alam M, Shakil KA, Sethi S (2016) Analysis and clustering of workload in google cluster trace based on resource usage. In: 2016 IEEE international conference on computational science and engineering (CSE) and IEEE international conference on embedded and ubiquitous computing (EUC) and 15th international symposium on distributed computing and applications for business engineering (DCABES), pp 740–747. IEEE
Iosup A, Li H, Jan M, Anoep S, Dumitrescu C, Wolters L, Epema DHJ (2008) The grid workloads archive. Future Gener Comput Syst 24(7):672–686
Shen S, van Beek V, Iosup A (2015) Statistical characterization of business-critical workloads hosted in cloud datacenters. In: 2015 15th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGrid), pp 465–474. IEEE
Li A, Yang X, Kandula S, Zhang M (2010) Cloudcmp: comparing public cloud providers. In: Proceedings of the 10th ACM SIGCOMM conference on Internet measurement, pp 1–14. ACM

Acknowledgements

The GWA-T-12 Bitbrains traces are provided by Bitbrains IT Services Inc., which is a service provider that specializes in managed hosting and business computation for enterprises. We thank the GWA team and all those who have graciously provided the data for us.

Author information

Authors and Affiliations

Department of Computer Engineering, Faculty of Engineering, Arak University, Arak, 38156-8-8349, Iran
Maryam Amiri
Faculty of Electrical and Computer Engineering, University of Tabriz, 29 Bahman Blvd, Tabriz, Iran
Leyli Mohammad-Khanli
Dipartimento di Elettronica, Informazione e Bioingegneria, Politecnico di Milano, Via Golgi 42, 20133, Milan, Italy
Raffaela Mirandola

Corresponding author

Correspondence to Leyli Mohammad-Khanli.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendices

A Proofs

The proofs of all the theorems, lemmas and corollaries are presented in this appendix. We also introduce some auxiliary lemmas that are used to prove the other lemmas and theorems.

Lemma 5

Given the episode $\alpha =G'_1\rightarrow \cdots \rightarrow G'_k$ and the occurrence $x=([t^j_1,t^j_2]_{j=1}^{k})\in LO(\alpha )$, if there exists a valid occurrence $y=([w^j_1,w^j_2]_{j=1}^{k})$ such that $t^1_1<w^1_1$, then $t^k_1<w^k_1$.

Proof

The proof is by induction on $k$. Base case ($k=2$): the proof is by contradiction. Assume $w^2_1<t^2_1$. We have:

$$\begin{aligned} {\left. \begin{aligned}&w^1_2+\delta<w^2_1<w^1_2+\varDelta \\&t^1_2+\delta<t^2_1<t^1_2+\varDelta \\&t^1_2<w^1_2\\ \end{aligned}\right\} } \rightarrow w^1_2+\delta<w^2_1<t^2_1<w^1_2+\varDelta \end{aligned}$$

(A.1)

Therefore, $x$ does not include an LPO and is not an LO, which contradicts the assumption. Induction step: assume the claim holds for $k=m-1$; we prove it for $k=m$. The proof is by contradiction. Assume $w^m_1<t^m_1$. We have:

$$\begin{aligned} {\left. \begin{aligned} w^{m-1}_2+\delta<w^m_1<w^{m-1}_2+\varDelta \\ t^{m-1}_2+\delta<t^m_1<t^{m-1}_2+\varDelta \\ t^{m-1}_2<w^{m-1}_2\\ \end{aligned}\right\} } \rightarrow w^{m-1}_2+\delta<w^{m}_1<t^{m}_1<w^{m-1}_2+\varDelta \qquad \quad \end{aligned}$$

(A.2)

This means that $x$ does not include an LPO, so it is not an LO, which contradicts the assumption. $\square $

Lemma 1

Given the episode $\alpha =G'_1\rightarrow \cdots \rightarrow G'_k$, if $MPO(\alpha )$ is the set of all the minimal prefix occurrences of $\alpha $, then $LO(\alpha )\subseteq MPO(\alpha )$.

Proof

The proof is by contradiction: assume there exists at least one LO $x=([t_1^{j},t_2^{j}]^{k}_{j=1})$ such that $x\notin MPO(\alpha )$. Since $x\notin MPO(\alpha )$, there should exist a valid occurrence $y=([w_1^{j},w_2^{j}]^{k}_{j=1})$ where $w^{1}_1>t^{1}_1$ and $w^{k}_1\le t^{k}_1$. According to Lemma 5, for each valid occurrence $y = ([w^{j}_{1}, w^{j}_{2}]^{k}_{j=1})$ with $ t^{1}_{1}<w^{1}_{1} $, we should have $ t^{k}_{1}<w^{k}_{1} $. So $x$ is not an LO, which contradicts the assumption. $\square $

Lemma 2

Given the episode $\alpha $, if $\beta $ and $\gamma $ are the serial and concurrent extensions of $\alpha $, removing redundant occurrences from $LOList(\alpha )$ does not affect $freq(\alpha )$, $freq(\beta )$ and $freq(\gamma )$.

Proof

We define $OSet^N_M(\alpha )=\{O_1,\ldots ,O_L\}$ as the set of non-overlapped minimal occurrences in which $ O_1 $ is the first minimal occurrence of $ \alpha $ and $ O_{i+1}$, $1\le i<L$, is the first non-overlapped minimal occurrence after $ O_i $. In [11], we proved that $ OSet^N_M(\alpha ) $ is a maximal non-overlapped set of the minimal occurrences of $\alpha $ in the stream and $freq(\alpha )=|OSet^N_M(\alpha )|$. The proof of the lemma includes three cases:

(1) Impact of removing redundant LOs on $freq(\alpha )$: according to the first condition of Definition 18, the two occurrences $O$ and $Q$ overlap, so at most one of them could be in $OSet^N_M(\alpha )$. On the other hand, $O$ is not a minimal occurrence. So $O\notin OSet^N_M(\alpha )$ and removing it does not affect $freq(\alpha )$.
(2) Impact of removing redundant LOs on $freq(\beta )$: if there exists a sub-interval of $[t^k_2+\delta ,t^k_2+\varDelta ]$ that the occurrence $O$ covers exclusively, or $O$ is a minimal occurrence, then $O$ might be a non-overlapped occurrence of $\alpha $ and form a non-overlapped occurrence of $\beta $. In that case, removing $O$ might lead to losing a non-overlapped occurrence of $\beta $. According to the first condition of Definition 18, each non-overlapped occurrence of $\beta $ whose LPO is $O$ could be formed by using $Q$. On the other hand, there exists no sub-interval of $[t^k_2+\delta ,t^k_2+\varDelta ]$ that the occurrence $O$ covers exclusively. Therefore, removing $O$ does not affect $freq(\beta )$.
(3) Impact of removing redundant LOs on $freq(\gamma )$: if there exists an event $e=(v,s,st,et)$ such that $|t^k_2-st|<\epsilon $ and $|t^k_1-st|<\epsilon $, then removing $O$ affects the frequency of $\gamma =\alpha \odot (r,s)$. So if there is no such event, removing $O$ does not affect $freq(\gamma )$.

Therefore, if the three conditions are satisfied together, removing O does not affect $freq(\alpha )$, $freq(\beta )$ and $freq(\gamma )$. $\square $
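
The greedy construction of $OSet^N_M(\alpha )$ used in this proof can be sketched in Python. This is only an illustration under stated assumptions: occurrences are reduced to hypothetical `(start, end)` spans of minimal occurrences, and two spans are treated as non-overlapped when the later one starts strictly after the earlier one ends. The function names are not taken from the paper.

```python
from typing import List, Tuple

# Assumption: a minimal occurrence is summarized by its inclusive span (start, end).
Span = Tuple[int, int]


def greedy_non_overlapped(spans: List[Span]) -> List[Span]:
    """Take the first span, then repeatedly the first span that starts
    strictly after the previously chosen span ends."""
    chosen: List[Span] = []
    last_end = None
    for start, end in sorted(spans):
        if last_end is None or start > last_end:
            chosen.append((start, end))
            last_end = end
    return chosen


def freq(spans: List[Span]) -> int:
    """Frequency as the size of the greedily built non-overlapped set."""
    return len(greedy_non_overlapped(spans))


occurrences = [(1, 4), (3, 6), (5, 9), (10, 12)]
print(greedy_non_overlapped(occurrences))  # [(1, 4), (5, 9), (10, 12)]
print(freq(occurrences))  # 3
```

In this toy input the span `(3, 6)` overlaps the chosen span `(1, 4)`, so dropping it leaves the count unchanged, which mirrors the first case of the proof.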

Lemma 6

Given the episode $\alpha $ such that $|CNG_{\alpha }|=k$ and $\mu \ge \max (2\epsilon +1,\epsilon +2)$, the successive starting intervals of $G_i, 1\le i\le k$, have no overlap.

Proof

The proof is by contradiction: suppose there are two starting intervals $[t_1,t_2]$ and $[t'_1,t'_2]$ of $G_i$ that overlap: $t_1\le t'_1\le t_2<t'_2$. According to the definition of the episode occurrence in [11], $t_2-t_1\le \epsilon $ and $t'_2-t'_1\le \epsilon $. For every event $e=(r,s,st,et)\in E$ with $t_2< st\le t'_2$, there should exist another event $e'=(r'=r,s'=s,st',et')$ such that $t_1\le st'\le t_2$. Since $et'\le st$, if $t'_1\le st'\le t_2$ then $\varDelta e'<\epsilon $. If $t_1\le st'<t'_1$ and $et'=st$, we have $\varDelta e'=\mu <2\epsilon $, which contradicts $\mu >2\epsilon $. If $t_1\le st'<t'_1$ and $et'<st$, there should exist another event $e''=(r,s''\ne s,st'',et'')$ such that $et'\le st''<et''\le st'$. Since $\varDelta e'>\epsilon $, we have $et'>t_2$ and $\varDelta e''<\epsilon $. These show that the successive starting intervals of $G_i$ have no overlap. $\square $

Lemma 7

Given the episode $\alpha $ such that $|CNG_{\alpha }|=k$, $ 1\le i\le k $, $\mu \ge \max (2\epsilon +1,\epsilon +2)$ and the two occurrences $O,O'\in OSet(\alpha )$, if $[u,u']$ is the starting interval of $G_i$ in $O$, then the following starting interval $[w,w']$ of $G_i$ in $O'$ satisfies $w>2\epsilon +u$.

Proof

According to the definition of the occurrence O in [11], $\exists A^{i}_{j}\in G_i, 1\le i\le k, j\in \{1,\ldots ,l_i\}$ that $g_{\alpha }(A^{i}_{j})=(r,s)$, $h(A^{i}_{j})=a$ and $e_a=(r,s,st=u,et)$. Since $\varDelta e_a>\epsilon $, we have $et>\epsilon +u$. According to Lemma 6, $w>u'$. For the occurrence $O'$, $h'(A^{i}_{j})=b$ such that $e'_b=(r'=r,s'=s,st'=v,et')$, $w\le v\le w'$, $\varDelta e'_b>\epsilon $ and $et'>v+\epsilon $. If $st'>et$, there should exist the other event $e_m=(r,s_m\ne s,st_m,et_m)$ such that $st_m=et$. Since $\varDelta e_a>\epsilon $ and $\varDelta e_m>\epsilon $, then $w>2\epsilon +u$. If $st'=et$, we have $\varDelta e_a=\mu $. So $w-u=\mu >2\epsilon $. Note that the condition $\mu >2\epsilon $ is reasonable because the minimum span of events is $\epsilon +1$ [11]. So the decomposition unit of events, $\mu $, could be twice more than the minimum span. $\square $

Lemma 8

Given the episode $\alpha =G'_1\rightarrow \cdots \rightarrow G'_k$ and the two occurrences $O=([t^i_1,t^i_2]^{k}_{i=1})$ and $Q=([w^i_1,w^i_2]^{k}_{i=1})$, where $O,Q\in LO(\alpha )$, if $\exists j, 1\le j\le k-1$, such that for $r=1,2,\ldots ,j-1$: $t^r_1=w^r_1$, $t^r_2=w^r_2$ and $t^j_1<w^j_1$, then $t^k_1<w^k_1$ and $t^k_2<w^k_2$.

Proof

The proof is by induction on $k$. Base case ($k=2$): we have $j=1$ and, according to Lemmas 6 and 7, $w^1_1>t^1_2$. If $t^2_1\in [t^1_2+\delta ,\min (t^1_2+\varDelta ,w^1_2+\delta -1)]$, then we have $O\in LO(\alpha )$. If $w^2_1\in [w^1_2+\delta ,w^1_2+\varDelta ]$, then $Q\in LO(\alpha )$. So we have $w^2_1>t^2_1$ and, according to Lemmas 6 and 7, $w^2_2>t^2_2$.

Induction step: assume the lemma holds for $k=m-1$; it should be proved for $k=m$. Since the lemma is correct for $k=m-1$, if $t^j_1<w^j_1,1\le j\le m-2$, then $t^{m-1}_{2}<w^{m-1}_{2}$. The starting interval of $G'_m$ in $O$, $t^m_1$, should be in the interval $[t^{m-1}_2+\delta ,\min (t^{m-1}_2+\varDelta ,w^{m-1}_2+\delta -1)]$ because if $t^m_1<t^{m-1}_2+\delta $, then $O$ is not a valid occurrence, and if $t^m_1>\min (t^{m-1}_2+\varDelta ,w^{m-1}_2+\delta -1)$, then either $O$ is not a valid occurrence or, since $w^{m-1}_2>t^{m-1}_2$ and $t^1_1\le w^j_1$, $O$ cannot include the starting interval of $G'_m$ (because $O$ does not include an LPO).

The starting interval of $G'_m$ in $Q$, $w^m_1$, should also be in the interval $[w^{m-1}_2+\delta ,w^{m-1}_2+\varDelta ]$ because the gap constraints are satisfied. Since $t^1_1\le w^1_1$ and $w^{m-1}_2>t^{m-1}_2$, $Q$ could include the starting interval of $G'_m$ because it includes an LPO. So we have:

$$\begin{aligned} {\left. \begin{aligned} w^m_1\ge w^{m-1}_2+\delta \\ t^m_1<w^{m-1}_2+\delta \end{aligned}\right\} } \rightarrow t^m_1<w^m_1 \end{aligned}$$

On the other hand, according to Lemmas 6 and 7, two occurrences of $G'_m$ have no overlap. So we have $w^m_2\ge w^m_1>t^m_2$. For $j=m-1$, in a similar way to the previous case, if $t^m_1\in [t^{m-1}_2+\delta ,\min (t^{m-1}_2+\varDelta ,w^{m-1}_{2}+\delta -1)]$, then $O$ is an LO. If $w^m_1\in [w^{m-1}_2+\delta ,w^{m-1}_2+\varDelta ]$, then $Q$ is an LO. So we have $w^m_1>t^m_1$ and, according to Lemmas 6 and 7, $w^m_2\ge w^m_1>t^m_2\ge t^m_1$. $\square $

Lemma 3

Given the episode $\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_k$ and the occurrences $A=([a^i_1,a^i_2]^k_{i=1})$, $B=([b^i_1,b^i_2]^k_{i=1})$ and $C=([c^i_1,c^i_2]^k_{i=1})$, where $A,B,C\in LO(\alpha )$ and $A$ and $C$ are the LOs immediately before and after $B$ respectively, if $a^1_1\ne b^1_1$ and $[b^k_1,b^k_2]$ is not covered by $[a^k_1,a^k_2]$ and $[c^k_1,c^k_2]$, then all of the LOs before $A$ start before $B$ and $[b^k_1,b^k_2]$ is covered by none of the LOs before $A$ and after $C$.

Proof

According to Lemmas 6, 7 and 8, the starting intervals of all the LOs before $A$ are equal to $[a^1_1,a^1_2]$ or before $[a^1_1,a^1_2]$. If $a^1_1\ne b^1_1$, it means that $a^1_1<b^1_1$. It is clear that the starting intervals of all the LOs before $A$ do not coincide with $b^1_1$. If $VI([b^k_1,b^k_2],k+1)$ is not covered by $VI([a^k_1,a^k_2],k+1)$ and $VI([c^k_1,c^k_2],k+1)$, the starting intervals of all the LOs before $A$ are before $[a^k_1,a^k_2]$ and the starting intervals of all the LOs after $C$ are after $[c^k_1,c^k_2]$. So the VIs of these intervals do not cover $VI([b^k_1,b^k_2],k+1)$. $\square $

Lemma 4

If $\epsilon \ge \frac{\delta }{4}$ and $ \varDelta \in [\delta ,2\delta ) $, then there is no redundant LO.

Proof

According to the second condition of Definition 18 and Lemma 3, $[t^k_2+\delta ,t^k_2+\varDelta ]$ of O (the redundant LO) should be covered by the LOs immediately before and after O. Assume $O_1$ and $O_2$ are the LOs immediately before and after O respectively. We define $[b_1,b'_1], [b,b']$ and $[b_2,b'_2]$ as the starting intervals of the last group of the episode in $O_1$, O and $O_2$ respectively. It is clear that $b'_1<b'<b'_2$. If $b'_2+\delta \le b'_1+\varDelta $, then we have $b'_1+\delta<b'+\delta<b'_2+\delta \le b'_1+\varDelta<b'+\varDelta <b'_2+\varDelta $. It means that $[b'+\delta ,b'+\varDelta ]$ is covered completely. Since $b'_2+\delta \le b'_1+\varDelta $, we have $b'_2-b'_1\le \varDelta -\delta $. According to the upper bound of $\varDelta $ ($\delta \le \varDelta <2\delta $), $b'_2-b'_1< \delta $. On the other hand, we have $b>b_1+2\epsilon $ and $b_2>b+2\epsilon $ according to Lemma 7. So we have $b'_2-b'_1>4\epsilon $. It means that $4\epsilon<b'_2-b'_1<\delta $. Therefore, if $\epsilon \ge \frac{\delta }{4} $, the second condition of Definition 18 is not satisfied and there is no redundant LO. $\square $
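
The inequality chain in this proof can be checked numerically. The sketch below is an illustration, not part of the paper: it assumes integer time stamps and encodes only the two bounds derived above, $b'_2-b'_1>4\epsilon $ and $b'_2-b'_1\le \varDelta -\delta $. The function and parameter names (`covering_gap_exists`, `big_delta`) are hypothetical.

```python
def covering_gap_exists(eps: int, delta: int, big_delta: int) -> bool:
    """Is there a value g = b'_2 - b'_1 with 4*eps < g <= big_delta - delta?

    The proof needs such a gap for the window of O to be covered by its
    neighbouring LOs, so False means the covering condition cannot hold.
    """
    return 4 * eps < big_delta - delta


def lemma4_hypotheses(eps: int, delta: int, big_delta: int) -> bool:
    """Hypotheses of Lemma 4: eps >= delta / 4 and delta <= big_delta < 2 * delta."""
    return 4 * eps >= delta and delta <= big_delta < 2 * delta


# Hypotheses hold (eps=2, delta=8, big_delta=12): no admissible gap exists.
print(lemma4_hypotheses(2, 8, 12), covering_gap_exists(2, 8, 12))  # True False
# Hypotheses fail (eps=1): a gap such as g=5 or g=6 becomes possible.
print(lemma4_hypotheses(1, 8, 14), covering_gap_exists(1, 8, 14))  # False True
```

Whenever `lemma4_hypotheses` returns `True`, we have $\varDelta -\delta <\delta \le 4\epsilon $, so `covering_gap_exists` must return `False`, which is the content of the lemma.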

Theorem 1

Given the episode $\alpha $ and $(r,s)\in RS$, the algorithm SMakeLOList finds exactly the non-redundant LOs of $\beta =\alpha \oplus (r,s)$.

Proof

To prove this theorem, we focus on the span of LOs in the LOList of episodes. The proof includes three parts.

(1) Every occurrence extracted by the algorithm is an LO. Given $\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}$ and $G'=(r,s)\in RS$, we have $\beta =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k-1}\rightarrow G'$. The proof is by contradiction: assume there is at least one extracted occurrence of $\beta $ that is not the latest occurrence. For this occurrence, assume there are the corresponding occurrences $O_{\alpha }$ and $O_{G}$ of $\alpha $ and G with spans $[r,r']$ and $[x,x']$ respectively. There are two cases: (a) The gap constraints have not been satisfied, which is impossible due to line 11 of the algorithm. (b) There is another LO $Q_{\alpha }$ of the episode $\alpha $ with the span $[u,u']$ that satisfies the gap constraints for $[x,x']$ and $u>r$, $u'>r'$. Otherwise, $[r,r']$ and $[x,x']$ could form an LO for $\beta $. So, we have:

$$\begin{aligned} {\left. \begin{aligned} r'+\delta \le x\le r'+\varDelta \\ u\ge r, u'>r' \\ u'+\delta \le x\le u'+\varDelta \end{aligned}\right\} } \rightarrow {r'+\delta<u'+\delta \le x\le r'+\varDelta <u'+\varDelta } \end{aligned}$$

(A.3)

Since $[r,r']$ is before $[u,u']$, for every $[f,f']\in LOList(\alpha )$ with $r\le f\le u$, we have:

$$\begin{aligned} r'< f'< u'\quad \text {and}\quad r'+\delta<f'+\delta<u'+\delta \le x\le r'+\varDelta<f'+\varDelta <u'+\varDelta \end{aligned}$$

(A.4)

Since line 7 of the algorithm is satisfied for $O_{\alpha }$, $[r,r']$ could not be the latest prefix occurrence.

(2) The extracted LOs are non-redundant. According to Definition 18, a redundant LO satisfies three conditions. In lines 15 to 27, these conditions are checked. According to Lemma 3, it is sufficient to investigate the immediately next and previous LOs. The first condition of Definition 18 is considered in line 19. The second and third conditions are investigated in lines 20 and 21 respectively. So if all the conditions are satisfied, $LOList(\beta )[z-1]$ is redundant and is removed in line 22.

(3) It should be proved that “all” of the non-redundant LOs are found. The proof is by contradiction: suppose there is at least one non-redundant LO $O_{\beta }$ of $\beta $, composed of $O_{\alpha }$ and $O_{G}$ with spans $[r,r']$ and $[x,x']$ respectively, which is not extracted. Since this LO is non-redundant, the conditions of lines 19 to 22 are not satisfied. Since LOListRS(G) is complete, there are two cases for $O_{\alpha }$:

(a) $[r,r']\in LOList(\alpha )$: it is checked in the first while loop. If $[x,x']$ is not checked for $[r,r']$, it means that another latest prefix occurrence has been found for it previously. So $[r,r']$ could not be the latest prefix occurrence. When $[x,x']$ is checked for $[r,r']$, if $[u,u']$ is after $[r,r']$ in $LOList(\alpha )$, then $u>r$. So we have $u'>r'$ according to Lemma 8. Since $[r,x']$ is not redundant, it means that $VI([x,x'],k+1)$ is not covered or $[x,x']$ extends concurrently or there is not an LO of $\beta $ with the span $[r,y']$ such that $y'<x'$. So if $[r,x']$ is not redundant, then $[u,x']$ is not redundant. Thus, the non-redundant LO with the span $[u,x']$ is formed. Therefore, $[r,r']$ could not be an LPO and $[r,x']$ is not an LO, which contradicts the assumption.

(b) $[r,r']\notin LOList(\alpha )$: there is a latest occurrence with the span $[u,r']$ such that $u>r$. Since $r'$ satisfies the gap constraints with $x$, the latest occurrence with the span $[u,r']$ also satisfies the gap constraints with $[x,x']$. So there is another valid occurrence with the span $[u,x']$ for which there exists $j$, $1\le j\le k-2$, such that the starting interval of $G'_j$ in the span $[u,r']$ is greater than that in $[r,r']$. So the occurrence $O_{\beta }$ is not an LO. Therefore, if $[r,r']\in LOList(\alpha )$, all the non-redundant LOs whose LPO is $[r,r']$ are extracted. $\square $
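
The gap-constraint reasoning in part (1) can be illustrated with a small Python sketch. It is not the SMakeLOList algorithm: it only shows, under the assumption that the end times $r'$ of the LOs of $\alpha $ are available as a sorted list of integers, how the latest prefix end satisfying $r'+\delta \le x\le r'+\varDelta $ could be selected for a candidate start $x$. The name `latest_prefix_end` and its parameters are hypothetical.

```python
from bisect import bisect_right
from typing import List, Optional


def latest_prefix_end(prefix_ends: List[int], x: int,
                      delta: int, big_delta: int) -> Optional[int]:
    """Return the largest r' in the sorted list prefix_ends with
    r' + delta <= x <= r' + big_delta, or None if no r' qualifies."""
    # r' + delta <= x is equivalent to r' <= x - delta.
    i = bisect_right(prefix_ends, x - delta) - 1
    # The largest candidate is the only one to test: any smaller r' has a
    # smaller upper bound r' + big_delta, so it fails whenever this one fails.
    if i >= 0 and x <= prefix_ends[i] + big_delta:
        return prefix_ends[i]
    return None


ends = [3, 5, 9]
print(latest_prefix_end(ends, x=8, delta=2, big_delta=6))   # 5
print(latest_prefix_end(ends, x=20, delta=2, big_delta=6))  # None
```

For `x=8`, both `3` and `5` satisfy the gap constraints, and the later end `5` is returned, which corresponds to preferring $[u,u']$ over $[r,r']$ in (A.3).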

Lemma 9

Given the episode $\alpha $, $(r,s)\in RS$, $|ResourceType|=r$, $|LOList(\alpha )|=q$ and $|LOListRS(r,s)|=p$, the time complexity of the algorithm SMakeLOList is $O(p(r+\frac{\delta }{\epsilon })+q)$ in the worst case and O(p) or O(q) in the best cases.

Proof

Generally, the time complexity of the algorithm SMakeLOList is $O(k\% \times p\times r+q-f)$ where $0\le k\le 100$ and $0\le f\le q$. It means that when $k\%$ of LOListRS(r, s) have been traversed by the f elements of $LOList(\alpha )$, a member of LOListRS(r, s) is met that should be compared with the remaining $q-f$ elements of $LOList(\alpha )$. In the worst case, the first element of $LOList(\alpha )$ connects to all the $p-1$ elements of LOListRS(r, s) and the last element of LOListRS(r, s) connects to no element of $LOList(\alpha )$. Furthermore, for all the extracted occurrences, the functions CExtending and FindIndex are called. The time complexity of CExtending is O(r). For the redundant LOs we have $ b_3+\delta \le b_1+\varDelta $. In addition, in [11], we proved that $ \varDelta -\delta <\delta $. So we have $ b_1<b_2<b_3<b_1+\delta $. On the other hand, in [11], we proved that if $ [x_1,x_1] $ and $ [y_1,y_1] $ are two consecutive starting intervals of (r, s), then $y_1-x_1>\epsilon $. So the time complexity of FindIndex is $ O(\frac{\delta }{\epsilon }) $. Therefore, the overall time complexity is $O(p(r+\frac{\delta }{\epsilon })+q)$. In the best case, the first element of $LOList(\alpha )$ connects to all the members of LOListRS(r, s) or the first element of LOListRS(r, s) connects to no element of $LOList(\alpha )$. In these cases, the functions CExtending and FindIndex are not called either. Therefore, the time complexity is O(p) and O(q) respectively. $\square $

Theorem 2

Given the episode $\alpha =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow G'_{k}$ and $G=(r,s)\in RS$, the algorithm CMakeLOList finds exactly the non-redundant LOs of $\beta =\alpha \odot G$.

Proof

The proof of the theorem includes three parts.

(1) Every occurrence extracted by the algorithm is an LO. According to the definition of the concurrent extension, we have $\beta =G'_1\rightarrow G'_2\rightarrow \cdots \rightarrow (G'_k\cup G=G')$. Since $LOList(\beta )$ is constructed based on $LOList(\alpha )$, all the occurrences extracted by the algorithm satisfy the definition of LO.

(2) The extracted LOs are non-redundant. According to Definition 18, a redundant LO satisfies three conditions. In lines 14 to 26, these conditions are checked. According to Lemma 3, it is sufficient to investigate the immediately next and previous LOs. The first condition of Definition 18 is considered in line 18. The second and third conditions are investigated in lines 19 and 21 respectively. So if all the conditions are satisfied, $LOList(\beta )[z-1]$ is redundant and is removed in line 22.

(3) It should be proved that “all” of the non-redundant LOs are found. The proof is by contradiction: assume there is at least one non-redundant LO $A$ of $\beta $ that is not extracted. According to the previous parts, all the extracted LOs are non-redundant. Then it means that the algorithm does not recognize the occurrence $A$ as an LO. Since each LO of $\beta $ includes one LO of $\alpha $ and one LO of G, it means that $LOList(\alpha )$ or LOListRS(G) is not complete or the algorithm could not find the occurrence $A$ of $\beta $. Since $LOList(\alpha )$ and LOListRS(G) are complete and line 9 of the algorithm checks the concurrent extensions of $\alpha $ with G, $A$ and all the possible LOs of $\beta $ are extracted. Therefore, $A$ is extracted by the algorithm, which contradicts the assumption. $\square $

Lemma 10

Given the episode $\alpha $, $G=(r,s)\in RS$, $|LOList(\alpha )|=q$ and $|LOListRS(r,s)|=p$ and $|ResourceType|=r$, time complexity of the algorithm CMakeLOList is $O(p(r+\frac{\delta }{\epsilon })+q)$ or $O(p+q(r+\frac{\delta }{\epsilon }))$ in the worst cases and $O(\min (p,q))$ in the best case.

Proof

In the best case, the corresponding element of $LOList(\alpha )[i]$ matches the corresponding element of LOListRS(G)[j] and both the counters i and j increase repeatedly. Furthermore, all the LOs are non-redundant and the functions CExtending and FindIndex are called for none of the identified LOs. So, time complexity is $O(\min (p,q))$. In the worst case, each element of $LOList(\alpha )$ is checked with the t elements of LOListRS(G) where $ 1\le t\le p $, then an LO is identified for each element of $LOList(\alpha )$ for which the functions CExtending and FindIndex are called in a similar way to Lemma 9. So time complexity is $O(q(r+\frac{\delta }{\epsilon })+p)$. In the other case, each element of LOListRS(G) is checked with the w elements of $LOList(\alpha )$ where $ 1\le w\le q $, then an LO is identified for each element of LOListRS(G) for which the functions CExtending and FindIndex are called. So time complexity is $O(p(r+\frac{\delta }{\epsilon })+q)$. $\square $
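The two-counter traversal behind these bounds can be illustrated with a minimal sketch. It is not the authors' CMakeLOList: it assumes, purely for illustration, that each list is a sorted list of integer start times and that two entries are concurrent when their start times are equal. The redundancy checks and the calls to CExtending and FindIndex are omitted, and the name `concurrent_matches` is hypothetical.

```python
def concurrent_matches(alpha_starts, g_starts):
    """Return the start times that appear in both sorted lists.

    Simplifying assumption: integer start times with exact equality stand in
    for the starting intervals stored in LOList(alpha) and LOListRS(G).
    """
    i, j = 0, 0
    matches = []
    while i < len(alpha_starts) and j < len(g_starts):
        if alpha_starts[i] == g_starts[j]:
            # Both counters advance together, as in the best case of the proof.
            matches.append(alpha_starts[i])
            i += 1
            j += 1
        elif alpha_starts[i] < g_starts[j]:
            i += 1
        else:
            j += 1
    return matches


# Each loop iteration advances at least one counter, so the loop runs at most
# p + q times; when every pair matches, it runs min(p, q) times.
print(concurrent_matches([1, 4, 7, 9], [4, 5, 9, 12]))
```

The additional $r+\frac{\delta }{\epsilon }$ factor in the worst-case bounds comes from the per-occurrence calls that this sketch leaves out.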

Lemma 11

Given $|ResourceType|=r$, time complexity of the algorithm CreateBranch (Algorithm 9) for the episode $\alpha $ is $O(r|CNG_{\alpha }|)$.

Proof

Algorithm 9 has a for loop that processes one CNG of $\alpha $ in each iteration. Since episodes are represented in the form of SAVE [11], we have $|RArray_{\alpha }[j]|=O(r)$ where $ 1\le j\le |CNG_{\alpha }|$. Thus, the time complexity of reversing each CNG is O(r). So, the time complexity of the algorithm is $O(r|CNG_{\alpha }|)$. $\square $

Lemma 12

Given the episodes $\alpha $ and $\beta $ and the threshold $\theta \in {\mathbb {R}}_{\ge 0}$ , if $\beta \sqsubseteq \alpha $ and $freq(\alpha )\ge \theta $, then $\theta \le freq(\alpha )\le freq(\beta )$ (the anti-monotonic constraint).

Proof

Since $\beta \sqsubseteq \alpha $, each occurrence of the episode $\alpha $ includes an occurrence of $\beta $. So $\forall O_i\in OSet^{N}_{M}(\alpha ),\ \exists O'_{i}\subseteq O_i$ such that $O'_i\in OSet^{N}_{M}(\beta )$ (see Lemma 2). Therefore, $|OSet^{N}_{M}(\alpha )|\le |OSet^{N}_{M}(\beta )|$ or $freq(\alpha )\le freq(\beta )$. $\square $

Lemma 13

The function InsertInCPBT is only called for FC episodes.

Proof

According to lines 31 to 33 of Algorithm 8, when Flag is True, the function InsertInCPBT is called for the episode $\alpha $. At first, Flag is True. When there exists a serial extension or a concurrent extension of $\alpha $ whose frequency is equal to $\alpha $’s, Flag is set to False. In that case, according to Definition 20, $\alpha $ is not FC. Conversely, if there are no such episodes, then according to Lemma 12 the frequency of every episode extended from $\alpha $ is less than $\alpha $’s. Therefore, $\alpha $ is FC, the value of Flag remains True and the function InsertInCPBT is called for it. $\square $

Lemma 14

Given the episode $\alpha $, if the function FindClosedFreqEpisode is not called for $\alpha $, then $\alpha $ is not a closed frequent episode.

Proof

The function FindClosedFreqEpisode is called by the function AllClosedFreqEpisodes and by itself. The function AllClosedFreqEpisodes calls it for $P=\{(r,s)|\;|LOListRS(r,s)|>0,\forall (r,s)\in RS\}$. The function AllClosedFreqEpisodes does not call the function FindClosedFreqEpisode for the members of RS that are not frequent. If an episode is not frequent, then it cannot be closed frequent either. The function FindClosedFreqEpisode is recursively called for the serial and concurrent extensions whose frequency is larger than the threshold value c (see lines 13 and 27 of Algorithm 8). So if this function is not called for an episode $\alpha $, it means that $freq(\alpha )$ is less than c and it is not a frequent episode. So it cannot be a closed frequent episode either. $\square $

Theorem 3

The algorithm AllClosedFreqEpisodes only finds all the closed episodes.

Proof

The proof includes two parts: (1) all the extracted episodes are closed. The proof is by contradiction: assume there is an episode $\alpha $ such that $|CNG_{\alpha }|=k$ and it is not closed. It means that at least one of the scenarios below occurs:

There is at least one episode $\beta _1$ such that $\alpha =Prefix(\beta _1,k)$ and $freq(\alpha )=freq(\beta _1)$: Since $\alpha <\beta _1$, $\alpha $ is processed sooner. While processing $\alpha $, all the serial and concurrent extensions of $\alpha $ are generated. If the frequency of one of them is equal to $\alpha $’s, Flag is set to False in Algorithm 8. So $\alpha $ is not inserted in ${\textit{C}PBT}$. It means that such an episode cannot be found in ${\textit{C}PBT}$.
There is at least one episode $\beta _2$ such that $|CNG_{\beta _2}|=n$, $\alpha =Suffix(\beta _2,n-k+1)$ and $freq(\alpha )=freq(\beta _2)$. There are two cases: (a) $\alpha <\beta _2$: In this case, $\alpha $ is inserted in ${\textit{C}PBT}$ sooner. Since $\alpha $ is FC and $\alpha =Suffix(\beta _2,n-k+1)$, $\beta _2$ is also FC and according to Lemma 13, it is inserted in ${\textit{C}PBT}$. While inserting $\beta _2$, the algorithm SearchInTree detects that $\alpha $ is absorbed by $\beta _2$. Therefore ${\textit{C}PBT}$ is updated and $\alpha $ is removed from it. (b) $\beta _2<\alpha $: Since $\beta _2$ is inserted in ${\textit{C}PBT}$ sooner, the function SearchInTree returns $-\,1$ and $\alpha $ is not inserted in ${\textit{C}PBT}$. Therefore, $\alpha $ does not exist in ${\textit{C}PBT}$. It means that all the episodes stored in ${\textit{C}PBT}$ are closed.

(2) All of the closed episodes are found. The proof is by contradiction: assume there exists at least one closed episode $\alpha $ which has not been found. There are two cases: (a) $\alpha $ has not been inserted in ${\textit{C}PBT}$. Since $\alpha $ is a closed episode, it is FC. So when the function FindClosedFreqEpisode is called for $\alpha $, it is inserted in ${\textit{C}PBT}$. Therefore, if the function InsertInCPBT has not been called for $\alpha $, FindClosedFreqEpisode has not been called for it either, and then, according to Lemma 14, $\alpha $ cannot be a closed frequent episode, which contradicts the assumption. So the function InsertInCPBT is called for $\alpha $. (b) $\alpha $ has been removed from ${\textit{C}PBT}$. It means that there is another FC episode $\beta $ whose suffix is $\alpha $ and $freq(\alpha )=freq(\beta )$. So, according to the definition of closed episodes, $\alpha $ is not a closed episode, which contradicts the assumption. Therefore, all the closed episodes are extracted. $\square $

B Algorithms

In this appendix, we present some functions in a canonical form and explain them in detail.

B.1 Function CExtending

This function checks whether the third condition of Definition 18 is satisfied for an occurrence of the episode. As Algorithm 7 shows, the function receives (r, s), which is the last member of the last CNG in the episode, the corresponding entry of LOListRS(r, s) in the occurrence and the starting interval of the last CNG in the occurrence. It checks whether a concurrent event with (r, s) exists or not. The state of the concurrent event should be greater than (r, s) based on the order defined on RS. In lines 1 to 3, if the pointers Next and Previous are null, there is no concurrent event for it, so it cannot be extended concurrently. In lines 4 to 12, the list linked to the pointer Next is searched for the concurrent event. In lines 13 to 21, the list linked to the pointer Previous is searched in the same way.

B.2 Function FindClosedFreqEpisode

The algorithm FindClosedFreqEpisode (Algorithm 8) receives the parameters $ \delta $, $ \varDelta $, $ \epsilon $, the thresholds $\theta $ and Level, the episode $\alpha $ and $LOList(\alpha )$. It forms the concurrent and serial extensions of $\alpha $ (lines 6 and 20) as the episode $\beta $ and computes $LOList(\beta )$ by calling the functions CMakeLOList and SMakeLOList (lines 7 and 21). Then the NO frequency of $\beta $ is computed by calling the function ComputeFreq (lines 8 and 22). If $freq(\beta )$ is above the threshold c (computed based on $\theta $ [11]), the tree is traversed further down by calling FindClosedFreqEpisode recursively in lines 13 and 27 with $\beta $ and $LOList(\beta )$ as its parameters. When the serial and concurrent extensions of $\alpha $ are constructed, it is checked (in lines 9 and 23) whether any of the super patterns $\beta $ formed from $\alpha $ has the same frequency as $\alpha $; if none does, $\alpha $ is FC. So the function InsertInCPBT is called to insert the FC episode $\alpha $ in ${\textit{C}PBT}$ (line 32).
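The Flag logic described above can be sketched on a deliberately simplified toy setting. This is not Algorithm 8: the sketch assumes that episodes are contiguous substrings of a character stream, that the only extension is appending one symbol (so there are no concurrent extensions), that frequency is the number of possibly overlapping occurrences, and that "above the threshold" means `>= c`. All function names are hypothetical, and a plain dictionary stands in for ${\textit{C}PBT}$.

```python
def count_occurrences(stream, pattern):
    """Count (possibly overlapping) contiguous occurrences of pattern."""
    n, m = len(stream), len(pattern)
    return sum(1 for k in range(n - m + 1) if stream[k:k + m] == pattern)


def find_forward_closed(alpha, stream, alphabet, c, fc):
    """Recursive skeleton mirroring the Flag test of the description above."""
    flag = True
    f_alpha = count_occurrences(stream, alpha)
    for symbol in alphabet:
        beta = alpha + symbol              # toy stand-in for an extension
        f_beta = count_occurrences(stream, beta)
        if f_beta == f_alpha:
            flag = False                   # an extension is as frequent as alpha
        if f_beta >= c:
            find_forward_closed(beta, stream, alphabet, c, fc)
    if flag:
        fc[alpha] = f_alpha                # stand-in for inserting alpha in CPBT


def all_forward_closed(stream, c):
    """Start the recursion from every frequent single symbol."""
    alphabet = sorted(set(stream))
    fc = {}
    for symbol in alphabet:
        if count_occurrences(stream, symbol) >= c:
            find_forward_closed(symbol, stream, alphabet, c, fc)
    return fc


print(all_forward_closed("abcabcab", 2))
```

In this toy run the result still contains patterns such as `b` and `cab` that are suffixes of an equally frequent pattern; removing those is the job the description assigns to the tree, which the sketch does not model.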

B.3 Function CreateBranch

The algorithm CreateBranch receives the FC episode $\alpha $ and its frequency, converts them into a branch of ${\textit{C}PBT}$ and returns a pointer to this branch. In lines 3 to 16, in the backward direction, CNGs of $\alpha $ are processed. In lines 4 and 5, the order of members of CNG is reversed. In lines 6 to 15, for each CNG, a Node is created and added to the end of the branch. Finally, $L_1$, which is a pointer to the first of the branch, is returned in line 17.
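A minimal sketch of this construction follows. It assumes, for illustration only, that an episode is a list of CNGs, that each CNG is a list of (r, s) pairs, and that every node of the branch stores the frequency of the episode. The `Node` class and the name `create_branch` are hypothetical and not taken from Algorithm 9.

```python
class Node:
    """Hypothetical tree node: a label (one reversed CNG), a frequency, children."""

    def __init__(self, label, freq):
        self.label = label
        self.freq = freq
        self.children = []


def create_branch(episode, freq):
    """Build a single-path branch from an episode, last CNG first."""
    first = None
    last = None
    for cng in reversed(episode):               # process CNGs backward
        node = Node(tuple(reversed(cng)), freq)  # reverse the members of the CNG
        if first is None:
            first = node                         # head of the branch
        else:
            last.children.append(node)           # append to the end of the branch
        last = node
    return first


episode = [[("cpu", 1), ("mem", 2)], [("cpu", 3)]]
branch = create_branch(episode, 4)
print(branch.label, branch.children[0].label)
```

Each CNG is visited once and reversing it costs time proportional to its size, which is consistent with the $O(r|CNG_{\alpha }|)$ bound of Lemma 11.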

B.4 Function EpisodeAbsorbByTree

Algorithm 10 checks whether ${\textit{C}PBT}$ could absorb the episode $ \alpha $ or not. Finding at least one branch that absorbs $ \alpha $ is sufficient to omit it. In line 1, the function finds the children of the node R whose label includes $\alpha '.label$ and whose frequency is greater than $\alpha '.freq$. Note that $\alpha '$ is the pointer to the first node of the corresponding branch of the episode $\alpha $. In lines 2 to 4, if there are no such children, $\alpha '$ cannot be absorbed by ${\textit{C}PBT}$ and the function returns False. In lines 5 to 12, if the last node of $\alpha '$ is being checked, the function checks whether there exists a member of SubSetChildren that satisfies the condition of equality of the frequency (see function CheckFreq in B.7). If such an episode is found, $\alpha $ could be absorbed by ${\textit{C}PBT}$. So $\alpha $ is not a closed episode and the function returns True. In line 11, if there exists no such episode, the function returns False. In lines 12 to 19, the middle nodes of the branch $\alpha '$ are checked to see whether there exists a super-episode that absorbs $\alpha $. As soon as such a super-episode is found, the function returns True in line 15.

B.5 Function TreeAbsorbByEpisode

Algorithm 11 finds all the branches of ${\textit{C}PBT}$ that are absorbed by the episode $\alpha $ ($ \alpha ' $ is the corresponding branch of $ \alpha $). The path of each of these branches is completed in Path. The completed paths are added to PathList. Finally, PathList includes all the paths whose corresponding episodes should be removed from ${\textit{C}PBT}$. In line 1, the children of the node R whose label is a subset of $\alpha '.label$ and whose frequency is greater than or equal to $\alpha '.freq$ are found. All the found children are considered in lines 2 to 13. In lines 5 to 7, if an episode absorbed by $\alpha $ is found, the corresponding Path of the episode is added to PathList and Path is updated. In lines 8 to 9, the middle nodes of $\alpha '$ are checked to find the episodes that could be absorbed by $\alpha $. In lines 10 and 11, since the last node of $\alpha '$ is met, Path is updated.

B.6 Function UpdateBranch

After the non-closed episodes of the tree are recognized by Algorithm 11, their corresponding branches should be updated. The function UpdateBranch (Algorithm 12) updates these branches based on the frequency of the episodes. The function receives Path and freq of a non-closed episode and the node R from which the downward search starts. In lines 2 to 3, the function starts the search of Path from the node R and decreases the frequency of the node corresponding to Path[1] by freq. If the frequency of the node is 0, it means that the frequency of its corresponding episode and all of its super-episodes is 0. So in lines 4 and 5, that node and its subtree are removed. Otherwise, this procedure is repeated for the remaining entries of Path.
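The update step can be sketched as follows, assuming for illustration that Path is a list of node labels, that labels are unique among siblings, and that each node stores a cumulative frequency. The `Node` class and the name `update_branch` are hypothetical and not taken from Algorithm 12.

```python
class Node:
    """Hypothetical tree node with a label, a frequency and a list of children."""

    def __init__(self, label, freq):
        self.label = label
        self.freq = freq
        self.children = []


def update_branch(root, path, freq):
    """Subtract freq along path below root; prune a node once it reaches 0."""
    if not path:
        return
    for child in root.children:
        if child.label == path[0]:
            child.freq -= freq
            if child.freq == 0:
                # The node and its whole subtree are dropped together.
                root.children.remove(child)
            else:
                update_branch(child, path[1:], freq)
            return


root = Node("R", 0)
a, b, c, d = Node("A", 5), Node("B", 3), Node("C", 3), Node("D", 2)
root.children.append(a)
a.children.extend([b, d])
b.children.append(c)

update_branch(root, ["A", "B", "C"], 3)
print(a.freq, [n.label for n in a.children])
```

In this example the node A keeps a frequency of 2 because another branch still passes through it, while B reaches 0 and is removed together with C.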

B.7 Functions CheckFreq and ComputeNodeFreq

The function CheckFreq (Algorithm 13) checks whether there is an episode in the sub-tree of the node n of ${\textit{C}PBT}$ whose frequency is equal to the frequency of the episode $\alpha $. This function receives $\alpha '$ (the corresponding branch of $\alpha $) and the node n of ${\textit{C}PBT}$. It returns True if such an episode exists in ${\textit{C}PBT}$. In lines 1 and 2, it is checked whether freq(Episode(n)) is equal to $freq(\alpha )$ or not. If not, the children of n are traversed by calling CheckFreq recursively in lines 4 to 8. As soon as a child with the frequency freq is found, the search is stopped and True is returned. The function ComputeNodeFreq (Algorithm 14) computes the frequency of Episode(n). For this purpose, in lines 2 to 4, the frequency of the node n is decreased by the sum of the frequencies of its children. The result of ComputeNodeFreq(n) is therefore freq(Episode(n)). If $ComputeNodeFreq(n)>0$, then Episode(n) has occurred in the stream.
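The two functions can be sketched as below. The sketch assumes that each node stores a cumulative frequency, that CheckFreq compares the value returned by ComputeNodeFreq with $freq(\alpha )$, and that the frequency of $\alpha $ is passed directly instead of the branch $\alpha '$. The `Node` class and the function names are hypothetical.

```python
class Node:
    """Hypothetical tree node with a cumulative frequency."""

    def __init__(self, label, freq, children=None):
        self.label = label
        self.freq = freq
        self.children = children if children is not None else []


def compute_node_freq(n):
    """Frequency of the episode ending at n: n's count minus its children's."""
    return n.freq - sum(child.freq for child in n.children)


def check_freq(alpha_freq, n):
    """True if some episode in the subtree of n has frequency alpha_freq."""
    if compute_node_freq(n) == alpha_freq:
        return True
    # any() stops at the first child whose subtree contains a match.
    return any(check_freq(alpha_freq, child) for child in n.children)


z = Node("z", 3)
x = Node("x", 3, [z])
y = Node("y", 1)
n = Node("n", 5, [x, y])

print(compute_node_freq(n), check_freq(3, n), check_freq(2, n))
```

Here the node x has a computed frequency of 0, so under the reading above its own episode did not occur in the stream even though the node lies on the path to z.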

B.8 Function ExtractClosedEpisodeFromCPBT

Algorithm 15 shows the function ExtractClosedEpisodeFromCPBT. The main loop of Algorithm 15 traverses ${\textit{C}PBT}$ until there is no node except the root of ${\textit{C}PBT}$. In line 3, the traversal starts from the leftmost child. In lines 6 to 11, the leftmost branch of ${\textit{C}PBT}$ is found. The corresponding episode of this branch is stored in the episode $\alpha $. Since episodes have been inserted in the backward direction in ${\textit{C}PBT}$, $\alpha $ is added to ClosedSet in reverse order in line 12. Furthermore, the branch should be updated. In lines 13 to 25, the frequency of all the nodes of the branch is decreased by the frequency of $\alpha $. In lines 15 to 18, if the frequency of a node is 0, then that node and its subtree are removed. Finally, in line 27, the algorithm returns ClosedSet, which includes all the closed frequent episodes.
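The extraction loop can be sketched as follows, under several assumptions that are not stated in the description: each node stores a cumulative frequency, the frequency of the extracted episode is the frequency stored in the leaf of the leftmost branch, and labels are opaque strings, so the members inside each CNG are not reversed back. The `Node` class and the name `extract_closed_episodes` are hypothetical.

```python
class Node:
    """Hypothetical tree node with a cumulative frequency."""

    def __init__(self, label, freq, children=None):
        self.label = label
        self.freq = freq
        self.children = children if children is not None else []


def extract_closed_episodes(root):
    """Repeatedly peel off the leftmost branch until only the root remains."""
    closed = []
    while root.children:
        # Follow the leftmost child down to a leaf.
        branch = [root]
        while branch[-1].children:
            branch.append(branch[-1].children[0])
        nodes = branch[1:]
        freq = nodes[-1].freq  # assumed frequency of the extracted episode
        # The tree stores episodes backward, so reverse the path.
        closed.append(([n.label for n in reversed(nodes)], freq))
        # Subtract the frequency along the branch and prune at the first zero.
        for parent, node in zip(branch, nodes):
            node.freq -= freq
            if node.freq == 0:
                parent.children.remove(node)  # drops the node and its subtree
                break
    return closed


b = Node("B", 3)
a = Node("A", 5, [b])
root = Node("R", 0, [a])
print(extract_closed_episodes(root))
```

In this example the first pass extracts the two-node branch with frequency 3 and leaves A with a residual frequency of 2, which the second pass extracts as a one-node episode.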

About this article

Cite this article

Amiri, M., Mohammad-Khanli, L. & Mirandola, R. A new efficient approach for extracting the closed episodes for workload prediction in cloud. Computing 102, 141–200 (2020). https://doi.org/10.1007/s00607-019-00734-3

Received: 09 October 2018
Accepted: 06 June 2019
Published: 13 June 2019
Issue Date: January 2020
DOI: https://doi.org/10.1007/s00607-019-00734-3
