Abstract
Sequential pattern mining (SPM) is an important application in uncertain databases, but it is challenging in terms of efficiency and scalability. In this paper, we develop a dynamic programming (DP) approach to mine probabilistic frequent sequential patterns on the distributed computing platform Spark. Applying the DP method to Spark directly is impractical because its high memory consumption may cause heavy JVM garbage collection overhead. We therefore design a memory-efficient distributed DP approach and use an extended prefix-tree to store intermediate results efficiently. Extensive experiments at various scales show that our method is orders of magnitude faster than straightforward approaches.
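The abstract does not spell out the DP recurrence, so the following is only a minimal single-machine sketch of a formulation commonly used in probabilistic frequent pattern mining, not the paper's distributed algorithm. It assumes that each sequence contains the pattern independently with a known probability; the function name `frequentness_probability` and its parameters `probs` and `min_sup` are hypothetical. It computes the probability that the pattern's support reaches a minimum threshold, keeping only one array of size `min_sup + 1` in memory.

```python
from typing import Sequence


def frequentness_probability(probs: Sequence[float], min_sup: int) -> float:
    """Return P(support >= min_sup), assuming sequence i contains the
    pattern independently with probability probs[i] (an assumption of
    this sketch, not a detail taken from the abstract)."""
    if min_sup <= 0:
        return 1.0
    # dp[j] for j < min_sup: probability that exactly j of the sequences
    # processed so far contain the pattern.
    # dp[min_sup]: probability that at least min_sup of them do.
    dp = [0.0] * (min_sup + 1)
    dp[0] = 1.0
    for p in probs:
        # Update from high to low so dp[j - 1] still holds the old value.
        dp[min_sup] += dp[min_sup - 1] * p
        for j in range(min_sup - 1, 0, -1):
            dp[j] = dp[j] * (1.0 - p) + dp[j - 1] * p
        dp[0] *= 1.0 - p
    return dp[min_sup]


if __name__ == "__main__":
    # A pattern would count as probabilistically frequent if this value
    # exceeds a user-chosen confidence threshold.
    print(frequentness_probability([0.9, 0.8, 0.7], min_sup=2))
```

Collapsing all counts at or above `min_sup` into a single cell keeps the state small regardless of database size, which is one simple way to limit the memory pressure the abstract mentions; how the paper itself achieves this on Spark is not stated here.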
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Ge, J., Xia, Y. (2016). Distributed Sequential Pattern Mining in Large Scale Uncertain Databases. In: Bailey, J., Khan, L., Washio, T., Dobbie, G., Huang, J., Wang, R. (eds) Advances in Knowledge Discovery and Data Mining. PAKDD 2016. Lecture Notes in Computer Science, vol 9652. Springer, Cham. https://doi.org/10.1007/978-3-319-31750-2_2
DOI: https://doi.org/10.1007/978-3-319-31750-2_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-31749-6
Online ISBN: 978-3-319-31750-2
eBook Packages: Computer Science (R0)