Scalable and parallel sequential pattern mining using spark

Yu, Xiao; Li, Qing; Liu, Jin

doi:10.1007/s11280-018-0566-1

Published: 10 May 2018

Volume 22, pages 295–324 (2019)

World Wide Web

Xiao Yu¹,², Qing Li² & Jin Liu¹

Abstract

The performance of existing parallel sequential pattern mining algorithms is often unsatisfactory because of high IO overhead and imbalanced load among the computing nodes. To address these problems, this paper proposes two efficient parallel sequential pattern mining algorithms based on Spark: GSP-S (the GSP algorithm based on Spark) and PrefixSpan-S (the PrefixSpan algorithm based on Spark). In both algorithms, a mining task is completed by multiple MapReduce jobs. To reduce IO overhead and take advantage of cluster memory, the first MapReduce job loads the sequence database from the Hadoop Distributed File System (HDFS) into Spark resilient distributed datasets (RDDs); subsequent MapReduce jobs read the database from the RDDs and store intermediate results back into the RDDs. Our findings suggest that the choice between GSP-S and PrefixSpan-S should depend on the user-specified minimum support threshold. Moreover, theoretical analysis shows that GSP-S and PrefixSpan-S are sensitive to data distribution on the cluster. To further improve performance, we propose two database partition strategies to balance load among the computing nodes in a cluster. Experimental results demonstrate the high performance of GSP-S and PrefixSpan-S in terms of load balancing, speedup and scalability.
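The role of the minimum support threshold and of partitioned map and reduce steps can be illustrated with a minimal sketch. It is not the paper's GSP-S or PrefixSpan-S: it only counts the support of single items, it uses plain Python lists in place of RDD partitions, and the toy database and the names `partition`, `map_partition` and `frequent_items` are assumptions made for illustration.

```python
from collections import Counter
from functools import reduce

# Toy sequence database (hypothetical data): each sequence is a list of itemsets.
database = [
    [{"a"}, {"b", "c"}, {"a"}],
    [{"a", "b"}, {"c"}],
    [{"b"}, {"c"}, {"a"}],
    [{"c"}],
]


def partition(db, num_parts):
    """Split the database round-robin into chunks (a stand-in for RDD partitions)."""
    return [db[i::num_parts] for i in range(num_parts)]


def map_partition(chunk):
    """Map step: per partition, count how many sequences contain each item."""
    counts = Counter()
    for sequence in chunk:
        # An item contributes at most once per sequence to its support.
        items = set().union(*sequence)
        counts.update(items)
    return counts


def frequent_items(db, min_support, num_parts=2):
    """Reduce step: merge partition counts and keep items meeting min_support."""
    partial = [map_partition(chunk) for chunk in partition(db, num_parts)]
    total = reduce(lambda a, b: a + b, partial, Counter())
    return {item: n for item, n in total.items() if n >= min_support}


result = frequent_items(database, min_support=4)
```

Lowering `min_support` keeps more items and therefore produces more candidates for later passes, which is one reason the threshold can influence which mining strategy performs better. The merged counts do not depend on how the database is partitioned, but the work done per partition does, which is where a partition strategy would matter.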

Author information

Authors and Affiliations

State Key Lab. of Software Engineering, School of Computer Science, Wuhan University, Wuhan, 430072, China
Xiao Yu & Jin Liu
Department of Computer Science, City University of Hong Kong, Hong Kong, 999077, China
Xiao Yu & Qing Li

Corresponding author

Correspondence to Jin Liu.

About this article

Cite this article

Yu, X., Li, Q. & Liu, J. Scalable and parallel sequential pattern mining using spark. World Wide Web 22, 295–324 (2019). https://doi.org/10.1007/s11280-018-0566-1

Received: 14 November 2017
Revised: 20 February 2018
Accepted: 05 April 2018
Published: 10 May 2018
Issue Date: 15 January 2019
DOI: https://doi.org/10.1007/s11280-018-0566-1
