Distributed and scalable sequential pattern mining through stream processing

Chen, Chun-Chieh; Shuai, Hong-Han; Chen, Ming-Syan

doi:10.1007/s10115-017-1037-1

Regular Paper
Published: 20 March 2017

Volume 53, pages 365–390 (2017)

Knowledge and Information Systems

Chun-Chieh Chen^1,2, Hong-Han Shuai^3 & Ming-Syan Chen^2,4

Abstract

Scalability is a primary issue for existing sequential pattern mining algorithms when dealing with large amounts of data. Previous work, namely sequential pattern mining on the cloud (SPAMC), has already addressed the scalability problem: it uses the MapReduce cloud computing architecture to mine frequent sequential patterns from large datasets. However, SPAMC does not address the iterative mining problem, in which reloading data incurs additional costs, nor does it study the load balancing problem. To remedy these problems, we devised a powerful sequential pattern mining algorithm, the sequential pattern mining in the cloud-uniform distributed lexical sequence tree algorithm (SPAMC-UDLT), which exploits MapReduce and streaming processes. SPAMC-UDLT dramatically improves overall performance without launching multiple MapReduce rounds and provides perfect load balancing across machines in the cloud. The results show that SPAMC-UDLT significantly reduces execution time, achieves extremely high scalability, and provides much better load balancing than existing algorithms in the cloud.

Notes

OpenMP, http://www.openmp.org/.
MPI, http://www.open-mpi.org/.
If the bitmap vector is extremely sparse, the word-aligned hybrid (WAH) code [44] can serve our purpose. Specifically, WAH is a run-length encoding that compresses the input data into words, and an AND can be performed efficiently on any two such words, so the bitmap representation still works in this situation.
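
The idea in the last note can be sketched in a few lines of Python. This is a simplified illustration and not the authors' implementation, nor the bit-exact WAH word layout: each bitmap is split into 31-bit groups, a run of identical all-0 or all-1 groups is stored as one "fill" word, and any other group is stored as a "literal" word. The tuple representation and the names encode, and_compressed and decode are assumptions made for this sketch. The point it shows is that two compressed bitmaps can be ANDed word by word, skipping over long runs, without decompressing them first.

```python
GROUP = 31                    # payload bits per 32-bit word, as in WAH
ALL_ONES = (1 << GROUP) - 1   # payload of a group whose bits are all 1


def _append(words, payload, count):
    """Append `count` groups with this payload, merging adjacent fills."""
    if payload in (0, ALL_ONES):
        bit = 1 if payload else 0
        if words and words[-1][0] == "fill" and words[-1][1] == bit:
            words[-1] = ("fill", bit, words[-1][2] + count)
        else:
            words.append(("fill", bit, count))
    else:
        # A mixed payload always describes exactly one group.
        words.append(("lit", payload))


def encode(bits):
    """Compress a list of 0/1 values into literal and fill words."""
    padded = list(bits) + [0] * (-len(bits) % GROUP)
    words = []
    for i in range(0, len(padded), GROUP):
        payload = int("".join(map(str, padded[i:i + GROUP])), 2)
        _append(words, payload, 1)
    return words


def _expand(word):
    """Return (payload of one group, number of groups the word covers)."""
    if word[0] == "fill":
        return (ALL_ONES if word[1] else 0), word[2]
    return word[1], 1


def and_compressed(a, b):
    """AND two encoded bitmaps of equal length without decompressing them."""
    out = []
    ia = ib = 0   # index of the next word to read from a and b
    pa = pb = 0   # payload of the current word
    ra = rb = 0   # groups still unread in the current word
    while True:
        if ra == 0:
            if ia == len(a):
                break
            pa, ra = _expand(a[ia])
            ia += 1
        if rb == 0:
            if ib == len(b):
                break
            pb, rb = _expand(b[ib])
            ib += 1
        step = min(ra, rb)          # consume the shorter run in one step
        _append(out, pa & pb, step)
        ra -= step
        rb -= step
    return out


def decode(words, length):
    """Expand encoded words back into a list of 0/1 values."""
    bits = []
    for word in words:
        payload, count = _expand(word)
        group = [int(c) for c in format(payload, f"0{GROUP}b")]
        bits.extend(group * count)
    return bits[:length]
```

For a sparse bitmap the encoded form stays short: 1000 bits with only two bits set compress to four words here, and decode(and_compressed(encode(x), encode(y)), len(x)) gives the same result as ANDing x and y bit by bit.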

References

Apache Hadoop (2012) http://hadoop.apache.org/
Apache Hama (2012) http://hama.apache.org/
Agrawal R, Srikant R (1995) Mining sequential patterns. In: Proceedings of the 11th international conference on data engineering (ICDE’95), pp 3–14
Ayres J, Flannick J, Gehrke J et al (2002) Sequential pattern mining using a bitmap representation. In: Proceedings of the 8th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’02), pp 429–435
Batal I, Valizadegan H, Cooper GF et al (2013) A temporal pattern mining approach for classifying electronic health record data. Trans Intell Syst Technol (TIST’13) 63:1–22
Bu Y, Howe B, Balazinska M et al (2010) Haloop: efficient iterative data processing on large clusters. In: Proceedings of the VLDB endowment (PVLDB’10), pp 285–296
Chen CC, Tseng CY, Chen MS (2013) Highly scalable sequential pattern mining based on MapReduce model on the cloud. In: IEEE international congress on big data (BigData Congress’13), pp 310–317
Chen CC, Shuai HH, Chen MS (2016) Appendix of distributed and scalable sequential pattern mining through stream processing. https://www.csie.ntu.edu.tw/~d96944011/kais2016/appendix
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM (CACM’08) 51:107–113
Ekanayake J, Li H, Zhang B et al (2010) Twister: a runtime for iterative MapReduce. In: Proceedings of the 19th ACM international symposium on high performance distributed computing (HPDC’10), pp 810–818
Fang W, Lu M, Xiao X et al (2009) Frequent itemset mining on graphics processors. In: Proceedings of the 5th international workshop on data management on new hardware (DaMoN’09), pp 34–42
Gomariz A, Campos M, Marin R et al (2013) ClaSP: an efficient algorithm for mining frequent closed sequences. In: Proceedings of the 17th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’13), pp 50–61
Goodhope K, Koshy J, Kreps J et al (2012) Building LinkedIn’s real-time activity data pipeline. IEEE Data Eng Bull (Data Eng Bull’12) 35:33–45
Guralnik V, Karypis G (2004) Parallel tree-projection-based sequence mining algorithms. Parallel Comput (PARALLEL COMPUT’04) 30:443–472
Han J, Pei J, Mortazavi-Asl B et al (2000) FreeSpan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’00), pp 355–359
Han J, Pei J, Yan X (2005) Sequential pattern mining by pattern-growth: principles and extension. Foundations and advances in data mining. Springer, Berlin
Ho J, Lukov L, Chawla S (2005) Sequential pattern mining with constraints on large protein databases. In: Proceedings of the 12th international conference on management of data (COMAD’05), pp 89–100
Huang JW, Tseng CY, Ou JC et al (2008) A general model for sequential pattern mining with a progressive database. IEEE Trans Knowl Data Eng (TKDE’08) 20:1153–1167
Huang JW, Lin SC, Chen MS (2010) DPSP: distributed progressive sequential pattern mining on the cloud. In: Proceedings of the 14th Pacific-Asia conference on knowledge discovery and data mining (PAKDD’10), pp 27–34
Isard M, Budiu M, Yu Y et al (2007) Dryad: distributed data-parallel programs from sequential building blocks. ACM SIGOPS Oper Syst Rev (SIGOPS’07) 41:59–72
Ji X, Bailey J, Dong G (2007) Mining minimal distinguishing subsequence patterns with gap constraints. Knowl Inf Syst (KAIS’07) 11:259–286
Kreps J, Narkhede N, Rao J (2011) Kafka: a distributed messaging system for log processing. NetDB workshop
Liao CC, Chen MS (2014) DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst (KAIS’14) 38:623–639
Luo C, Chung S (2008) A scalable algorithm for mining maximal frequent sequences using a sample. Knowl Inf Syst (KAIS’08) 15:149–179
Mabroukeh NR, Ezeife CI (2010) A taxonomy of sequential pattern mining algorithms. ACM Comput Surv (CSUR’10) 43:1–41
Mane RV (2013) A comparative study of Spam and PrefixSpan sequential pattern mining algorithm for protein sequences. In: Proceedings of the 3rd international conference on advances in computing, communication, and control (ICAC3’13), pp 147–155
Miliaraki I, Berberich K, Gemulla R et al (2013) Mind the gap: large-scale frequent sequence mining. In: Proceedings of the 2013 ACM SIGMOD international conference on management of data (SIGMOD’13), pp 797–808
Papapetrou P, Kollios G, Sclaroff S et al (2009) Mining frequent arrangements of temporal intervals. Knowl Inf Syst (KAIS’09) 21:133–171
Parimala M, Sathiyabama S (2012) SPMLS: an efficient sequential pattern mining algorithm with candidate generation and frequency testing. Int J Comput Sci Eng (IJCSE’12) 4:601–607
Pei J, Han J, Mortazavi-Asl B et al (2001) PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth. In: Proceedings of the 7th ACM SIGKDD international conference on knowledge discovery and data mining (KDD’01), pp 215–224
Perer A, Wang F (2014) Frequence: interactive mining and visualization of temporal frequent event sequences. In: Proceedings of the 19th ACM international conference on intelligent user interfaces (IUI’14), pp 153–162
Sahli M, Mansour E, Kalnis P (2014) ACME: a scalable parallel system for extracting frequent patterns from a very long sequence. VLDB J (VLDBJ’14) 23:871–893
Shie BE, Hsiao HF, Tseng V (2013) Efficient algorithms for discovering high utility user behavior patterns in mobile commerce environments. Knowl Inf Syst (KAIS’13) 37:363–387
Srikant R, Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of the 5th international conference on extending database technology (EDBT’96), pp 3–17
Samza (2013) https://samza.incubator.apache.org/
Storm: distributed and fault–tolerant realtime computation (2012) http://storm.incubator.apache.org/
Spark: Lightning-fast cluster computing (2013) https://spark.incubator.apache.org/
S4: Distributed Stream Computing Platform (2010) https://incubator.apache.org/s4/
Twister: iterative MapReduce (2012) https://iterativemapreduce.org/
White T (2009) Hadoop: the definitive guide. O’Reilly Media, Newton
Wang K, Xu Y, Yu JX (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of the 13th ACM international conference on information and knowledge management (CIKM’04), pp 178–187
Wang X, Wang J, Wang T et al (2010) Parallel sequential pattern mining by transaction decomposition. In: International conference on fuzzy systems and knowledge discovery (FSKD’10), pp 1746–1750
Weng L, Menczer F, Ahn YY (2013) Virality prediction and community structure in social networks. Sci Rep 3. doi:10.1038/srep02522
Wu K, Otoo EJ, Shoshani A (2002) Compressing bitmap indexes for faster search operations. In: Proceedings of 14th international conference on scientific and statistical database management (SSDBM’02), pp 99–108
Yu D, Wu W, Zheng S et al (2012) BIDE-based parallel mining of frequent closed sequences with MapReduce. In: Proceedings of the 12th international conference on algorithms and architectures for parallel processing (ICA3PP’12), pp 177–186
Yu D, Zhu Q, Shao J et al (2014) Parallel execution of data-intensive web services based on data-flow constructs and I/O operation ratio. Int J Database Theory Appl (IJDTA’14) 7:129–138
Zaharia M, Chowdhury M, Das T et al (2012) Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proceedings of the 9th USENIX conference on networked systems design and implementation (NSDI’12), p 2
Zaharia M, Chowdhury M, Das T et al (2012) Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters. In: Proceedings of the 4th USENIX conference on hot topics in cloud computing (HotCloud’12), pp 215–224
Zaki MJ (1998) Efficient enumeration of frequent sequences. In: Proceedings of the 7th ACM international conference on information and knowledge management (CIKM’98), pp 68–75
Zaki MJ (2001) Parallel sequence mining on shared-memory machines. J Parallel Distrib Comput (JPDC’01) 61:401–426
Zhao Q, Bhowmick SS (2003) Sequential pattern matching: a survey. Technical report, CAIS, Nanyang Technological University, Singapore, pp 1–26

Author information

Authors and Affiliations

Graduate Institute of Networking and Multimedia, National Taiwan University, Taipei, Taiwan
Chun-Chieh Chen
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
Chun-Chieh Chen & Ming-Syan Chen
Department of Electrical and Computer Engineering, National Chiao Tung University, Hsinchu, Taiwan
Hong-Han Shuai
Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
Ming-Syan Chen

Corresponding author

Correspondence to Chun-Chieh Chen.

About this article

Cite this article

Chen, CC., Shuai, HH. & Chen, MS. Distributed and scalable sequential pattern mining through stream processing. Knowl Inf Syst 53, 365–390 (2017). https://doi.org/10.1007/s10115-017-1037-1

Received: 04 July 2015
Revised: 19 December 2016
Accepted: 01 March 2017
Published: 20 March 2017
Issue Date: November 2017
DOI: https://doi.org/10.1007/s10115-017-1037-1
