DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences

Liao, Vance Chiang-Chi; Chen, Ming-Syan

doi:10.1007/s10115-012-0602-x

DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences

Regular Paper
Published: 26 January 2013

Volume 38, pages 623–639, (2014)
Cite this article

Knowledge and Information Systems Aims and scope Submit manuscript

Vance Chiang-Chi Liao¹ &
Ming-Syan Chen¹

630 Accesses
21 Citations
Explore all metrics

Abstract

Scientific progress in recent years has led to the generation of huge amounts of biological data, most of which remains unanalyzed. Mining the data may provide insights into various realms of biology, such as finding co-occurring biosequences, which are essential for biological data mining and analysis. Data mining techniques like sequential pattern mining may reveal implicitly meaningful patterns among the DNA or protein sequences. If biologists hope to unlock the potential of sequential pattern mining in their field, it is necessary to move away from traditional sequential pattern mining algorithms, because they have difficulty handling a small number of items and long sequences in biological data, such as gene and protein sequences. To address the problem, we propose an approach called Depth-First SPelling (DFSP) algorithm for mining sequential patterns in biological sequences. The algorithm’s processing speed is faster than that of PrefixSpan, its leading competitor, and it is superior to other sequential pattern mining algorithms for biological sequences.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

References

Achar A, Laxman S, Sastry PS (2011) A unified view of the apriori-based algorithms for frequent episode discovery. Knowl Inf Syst 31: 223–250
Google Scholar
Agrawal R and Srikant R (1995) Mining sequential patterns. Proceedings of the 11th international conference on data, engineering, pp 3–14
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215(3):403–410
Google Scholar
Aseervatham S, Osmani A, Viennet E (2006) bitSPADE: A lattice-based sequential pattern mining algorithm using bitmap representation. In: Proceedings of 6th international conference on data mining, pp 792–797
Ayres J, Gehrke J, Yiu T, Flannick J (2002), Sequential pattern mining using a bitmap representation. In: Proceedings of 8th ACM SIGKDD international conference on knowledge discovery and data mining, pp 429–435, July 2002
Bajcsy P, Han J, Liu L, Young J (2004) Survey of biodata analysis from a data mining perspective. Wang JTL, Zaki MJ, Toivonen HTT, and Shasha D (eds) Data Mining in Bioinformatics, Chapter 2. Springer, pp 9–39
BLAST, http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/
Chen M-S, Han J, Yu PS (1996) Data mining: an overview from database perspective. IEEE Trans Knowl Data Eng 5(1):866–883
Article Google Scholar
Chen Y-C, Peng W-C, Lee S-Y (2011) CEMiner—An efficient algorithms for mining closed patterns from interval-based data. In: Proceedings of the 11th IEEE international conference on data mining (ICDM). Vancouver, Canada, pp 121–130, Dec 11–14
Cheng H, Yan X, Han J (2004) Incspan: incremental mining of sequential patterns in large database. In: Proceedings of 10th ACM SIGKDD international conference on knowledge discovery and data mining, pp 527–532
Han J (2002) How can data mining help bio-data analysis? In: Proceedings of the workshop on data mining in bioinformatics (BIOKDD’02 with SIGKDD’02 conference. Edmonton, Canada), pp 1–4
Han J, Kamber M (2000) Data mining: concepts and techniques. Morgan Kaufmann, San Francisco
Google Scholar
Han J, Pei J, Mortazavi-Asl B, Chen Q, Dayal U, Hsu MC (2000) Freespan: frequent pattern-projected sequential pattern mining. In: Proceedings of the 6th ACM SIGKDD international conference on knowledge discovery and data mining, pp 355–359
Han J, Pei J, Yan X (2005) Sequential pattern mining by pattern-growth: principles and extensions. In: Chu WW, lIN TY (eds) Recent advances in data mining and granular computing. Springer, Berlin, pp 183–220
Hirosawa M, Totoki Y, Hoshida M, Ishikawa M (1995) Comprehensive study on iterative algorithms of multiple sequence alignment. Bioinformatics 11:13–18
Article Google Scholar
Ho C-C, Li H-F, Kuo F-F, Lee S-Y (2006) Incremental mining of sequential patterns over a stream sliding window. In: Proceedings of IEEE international workshop on mining evolving and streaming data (IWMESD-2006), pp 677–681, Dec 2006
Hsu C-M, Chen C-Y, Liu B-J (2006) MAGIIC-PRO: detecting functional signatures by efficient discovery of long patterns in protein sequences. Nucleic Acids Res, W356–W361
Hsu C-M, Chen CY, Liu BJ, Huang CC, Laio MH, Lin CC, Wu TL (2007) Identification of hot regions in protein-protein interactions by sequential pattern mining. BMC Bioinform 8(Suppl. 5):S8. doi:10.1186/1471-2105-8-S5-S8
Article Google Scholar
Huang J-W, Tseng C-Y, Ou J-C, Chen M-S (2008) A General Model for Sequential Pattern Mining with a Progressive Database. IEEE Trans Knowl Data Eng, 20: 1153–1167, 11 Feb 2008
Google Scholar
Jones N, Pevzner P (2004) An introduction to bioinformatics algorithms. MIT Press, Cambridge
Google Scholar
Lin M-Y, Lee S-Y (2004) Incremental update on sequential patterns in large databases by implicit merging and efficient counting. Inf Syst 29(5):385–404
Article MathSciNet Google Scholar
Luo C, Chung SM (2005) Efficient mining of maximal sequential patterns using multiple samples. In: Proceedings of the 5th SIAM international conference on data mining (SDM’05), pp 415–426
Mampaey M, Tatti N, Vreeken J (2011) Tell me what I need to know: succinctly summarizing data with itemsets. In: Proceedings of 17th ACM SIGKDD international conference on knowledge discovery and data mining, pp 573–581
Marascu A, Masseglia F (2006) Mining sequential patterns from data streams: a centroid approach. J Intell Inf Syst (JIIS) 27(3):291–307
Google Scholar
Nguyen S, Sun X, Orlowska M (2005) Improvements of IncSpan: incremental mining of sequential patterns in large database. In: Proceedings of the 9th Pacific-Asia conference on knowledge discovery and data mining, pp 442–451
Parthasarathy S, Zaki MJ, Ogihara M, Dwarkadas S (1999) Incremental and interactive sequence mining. In: Proceedings of the 8th international conference on information and, knowledge management, pp 251–258
Pei J, Han J, Asl B, Wang J, Pinto H, Chen Q, Dayal U, Hsu M (2004) Mining sequential patterns by pattern-growth: the PrefixSpan approach. IEEE Trans Knowl Data Eng, pp 1424–1440, Oct 2004
Pei J, Han J, Lu H, Nishio S, Tang S, Yang D (2001) H-mine: hyper-structure mining of frequent patterns in large databases. In: Proceedings of the 2001 IEEE international conference on data mining (ICDM’01), San Jose, California, pp 441–448, Nov 29–Dec 2
Pei J, Han J, Mortazavi-Asl B, Zhu H (2000) Mining access patterns efficiently from Web logs. In: Proceedings of the 2000 Pacific-Asia conference on knowledge discovery and data mining (PAKDD’00). Kyoto, Japan, April, pp 396–407
Pei J, Han J, Wang W (2007) Constraint-based sequential pattern mining: the pattern-growth methods. J Intell Inf Syst 28(2):133–160
Google Scholar
Rajpathak D, Chougule R, Bandyopadhyay P (2011) A domain-specific decision support system for knowledge discovery using association and text mining. Knowl Inf Syst, pp 405–432
Salam A, Sikandar Hayat Khayal M (2011) Mining top-k frequent patterns without minimum support threshold. Knowl Inf Syst, pp 57–86
Senkul P, Salin S (2011) Improving pattern quality in web usage mining by using semantic information. Knowl Inf Syst, pp 527–541
Srikant R and Agrawal R (1996) Mining sequential patterns: generalizations and performance improvements. In: Proceedings of 5th international conference extending database technology (EDBT), vol 1057, Springer, pp 3–17
Wang J, Han J (2004) Bide: efficient mining of frequent closed sequences. In: Proceedings of the 20th international conference on data, engineering, pp 79–91
Wang X, Li G, Jiang G, Shi Z (2011) Semantic trajectory-based event detection and event pattern mining. Knowl Inf Syst, pp 1–25
Wang K, Xu Y, Yu J (2004) Scalable sequential pattern mining for biological sequences. In: Proceedings of conference information and, knowledge management, pp 178–187
Yan X, Han J, Afshar R (2003) Clospan: Mining closed sequential patterns in large datasets. In: Proceedings of the 3rd SIAM international conference on data mining (SDM’03), pp 166–177, May 2003
Yang J, Wang W, Yu PS, Han J (2002) Mining long sequential patterns in a noisy environment, In: Proceedings 2002 ACM-SIGMOD I international conference. Management of data (SIGMOD ’02), pp 406–417, June 2002
Zaki MJ (1998) Efficient enumeration of frequent sequences. In: Proceedings of the 7th international conference on information and knowledge management, pp 68–75, Nov 1998
Zaki MJ (2001) SPADE: an efficient algorithm for mining frequent sequences. Machine learning 42(1/2):31–60
Article MATH Google Scholar

Download references

Author information

Authors and Affiliations

Department of Electrical Engineering, National Taiwan University, Taipei, Taiwan
Vance Chiang-Chi Liao & Ming-Syan Chen

Authors

Vance Chiang-Chi Liao
View author publications
You can also search for this author in PubMed Google Scholar
Ming-Syan Chen
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vance Chiang-Chi Liao.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Liao, V.CC., Chen, MS. DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences. Knowl Inf Syst 38, 623–639 (2014). https://doi.org/10.1007/s10115-012-0602-x

Download citation

Received: 13 July 2011
Revised: 19 August 2012
Accepted: 19 December 2012
Published: 26 January 2013
Issue Date: March 2014
DOI: https://doi.org/10.1007/s10115-012-0602-x

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences

Abstract

Access this article

Similar content being viewed by others

A new fast technique for pattern matching in biological sequences

Applications of Concurrent Sequential Patterns in Protein Data Mining

Strict pattern matching under non-overlapping condition

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Navigation

DFSP: a Depth-First SPelling algorithm for sequential pattern mining of biological sequences

Abstract

Access this article

Similar content being viewed by others

A new fast technique for pattern matching in biological sequences

Applications of Concurrent Sequential Patterns in Protein Data Mining

Strict pattern matching under non-overlapping condition

References

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation