CONTOUR: an efficient algorithm for discovering discriminating subsequences

Data Mining and Knowledge Discovery

Abstract

In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, only a subset of discriminating frequent subsequences is of interest, not the complete set. One way to discover this subset is to apply an existing frequent sequence mining algorithm to find the complete set of frequent subsequences and then identify the interesting subsequences among them. Unfortunately, mining the complete set of frequent subsequences is very time consuming for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which directly and efficiently mines a subset of high-quality subsequences for clustering the input sequences. We focus mainly on designing effective search space pruning methods to accelerate the mining process, and we discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR and the accuracy of the frequent subsequence-based clustering algorithm.
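The two-step baseline that the abstract contrasts with CONTOUR (mine every frequent subsequence first, then pick a subset) can be illustrated with a minimal sketch. This is not the CONTOUR algorithm, whose pruning methods are not described in the abstract. The function names, the prefix-extension enumeration, and the length-based `select_patterns` filter are all assumptions made for illustration; in particular, the filter is only a stand-in for whatever "discriminating" criterion an application would use.

```python
def is_subsequence(pattern, sequence):
    """Return True if pattern occurs in sequence in order (gaps allowed)."""
    it = iter(sequence)
    # Each membership test consumes the iterator up to the matched symbol,
    # so later symbols can only match further to the right.
    return all(symbol in it for symbol in pattern)


def support(pattern, database):
    """Number of input sequences that contain pattern as a subsequence."""
    return sum(1 for seq in database if is_subsequence(pattern, seq))


def frequent_subsequences(database, min_sup, max_len):
    """Step 1 of the baseline: enumerate ALL frequent subsequences.

    Patterns are grown one symbol at a time from frequent prefixes.
    An infrequent pattern is never extended, because any extension
    contains it and so cannot have higher support.
    """
    alphabet = sorted({symbol for seq in database for symbol in seq})
    frequent = {}
    level = [(a,) for a in alphabet]
    for _ in range(max_len):
        next_level = []
        for pattern in level:
            sup = support(pattern, database)
            if sup >= min_sup:
                frequent[pattern] = sup
                next_level.extend(pattern + (a,) for a in alphabet)
        level = next_level
    return frequent


def select_patterns(frequent, min_len):
    """Step 2 of the baseline: post-filter the complete set.

    Hypothetical placeholder criterion (pattern length), used here only
    to show that selection happens after the full mining pass.
    """
    return {p: s for p, s in frequent.items() if len(p) >= min_len}


if __name__ == "__main__":
    db = ["abcd", "abd", "acd", "bd"]  # toy sequence database
    complete = frequent_subsequences(db, min_sup=3, max_len=3)
    print(complete)
    print(select_patterns(complete, min_len=2))
```

Even on this toy input, the first step evaluates every extension of every frequent prefix before any selection takes place, which is the cost the abstract attributes to mining the complete set on large sequence databases.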

Author information

Corresponding author

Correspondence to Jianyong Wang.

Additional information

Responsible editor: M. J. Zaki.

About this article

Cite this article

Wang, J., Zhang, Y., Zhou, L. et al. CONTOUR: an efficient algorithm for discovering discriminating subsequences. Data Min Knowl Disc 18, 1–29 (2009). https://doi.org/10.1007/s10618-008-0100-7

  • DOI: https://doi.org/10.1007/s10618-008-0100-7
