CONTOUR: an efficient algorithm for discovering discriminating subsequences

Wang, Jianyong; Zhang, Yuzhou; Zhou, Lizhu; Karypis, George; Aggarwal, Charu C.

doi:10.1007/s10618-008-0100-7

Published: 30 May 2008

Volume 18, pages 1–29 (2009)

Data Mining and Knowledge Discovery

Abstract

In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set but only a subset of discriminating frequent subsequences which is of interest. One approach to discovering the subset of useful frequent subsequences is to apply any existing frequent sequence mining algorithm to find the complete set of frequent subsequences. Then, a subset of interesting subsequences can be further identified. Unfortunately, it is very time consuming to mine the complete set of frequent subsequences for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently mines a subset of high-quality subsequences directly in order to cluster the input sequences. We mainly focus on how to design some effective search space pruning methods to accelerate the mining process and discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.
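
The two-step baseline that the abstract describes as too slow for large databases (mine every frequent subsequence first, then pick a subset) can be sketched as follows. This is only an illustration of that baseline, not of CONTOUR itself. The function names, the toy database, and the selection rule (keep, for each input sequence, the longest frequent subsequence it contains) are all assumptions introduced here, since the abstract does not define the quality measure used in the paper.

```python
def is_subsequence(pattern, sequence):
    """Return True if `pattern` occurs in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    # `item in it` advances the iterator, so matches must appear in order.
    return all(item in it for item in pattern)


def mine_frequent_subsequences(database, min_support):
    """Step 1 (baseline): enumerate ALL frequent subsequences level by level.

    A pattern is grown only from a frequent prefix, which is safe because
    every prefix of a frequent subsequence is itself frequent.
    """
    alphabet = sorted({item for seq in database for item in seq})
    frequent = {}
    frontier = [()]
    while frontier:
        next_frontier = []
        for prefix in frontier:
            for item in alphabet:
                candidate = prefix + (item,)
                support = sum(is_subsequence(candidate, s) for s in database)
                if support >= min_support:
                    frequent[candidate] = support
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent


def select_per_sequence(database, frequent):
    """Step 2 (hypothetical rule): for each input sequence keep the longest
    frequent subsequence it contains, breaking ties by higher support."""
    chosen = {}
    for index, seq in enumerate(database):
        covering = [p for p in sorted(frequent) if is_subsequence(p, seq)]
        chosen[index] = max(
            covering, key=lambda p: (len(p), frequent[p]), default=None
        )
    return chosen


# Toy database (made up for illustration); each string is one sequence.
toy_db = ["abcd", "abd", "acd", "xyz"]
all_frequent = mine_frequent_subsequences(toy_db, min_support=2)
per_sequence = select_per_sequence(toy_db, all_frequent)
```

Even on this toy input, step 1 tests every one-item extension of every frequent pattern against the whole database, while step 2 keeps only a handful of patterns. That gap between what is enumerated and what is finally used is the cost the abstract says CONTOUR avoids by mining the useful subset directly.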

Author information

Authors and Affiliations

Tsinghua University, Beijing, 100084, China
Jianyong Wang, Yuzhou Zhang & Lizhu Zhou
University of Minnesota, Minneapolis, MN, 55455, USA
George Karypis
IBM T.J. Watson Research Center, Hawthorne, NY, 10532, USA
Charu C. Aggarwal

Corresponding author

Correspondence to Jianyong Wang.

Additional information

Responsible editor: M. J. Zaki.

About this article

Cite this article

Wang, J., Zhang, Y., Zhou, L. et al. CONTOUR: an efficient algorithm for discovering discriminating subsequences. Data Min Knowl Disc 18, 1–29 (2009). https://doi.org/10.1007/s10618-008-0100-7

Received: 13 October 2007
Accepted: 12 May 2008
Published: 30 May 2008
Issue Date: February 2009
DOI: https://doi.org/10.1007/s10618-008-0100-7
