Abstract
In recent years we have witnessed several applications of frequent sequence mining, such as feature selection for protein sequence classification and mining block correlations in storage systems. In typical applications such as clustering, it is not the complete set of frequent subsequences that is of interest, but only a subset of discriminating ones. One approach to discovering this subset is to apply an existing frequent sequence mining algorithm to find the complete set of frequent subsequences, and then to identify the interesting subsequences among them. Unfortunately, mining the complete set of frequent subsequences is very time consuming for large sequence databases. In this paper, we propose a new algorithm, CONTOUR, which efficiently and directly mines a subset of high-quality subsequences in order to cluster the input sequences. We focus mainly on designing effective search space pruning methods to accelerate the mining process, and we discuss how to construct an accurate clustering algorithm based on the result of CONTOUR. We conducted an extensive performance study to evaluate the efficiency and scalability of CONTOUR, and the accuracy of the frequent subsequence-based clustering algorithm.
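The two-step baseline mentioned in the abstract (mine every frequent subsequence, then pick out the interesting ones) can be illustrated with a minimal sketch. This is not CONTOUR and not any algorithm from the paper: the function names, the toy database, the `max_len` cap, and the length-based post-filter are all assumptions made purely for illustration.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def support(pattern, database):
    """Number of sequences in the database that contain pattern."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def frequent_subsequences(database, min_sup, max_len):
    """Step 1 (baseline): enumerate all frequent subsequences up to max_len.

    Only frequent prefixes are extended, since extending an infrequent
    pattern can never produce a frequent one.
    """
    alphabet = sorted({item for seq in database for item in seq})
    frequent = {}
    frontier = [()]
    for _ in range(max_len):
        next_frontier = []
        for prefix in frontier:
            for item in alphabet:
                candidate = prefix + (item,)
                sup = support(candidate, database)
                if sup >= min_sup:
                    frequent[candidate] = sup
                    next_frontier.append(candidate)
        frontier = next_frontier
    return frequent


# Hypothetical toy database: each string is one input sequence.
db = ["abcd", "abd", "acd", "bce"]
all_frequent = frequent_subsequences(db, min_sup=3, max_len=2)

# Step 2 (baseline): post-filter. The criterion here (length >= 2) is a
# placeholder assumption, not the quality measure used by the paper.
selected = {p: s for p, s in all_frequent.items() if len(p) >= 2}
print(selected)
```

Even on this toy input, step 1 computes support for every extension of every frequent prefix before step 2 discards most of the results, which is the cost the abstract says becomes prohibitive on large sequence databases.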
Additional information
Responsible editor: M. J. Zaki.
Cite this article
Wang, J., Zhang, Y., Zhou, L. et al. CONTOUR: an efficient algorithm for discovering discriminating subsequences. Data Min Knowl Disc 18, 1–29 (2009). https://doi.org/10.1007/s10618-008-0100-7
DOI: https://doi.org/10.1007/s10618-008-0100-7