Abstract
Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and exploiting the significant features hidden in the chronological and structural dependencies of these sequences. Almost all existing algorithms designed for this task match patterns in chronological order, yet sequences often share structural features that occur in a different chronological order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that captures both chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those that are merely noise. It applies to data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To demonstrate its effectiveness, we tested SCS extensively on a range of data sets from different application fields and compared the results with those obtained by various mainstream algorithms. The results show that SCS is often competitive with domain-specific similarity approaches.
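The contrast between matching in chronological order and detecting structure shared in a different order can be illustrated with two standard baselines. The sketch below is not SCS, whose matching scheme is not described in this abstract; it only compares an order-sensitive measure (Levenshtein edit distance) with an order-insensitive one (cosine similarity over n-gram counts). The function names, the example sequences and the choice of n = 3 are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def levenshtein(a: str, b: str) -> int:
    """Edit distance: an order-sensitive (chronological) comparison."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def ngram_counts(seq: str, n: int = 3) -> Counter:
    """Count every contiguous pattern of length n, ignoring its position."""
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def ngram_cosine(a: str, b: str, n: int = 3) -> float:
    """Cosine similarity of n-gram count vectors (order-insensitive)."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(ca[g] * cb[g] for g in ca.keys() & cb.keys())
    norm = (sqrt(sum(v * v for v in ca.values()))
            * sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


# Hypothetical sequences: y holds the same two blocks as x, in swapped order.
x = "ABCDEFGHIJ"
y = "FGHIJABCDE"

print("edit distance:", levenshtein(x, y))
print("3-gram cosine:", ngram_cosine(x, y))
```

On this pair the edit distance equals the full sequence length, as if the sequences had nothing in common, while the n-gram measure still credits the two shared blocks. A measure that captures non-chronological dependencies is aimed at cases of this kind.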
Additional information
Qingshan Jiang was supported by the National Natural Science Foundation of China.
Shengrui Wang and Ryszard Brzezinski were supported by research grants from the Natural Sciences and Engineering Research Council of Canada.
About this article
Cite this article
Kelil, A., Wang, S., Jiang, Q. et al. A general measure of similarity for categorical sequences. Knowl Inf Syst 24, 197–220 (2010). https://doi.org/10.1007/s10115-009-0237-8
DOI: https://doi.org/10.1007/s10115-009-0237-8