A general measure of similarity for categorical sequences

  • Regular Paper
Knowledge and Information Systems

Abstract

Measuring the similarity between categorical sequences is a fundamental task in many data mining applications. A key issue is extracting and exploiting the significant features hidden in the chronological and structural dependencies of these sequences. Almost all existing algorithms for this task match patterns in chronological order, yet such sequences often share structural features that occur in a different chronological order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern-matching scheme that captures both chronological and non-chronological dependencies. SCS captures the significant patterns that represent the natural structure of sequences and reduces the influence of patterns that are merely noise. It applies to data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To show its effectiveness, we tested SCS extensively on a range of data sets from different application fields and compared it with various mainstream algorithms. The results show that SCS is often competitive with domain-specific similarity approaches.
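
The distinction the abstract draws between chronological and non-chronological matching can be illustrated with a small sketch. This is not the SCS method, whose details are not given on this page. It is an assumed toy comparison between a strictly positional match and a generic n-gram profile cosine, and the function names `ngram_profile`, `ngram_cosine` and `positional_identity` are hypothetical.

```python
from collections import Counter
from math import sqrt


def ngram_profile(seq, n=2):
    """Count every contiguous n-gram in a sequence of categorical symbols."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def ngram_cosine(a, b, n=2):
    """Cosine similarity of n-gram count profiles (ignores where patterns occur).

    Illustrative stand-in only; not the SCS measure from the paper.
    """
    pa, pb = ngram_profile(a, n), ngram_profile(b, n)
    dot = sum(count * pb[gram] for gram, count in pa.items())
    norm = sqrt(sum(c * c for c in pa.values())) * sqrt(sum(c * c for c in pb.values()))
    return dot / norm if norm else 0.0


def positional_identity(a, b):
    """Fraction of positions holding the same symbol (strictly chronological)."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


x = "ABCDEFGH"
y = "EFGHABCD"  # the same two blocks as x, in swapped order

print(positional_identity(x, y))  # no position agrees
print(ngram_cosine(x, y))         # six of seven bigrams are shared
```

In this toy case the positional score treats the two sequences as unrelated, while the order-independent profile still sees most of their shared structure, which is the kind of gap the abstract describes.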

Author information

Correspondence to Abdellali Kelil.

Additional information

Qingshan Jiang was supported by the National Natural Science Foundation of China.

Shengrui Wang and Ryszard Brzezinski were supported by research grants from the Natural Sciences and Engineering Research Council of Canada.

Cite this article

Kelil, A., Wang, S., Jiang, Q. et al. A general measure of similarity for categorical sequences. Knowl Inf Syst 24, 197–220 (2010). https://doi.org/10.1007/s10115-009-0237-8

  • DOI: https://doi.org/10.1007/s10115-009-0237-8
