A general measure of similarity for categorical sequences

Kelil, Abdellali; Wang, Shengrui; Jiang, Qingshan; Brzezinski, Ryszard

doi:10.1007/s10115-009-0237-8

Regular Paper
Published: 06 August 2009

Volume 24, pages 197–220, (2010)

Knowledge and Information Systems

Abdellali Kelil^1,4, Shengrui Wang^1,4, Qingshan Jiang^2 & Ryszard Brzezinski^3,4

Abstract

Measuring the similarity between categorical sequences is a fundamental process in many data mining applications. A key issue is extracting and making use of significant features hidden behind the chronological and structural dependencies found in these sequences. Almost all existing algorithms designed to perform this task are based on the matching of patterns in chronological order, even though such sequences often share similar structural features that appear in a different chronological order. In this paper we propose SCS, a novel, effective and domain-independent method for measuring the similarity between categorical sequences, based on an original pattern matching scheme that makes it possible to capture both chronological and non-chronological dependencies. SCS captures significant patterns that represent the natural structure of sequences, and reduces the influence of those which are merely noise. It constitutes an effective approach to measuring the similarity between data in the form of categorical sequences, such as biological sequences, natural language texts, speech recognition data, certain types of network transactions, and retail transactions. To show its effectiveness, we have tested SCS extensively on a range of data sets from different application fields, and compared the results with those obtained by various mainstream algorithms. The experiments show that SCS produces results that are often competitive with those of domain-specific similarity approaches.
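
To make the contrast between chronological and non-chronological matching concrete, the following minimal Python sketch compares two generic baselines on a pair of sequences whose halves are swapped. It is not the SCS method, whose details are not given in this abstract; the function names `edit_similarity` and `ngram_cosine`, the bigram length, and the toy sequences are all assumptions chosen for illustration.

```python
from collections import Counter
from math import sqrt


def edit_similarity(a, b):
    """Order-sensitive baseline: 1 - Levenshtein distance / max length."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            # deletion, insertion, substitution (or match)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def ngram_counts(seq, n):
    """Count every contiguous n-gram of the sequence."""
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))


def ngram_cosine(a, b, n=2):
    """Order-insensitive baseline: cosine similarity of n-gram count vectors."""
    ca, cb = ngram_counts(a, n), ngram_counts(b, n)
    dot = sum(ca[g] * cb[g] for g in ca if g in cb)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0


# Hypothetical toy data: the same two blocks in swapped order.
s1 = "ABCDEFGH"
s2 = "EFGHABCD"

print(edit_similarity(s1, s2))  # low: the alignment is penalized for the swap
print(ngram_cosine(s1, s2))     # high: most bigrams are shared regardless of position
```

The edit-distance score treats the swapped blocks as almost entirely different, while the n-gram score recognizes the shared local patterns. This is the kind of gap between order-dependent and order-independent matching that the abstract describes; how SCS itself bridges it is specified in the full paper, not here.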

Author information

Authors and Affiliations

ProspectUS Laboratory, Department of Computer Science, University of Sherbrooke, Sherbrooke, QC, Canada
Abdellali Kelil & Shengrui Wang
School of Software, Xiamen University, 361005, Xiamen, China
Qingshan Jiang
Microbiology and Biotechnology Laboratory, Department of Biology, University of Sherbrooke, Sherbrooke, QC, Canada
Ryszard Brzezinski
Faculty of Sciences, University of Sherbrooke, Sherbrooke, QC, J1H 3Z3, Canada
Abdellali Kelil, Shengrui Wang & Ryszard Brzezinski

Corresponding author

Correspondence to Abdellali Kelil.

Additional information

Qingshan Jiang was supported by the National Natural Science Foundation of China.

Shengrui Wang and Ryszard Brzezinski were supported by research grants from the Natural Sciences and Engineering Research Council of Canada.

About this article

Cite this article

Kelil, A., Wang, S., Jiang, Q. et al. A general measure of similarity for categorical sequences. Knowl Inf Syst 24, 197–220 (2010). https://doi.org/10.1007/s10115-009-0237-8

Received: 31 December 2008
Revised: 09 April 2009
Accepted: 09 May 2009
Published: 06 August 2009
Issue Date: August 2010
DOI: https://doi.org/10.1007/s10115-009-0237-8
