String clustering and statistical validation of clusters

Sebban, M.; Landraud-Lamole, A. M.

doi:10.1007/3-540-64575-6_59


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1418)

Included in the following conference series:

Conference of the Canadian Society for Computational Studies of Intelligence


Abstract

In this article we present a new string clustering algorithm and a statistical validation of the discovered clusters. The clusters are obtained by searching for structures common to the strings and grouping the strings that share wide words. A statistical test that estimates cluster homogeneity makes it possible to determine the number of classes automatically. We apply the method to 40 biological sequences of about 100 characters each, extracting key structures to build clusters and recovering the three original families.
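The grouping idea in the abstract can be illustrated with a minimal sketch. It is not the authors' algorithm: it assumes that a "wide word" can be approximated by a long contiguous substring shared by two strings, and it merges strings transitively whenever a pair shares such a substring. The function names and the `min_len` threshold are hypothetical, and the statistical homogeneity test is left out because the abstract does not specify it.

```python
from itertools import combinations


def longest_common_substring_len(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                # Extend the common run that ended at the previous characters.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster_by_shared_words(strings, min_len):
    """Group strings transitively when a pair shares a substring of
    at least min_len characters (assumption: stand-in for 'wide words')."""
    parent = list(range(len(strings)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if longest_common_substring_len(strings[i], strings[j]) >= min_len:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return list(groups.values())


if __name__ == "__main__":
    toy = ["ACGTACGT", "TTACGTAA", "GGGGCCCC", "AGGGGCCA", "TATATATA"]
    for cluster in cluster_by_shared_words(toy, min_len=5):
        print(cluster)
```

In this sketch the number of clusters depends entirely on the hand-picked `min_len`; the paper's contribution is to replace that kind of manual choice with a statistical test of cluster homogeneity.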



Author information

Authors and Affiliations

Equipe RAPID (Reconnaissance, Apprentissage et Perception Intelligente à partir de Données), UFR Sciences, Campus de Fouillole, 97159 Pointe à Pitre, France
M. Sebban & A. M. Landraud-Lamole


Editor information

Robert E. Mercer, Eric Neufeld


About this paper

Cite this paper

Sebban, M., Landraud-Lamole, A.M. (1998). String clustering and statistical validation of clusters. In: Mercer, R.E., Neufeld, E. (eds) Advances in Artificial Intelligence. Canadian AI 1998. Lecture Notes in Computer Science, vol 1418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64575-6_59


DOI: https://doi.org/10.1007/3-540-64575-6_59
Published: 29 July 2005
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64575-7
Online ISBN: 978-3-540-69349-9
eBook Packages: Springer Book Archive
