Abstract
In this article we present a new string clustering algorithm and a statistical validation of the discovered clusters. Clusters are obtained by searching for structures common to the strings and by grouping the strings that share wide words. A statistical test that estimates cluster homogeneity makes it possible to determine the number of classes automatically. We apply our method to 40 biological sequences of about 100 characters each, extracting key structures in order to build clusters, and we recover the three original families.
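The idea of grouping strings that share a long common word can be illustrated with a minimal sketch. This is not the algorithm of the paper, whose details are not given in the abstract: it assumes that a "word" is a contiguous substring, that two strings are linked when their longest common substring reaches a hypothetical threshold `min_len`, and that clusters are the connected components of those links. The function names are illustrative, and the statistical homogeneity test is not reproduced here.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> int:
    """Length of the longest contiguous substring shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        # curr[j] = length of the common suffix ending at ch_a and b[j - 1]
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def cluster_by_shared_words(strings, min_len):
    """Group strings linked, directly or transitively, by a shared
    substring of at least min_len characters (assumed criterion)."""
    parent = list(range(len(strings)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(strings)), 2):
        if longest_common_substring(strings[i], strings[j]) >= min_len:
            parent[find(i)] = find(j)

    groups = {}
    for i, s in enumerate(strings):
        groups.setdefault(find(i), []).append(s)
    return sorted(groups.values())


# Toy sequences, invented for illustration only.
sequences = ["ACGTACGT", "TTACGTAA", "GGGCCCGG", "AGGCCCTA", "TTTTTTTT"]
clusters = cluster_by_shared_words(sequences, min_len=5)
```

In this sketch the number of clusters depends entirely on the hand-picked `min_len`; the abstract describes replacing that kind of manual choice with a statistical test of cluster homogeneity.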
Copyright information
© 1998 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sebban, M., Landraud-Lamole, A.M. (1998). String clustering and statistical validation of clusters. In: Mercer, R.E., Neufeld, E. (eds) Advances in Artificial Intelligence. Canadian AI 1998. Lecture Notes in Computer Science, vol 1418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64575-6_59
DOI: https://doi.org/10.1007/3-540-64575-6_59
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-64575-7
Online ISBN: 978-3-540-69349-9
eBook Packages: Springer Book Archive