String clustering and statistical validation of clusters

  • Posters
  • Conference paper
Advances in Artificial Intelligence (Canadian AI 1998)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 1418)

Abstract

In this article we present a new string clustering algorithm and a statistical validation of the discovered clusters. The clusters are obtained by searching for structures common to the strings and by grouping the strings that share long common words. A statistical test that estimates cluster homogeneity makes it possible to determine the number of classes automatically. We apply the method to 40 biological sequences of about 100 characters each: key structures are extracted to build the clusters, and the three original families are recovered.
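The abstract gives no algorithmic detail, so the following is only an illustrative sketch of the general idea of grouping strings that share a long common word; it is not the authors' algorithm. It assumes that a "word" is a contiguous substring, that two strings belong together when their longest common substring reaches a hypothetical threshold `min_len`, and that groups are merged by single linkage. The function names are invented for this example, and the statistical homogeneity test is not modelled.

```python
from itertools import combinations


def longest_common_substring(a: str, b: str) -> str:
    """Return one longest contiguous substring shared by a and b."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)  # match lengths for the previous row of the DP table
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at a[i-2], b[j-2].
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len, best_end = curr[j], i
        prev = curr
    return a[best_end - best_len:best_end]


def cluster_by_shared_words(strings, min_len):
    """Group strings linked by a common substring of length >= min_len.

    min_len is a hypothetical threshold chosen by the caller; linking is
    transitive (single linkage), implemented with a small union-find.
    """
    parent = list(range(len(strings)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(len(strings)), 2):
        if len(longest_common_substring(strings[i], strings[j])) >= min_len:
            parent[find(i)] = find(j)

    groups = {}
    for idx, s in enumerate(strings):
        groups.setdefault(find(idx), []).append(s)
    return sorted(groups.values())


if __name__ == "__main__":
    # Made-up toy sequences, not data from the paper.
    toy = ["ACGTACGT", "TTACGTAA", "GGGGCCCC", "AGGGGCCA", "TATATATA"]
    for group in cluster_by_shared_words(toy, min_len=5):
        print(group)
```

In this sketch the number of groups depends entirely on the hand-picked threshold, whereas the paper reports using a statistical test of cluster homogeneity to determine the number of classes automatically.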

Editor information

Robert E. Mercer, Eric Neufeld

Copyright information

© 1998 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Sebban, M., Landraud-Lamole, A.M. (1998). String clustering and statistical validation of clusters. In: Mercer, R.E., Neufeld, E. (eds) Advances in Artificial Intelligence. Canadian AI 1998. Lecture Notes in Computer Science, vol 1418. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-64575-6_59

  • DOI: https://doi.org/10.1007/3-540-64575-6_59

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-64575-7

  • Online ISBN: 978-3-540-69349-9

  • eBook Packages: Springer Book Archive
