MEDITE: A Unilingual Textual Aligner

  • Conference paper
Advances in Natural Language Processing (FinTAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)

Abstract

This paper addresses a natural language text alignment problem that arises in textual genetic criticism, a humanities discipline in which different versions of a text must be compared. The paper shows that this task is hard because versions can differ greatly and because texts with many internal repetitions raise specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at the character level, whereas related applications either stay at the word level or give poor results at the character level. The detection of moved blocks, which follows from our formalism based on edit distance with moves, is introduced. The algorithm is closely related to sequence alignment in bioinformatics: it applies similar building blocks to this natural language processing task. A benchmark comparison of MEDITE with other aligners shows that our approach outperforms existing ones, especially on hard cases.
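To make the idea of character-level alignment with moved blocks concrete, here is a minimal sketch in Python. It is not MEDITE's algorithm, which the abstract does not specify in detail. It assumes the standard-library `difflib.SequenceMatcher` as the aligner and adds a naive post-processing step that relabels a deleted block and an identical inserted block as one move. The function name `diff_with_moves` and the parameter `min_len` are hypothetical, chosen only for this illustration.

```python
from difflib import SequenceMatcher


def diff_with_moves(old, new, min_len=4):
    """Character-level diff that reports identical delete/insert pairs as moves.

    Illustrative sketch only: it relies on difflib, not on MEDITE's method.
    """
    # autojunk=False disables difflib's heuristic that ignores very
    # frequent characters, which matters for character-level comparison.
    matcher = SequenceMatcher(None, old, new, autojunk=False)

    deleted, inserted = [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # A "replace" opcode counts as both a deletion and an insertion.
        if tag in ("delete", "replace"):
            deleted.append(old[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(new[j1:j2])

    # Naive move detection: a block removed from one place and inserted
    # verbatim elsewhere is treated as a single move.
    moved = []
    for block in list(deleted):
        if len(block) >= min_len and block in inserted:
            moved.append(block)
            deleted.remove(block)
            inserted.remove(block)

    return {"moved": moved, "deleted": deleted, "inserted": inserted}


if __name__ == "__main__":
    old_version = "The cat sleeps. It rains. "
    new_version = "It rains. The cat sleeps. "
    print(diff_with_moves(old_version, new_version))
```

On the two short versions above, a plain diff would report one deletion and one insertion, while this sketch reports the sentence "It rains. " as a single moved block. The exact-match test is deliberately simplistic: a moved block that was also slightly edited would not be recognised, and handling such hard cases is the kind of problem the paper addresses.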

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bourdaillet, J., Ganascia, J.G. (2006). MEDITE: A Unilingual Textual Aligner. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_46

  • DOI: https://doi.org/10.1007/11816508_46

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
