Abstract
This paper addresses a natural language text alignment problem arising in textual genetic criticism, a humanities discipline in which different versions of a text must be compared. The paper shows that this task is hard because versions can differ greatly and because texts with many internal repetitions present specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at the character level, unlike related applications, which either remain at the word level or give poor results at the character level. The paper also introduces the detection of moved blocks of text, which follows from our formalism based on edit distance with moves. The algorithm is closely related to sequence alignment in bioinformatics: it uses similar building blocks and applies them to this natural language processing task. A benchmark analysis comparing MEDITE with other aligners shows that our approach outperforms existing ones, especially in hard cases.
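As a rough illustration of what character-level alignment with moved-block detection involves, the following Python sketch diffs two strings and then looks for deleted text that reappears among the inserted text. It is not the MEDITE algorithm: it relies on the standard library's difflib matcher rather than on an edit distance with moves, and the function name align_with_moves and the parameter min_move_len are hypothetical names introduced here for illustration only.

```python
from difflib import SequenceMatcher


def align_with_moves(a, b, min_move_len=4):
    """Character-level diff of a and b with naive moved-block detection.

    Hypothetical helper for illustration; not the MEDITE algorithm.
    """
    # autojunk=False disables difflib's heuristic that ignores frequent
    # characters, which matters for texts with many repetitions.
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    common, deleted, inserted = [], [], []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            common.append(a[i1:i2])
        if tag in ("delete", "replace"):
            deleted.append(a[i1:i2])
        if tag in ("insert", "replace"):
            inserted.append(b[j1:j2])

    # Assumption: a block counts as "moved" when a deleted span and an
    # inserted span share a substring of at least min_move_len characters.
    moved = []
    for d in deleted:
        best = ""
        for ins in inserted:
            m = SequenceMatcher(None, d, ins, autojunk=False).find_longest_match(
                0, len(d), 0, len(ins)
            )
            if m.size > len(best):
                best = d[m.a:m.a + m.size]
        if len(best) >= min_move_len:
            moved.append(best)

    return {"common": common, "deleted": deleted,
            "inserted": inserted, "moved": moved}


result = align_with_moves("abcdefghij-MOVED", "MOVED-abcdefghij")
print(result["moved"])  # ['MOVED']
```

A plain diff would report the trailing block as one deletion and one unrelated insertion; pairing the two is what turns them into a single move. The threshold min_move_len is a guess at how short repeated fragments might be kept from being reported as moves, which is one of the difficulties the abstract attributes to texts with many internal repetitions.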
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Bourdaillet, J., Ganascia, JG. (2006). MEDITE: A Unilingual Textual Aligner. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11816508_46
DOI: https://doi.org/10.1007/11816508_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-37334-6
Online ISBN: 978-3-540-37336-0
eBook Packages: Computer Science, Computer Science (R0)