Abstract
Aligned parallel corpora are important linguistic resources, useful in many text processing tasks such as machine translation, word sense disambiguation, and dictionary compilation. Nevertheless, few resources of this type are available, especially for fiction texts, because the texts are difficult to collect and manual alignment is costly. In this paper, we describe an automatically aligned English-Spanish parallel corpus of fiction texts and evaluate our alignment method, which uses linguistic data, namely existing bilingual dictionaries, to calculate word similarity. The method is based on a simple idea: if a meaningful word is present in the source text, then one of its dictionary translations should be present in the target text. Experimental results of alignment at the paragraph level are described.
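The idea stated in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation or their exact similarity measure: the toy dictionary, the stop-word list standing in for "non-meaningful" words, the tokenizer, and the function names are all assumptions made for illustration, and no morphological analysis is performed. The score is simply the fraction of content words in the source paragraph that have at least one dictionary translation in the target paragraph.

```python
import re

# Hypothetical toy English-to-Spanish dictionary (an assumption for
# illustration; a real bilingual dictionary would be far larger).
TOY_DICTIONARY = {
    "house": {"casa", "hogar"},
    "dog": {"perro"},
    "old": {"viejo", "antiguo"},
}

# Hypothetical stop list approximating words that are not "meaningful".
STOP_WORDS = {"the", "a", "in", "of", "and"}


def tokenize(text):
    """Lowercase the text and return its alphabetic tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def dictionary_similarity(source_paragraph, target_paragraph,
                          dictionary=TOY_DICTIONARY):
    """Fraction of source content words with a translation in the target."""
    target_words = set(tokenize(target_paragraph))
    content_words = [w for w in tokenize(source_paragraph)
                     if w not in STOP_WORDS]
    if not content_words:
        return 0.0
    # A word counts as matched if any of its dictionary translations
    # occurs among the target paragraph's tokens.
    matched = sum(1 for w in content_words
                  if dictionary.get(w, set()) & target_words)
    return matched / len(content_words)


if __name__ == "__main__":
    english = "the old dog in the house"
    print(dictionary_similarity(english, "el perro viejo en la casa"))
    print(dictionary_similarity(english, "la mesa es grande"))
```

In a paragraph aligner, a score of this kind would be computed for candidate paragraph pairs and the best-scoring pairs kept. Words missing from the dictionary count as unmatched in this sketch, which is one of several possible design choices.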
Work done under partial support of the Mexican Government (CONACyT, SNI) and the National Polytechnic Institute, Mexico (SIP, COFAA, PIFI).
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Gelbukh, A., Sidorov, G., Vera-Félix, J.Á. (2006). Paragraph-Level Alignment of an English-Spanish Parallel Corpus of Fiction Texts Using Bilingual Dictionaries. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_8
DOI: https://doi.org/10.1007/11846406_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6
eBook Packages: Computer Science (R0)