Paragraph-Level Alignment of an English-Spanish Parallel Corpus of Fiction Texts Using Bilingual Dictionaries

Gelbukh, Alexander; Sidorov, Grigori; Vera-Félix, José Ángel

doi:10.1007/11846406_8

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4188)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

Aligned parallel corpora are very important linguistic resources, useful in many text processing tasks such as machine translation, word sense disambiguation, and dictionary compilation. Nevertheless, few linguistic resources of this type are available, especially for fiction texts, because the texts are difficult to collect and manual alignment is costly. In this paper, we describe an automatically aligned English-Spanish parallel corpus of fiction texts and evaluate our alignment method, which uses linguistic data, namely existing bilingual dictionaries, to calculate word similarity. The method is based on a simple idea: if a meaningful word is present in the source text, then one of its dictionary translations should be present in the target text. Experimental results of alignment at the paragraph level are described.
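
The idea stated in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy dictionary, the stopword list, the tokenizer, and the function name `dictionary_similarity` are all hypothetical, and the score (the fraction of content words with a dictionary translation in the target paragraph) is only one plausible way to turn the idea into a number. A real system would also need morphological normalization on both sides.

```python
from typing import Dict, List, Set

# Hypothetical toy English-Spanish dictionary, for illustration only.
# Entries are surface forms because this sketch does no lemmatization.
TOY_DICT: Dict[str, Set[str]] = {
    "house": {"casa", "hogar"},
    "dog": {"perro"},
    "white": {"blanco", "blanca"},
    "sleeps": {"duerme"},
}

# Assumed list of function words that are not treated as "meaningful".
STOPWORDS: Set[str] = {"the", "a", "in", "of", "and"}


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    tokens = [w.strip(".,;:!?¡¿\"'").lower() for w in text.split()]
    return [t for t in tokens if t]


def dictionary_similarity(source: str, target: str,
                          dictionary: Dict[str, Set[str]]) -> float:
    """Fraction of source content words that have at least one
    dictionary translation present in the target paragraph."""
    target_words = set(tokenize(target))
    content = [w for w in tokenize(source) if w not in STOPWORDS]
    if not content:
        return 0.0
    hits = sum(1 for w in content
               if dictionary.get(w, set()) & target_words)
    return hits / len(content)


english = "The white dog sleeps in the house."
aligned = dictionary_similarity(english, "El perro blanco duerme en la casa.", TOY_DICT)
partial = dictionary_similarity(english, "El perro corre.", TOY_DICT)
unrelated = dictionary_similarity(english, "La lluvia cae sobre el mar.", TOY_DICT)
print(aligned, partial, unrelated)
```

Under these assumptions, a correctly paired paragraph scores higher than a partially matching or unrelated one, so an aligner could compare candidate pairings by this score.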

Work done under partial support of the Mexican Government (CONACyT, SNI) and the National Polytechnic Institute, Mexico (SIP, COFAA, PIFI).

Author information

Authors and Affiliations

Natural Language and Text Processing Laboratory, Center for Research in Computer Science, National Polytechnic Institute, Av. Juan de Dios Bátiz, s/n, Zacatenco, 07738, Mexico City, Mexico
Alexander Gelbukh, Grigori Sidorov & José Ángel Vera-Félix

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Masaryk University, Botanická 68a, CZ-602 00, Brno, Czech Republic
Ivan Kopeček
Faculty of Informatics, Department of Computer Graphics and Design, Masaryk University, Botanická 68a, 60200, Brno, Czech Republic
Karel Pala

About this paper

Cite this paper

Gelbukh, A., Sidorov, G., Vera-Félix, J.Á. (2006). Paragraph-Level Alignment of an English-Spanish Parallel Corpus of Fiction Texts Using Bilingual Dictionaries. In: Sojka, P., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2006. Lecture Notes in Computer Science, vol 4188. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11846406_8

DOI: https://doi.org/10.1007/11846406_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-39090-9
Online ISBN: 978-3-540-39091-6
eBook Packages: Computer Science, Computer Science (R0)
