A Bilingual Corpus of Novels Aligned at Paragraph Level

  • Conference paper
Advances in Natural Language Processing (FinTAL 2006)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 4139)


The paper presents a bilingual English-Spanish parallel corpus aligned at the paragraph level. The corpus consists of twelve large novels found on the Internet and converted into text format, with manual correction of formatting problems and errors. We used a dictionary-based algorithm to align the corpus automatically, and we evaluate the alignment results. Very few resources are available for parallel fiction texts, even though such texts are a non-trivial alignment case of considerable size. Alignment approaches based on linguistic data are usually applied to texts from restricted domains, such as laws and manuals. It is not obvious that these methods carry over to fiction, because fiction contains many more cases of non-literal translation than restricted-domain texts do. We show that a dictionary-based method gives good alignment results for fiction texts, namely, state-of-the-art precision.
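To make the idea of dictionary-based paragraph alignment concrete, here is a minimal sketch. It is not the algorithm from the paper, whose details are not given in this abstract. It assumes a toy English-to-Spanish dictionary (`TOY_DICT`), a simple similarity score (the fraction of English tokens with a dictionary translation present in the Spanish text), and a dynamic-programming search over 1-1, 1-2, 2-1 and skip pairings. All names, the scoring function, and the allowed pairings are illustrative assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy dictionary (assumption): English word -> Spanish translations.
TOY_DICT: Dict[str, Set[str]] = {
    "old": {"viejo", "vieja"},
    "man": {"hombre"},
    "sea": {"mar"},
    "night": {"noche"},
    "dark": {"oscuro", "oscura"},
}

PUNCT = ".,;:!?¡¿\"'()"


def tokens(text: str) -> List[str]:
    """Lowercase, split on whitespace, strip surrounding punctuation."""
    stripped = (w.strip(PUNCT) for w in text.lower().split())
    return [w for w in stripped if w]


def similarity(en: str, es: str, dictionary: Dict[str, Set[str]]) -> float:
    """Fraction of English tokens whose translation occurs in the Spanish text."""
    en_tokens = tokens(en)
    es_set = set(tokens(es))
    if not en_tokens:
        return 0.0
    hits = sum(1 for w in en_tokens if dictionary.get(w, set()) & es_set)
    return hits / len(en_tokens)


Bead = Tuple[List[int], List[int]]


def align(en_pars: List[str], es_pars: List[str],
          dictionary: Dict[str, Set[str]]) -> List[Bead]:
    """Pair paragraphs so that the summed similarity is maximal (assumed objective)."""
    n, m = len(en_pars), len(es_pars)
    neg = float("-inf")
    best = [[neg] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    # Allowed pairings: 1-1, 1-2, 2-1, and skipping a paragraph on either side.
    moves = [(1, 1), (1, 2), (2, 1), (1, 0), (0, 1)]
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == neg:
                continue
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                gain = 0.0
                if di and dj:  # skips score zero
                    gain = similarity(" ".join(en_pars[i:ni]),
                                      " ".join(es_pars[j:nj]), dictionary)
                if best[i][j] + gain > best[ni][nj]:
                    best[ni][nj] = best[i][j] + gain
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the chosen pairings.
    beads: List[Bead] = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads


# Usage: the (invented) Spanish side splits the second paragraph in two.
en = ["The old man.", "The dark night.", "The sea."]
es = ["El hombre viejo.", "La noche", "oscura.", "El mar."]
beads = align(en, es, TOY_DICT)
print(beads)
```

On this toy input the second English paragraph is paired with Spanish paragraphs 1 and 2 (zero-based), because only the merged Spanish text covers both "dark" and "night". A real system would also need lemmatization or morphological analysis and a far larger dictionary. The non-literal translation typical of fiction lowers dictionary hit rates, which is the difficulty the abstract points to.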

The work was done with partial support from the Mexican Government (CONACyT, SNI) and the National Polytechnic Institute, Mexico (CGPI, COFAA, PIFI).




Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gelbukh, A., Sidorov, G., Vera-Félix, J.Á. (2006). A Bilingual Corpus of Novels Aligned at Paragraph Level. In: Salakoski, T., Ginter, F., Pyysalo, S., Pahikkala, T. (eds) Advances in Natural Language Processing. FinTAL 2006. Lecture Notes in Computer Science, vol 4139. Springer, Berlin, Heidelberg.

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-37334-6

  • Online ISBN: 978-3-540-37336-0

  • eBook Packages: Computer Science, Computer Science (R0)
