Paragraph Alignment for English-Vietnamese Parallel E-Books

  • Conference paper
Knowledge and Systems Engineering

Part of the book series: Advances in Intelligent Systems and Computing (AISC, volume 245)

Abstract

Parallel corpora are among the most important linguistic resources in multilingual language processing, with applications such as statistical machine translation and cross-language information retrieval. Constructing such corpora manually is very costly, yet many parallel e-books are available and contain large amounts of parallel text. This paper focuses on the task of aligning paragraphs in English-Vietnamese parallel e-books and proposes a new method for this alignment. In an experiment, we collected an English-Vietnamese parallel corpus containing nearly 40,000 sentence pairs aligned at the paragraph level.
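
To make the task concrete, the following is a minimal sketch of paragraph alignment by dynamic programming over paragraph lengths. It is not the method proposed in the paper, which the abstract does not describe; it is a generic length-based baseline. The function names, the set of allowed alignment beads, the penalty values, and the `ratio` parameter are all assumptions chosen for illustration.

```python
import math

# Allowed alignment beads: (source paragraphs consumed, target paragraphs consumed)
# mapped to a fixed penalty. These penalty values are illustrative assumptions.
BEADS = {(1, 1): 0.0, (1, 0): 3.0, (0, 1): 3.0, (2, 1): 1.0, (1, 2): 1.0}


def length_cost(src_len, tgt_len, ratio=1.0):
    """Cost grows as the target length departs from ratio * source length."""
    return abs(math.log((tgt_len + 1.0) / (src_len * ratio + 1.0)))


def align_paragraphs(src, tgt, ratio=1.0):
    """Return a list of (source indices, target indices) beads covering both texts."""
    n, m = len(src), len(tgt)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if cost[i][j] == inf:
                continue
            for (di, dj), penalty in BEADS.items():
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                # Compare total character counts of the paragraphs in this bead.
                s_len = sum(len(p) for p in src[i:ni])
                t_len = sum(len(p) for p in tgt[j:nj])
                c = cost[i][j] + penalty + length_cost(s_len, t_len, ratio)
                if c < cost[ni][nj]:
                    cost[ni][nj] = c
                    back[ni][nj] = (i, j)
    # Walk the back-pointers from the end to recover the cheapest bead sequence.
    beads = []
    i, j = n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        beads.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    beads.reverse()
    return beads
```

Under these assumptions, two English paragraphs that a translator merged into a single Vietnamese paragraph of similar total length would be expected to come out as one 2-1 bead. A practical system would likely combine length with lexical evidence such as a bilingual dictionary, and would need `ratio` estimated from data for the language pair.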

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Le, QH., Nguyen, DC., Pham, DH., Le, AC., Huynh, VN. (2014). Paragraph Alignment for English-Vietnamese Parallel E-Books. In: Huynh, V., Denoeux, T., Tran, D., Le, A., Pham, S. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 245. Springer, Cham. https://doi.org/10.1007/978-3-319-02821-7_23

  • DOI: https://doi.org/10.1007/978-3-319-02821-7_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02820-0

  • Online ISBN: 978-3-319-02821-7

  • eBook Packages: Engineering, Engineering (R0)
