Abstract
Parallel corpora are among the most important linguistic resources for multilingual language processing tasks such as statistical machine translation and cross-language information retrieval. Constructing such corpora manually is very costly, yet many parallel e-books are available that contain large amounts of parallel text. This paper focuses on the task of aligning the paragraphs of English-Vietnamese parallel e-books and proposes a new method for this alignment. In an experiment, we collected an English-Vietnamese parallel corpus containing nearly 40,000 sentence pairs aligned at the paragraph level.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Le, QH., Nguyen, DC., Pham, DH., Le, AC., Huynh, VN. (2014). Paragraph Alignment for English-Vietnamese Parallel E-Books. In: Huynh, V., Denoeux, T., Tran, D., Le, A., Pham, S. (eds) Knowledge and Systems Engineering. Advances in Intelligent Systems and Computing, vol 245. Springer, Cham. https://doi.org/10.1007/978-3-319-02821-7_23
DOI: https://doi.org/10.1007/978-3-319-02821-7_23
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-02820-0
Online ISBN: 978-3-319-02821-7
eBook Packages: Engineering, Engineering (R0)