Abstract
Multilingual corpora are valuable resources for cross-language information retrieval and are available for many language pairs. However, the Persian language lacks rich multilingual resources, owing to some of its special features and the difficulty of constructing such corpora. In this study, we build a Persian-English comparable corpus from two independent news collections: BBC News in English and Hamshahri news in Persian. We use the similarity of the document topics and their publication dates to align the documents in these sets. We tried several alternatives for constructing the comparable corpus and assessed the quality of the resulting corpora using different criteria. Evaluation results show that the aligned documents are of high quality, and using the Persian-English comparable corpus to extract translation knowledge appears promising.
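The alignment idea in the abstract (pair documents whose topics are similar and whose publication dates are close) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each Persian document has already been mapped to English terms (for example by dictionary translation of its keywords), it uses plain cosine similarity over term counts as the topic measure, and the names `align`, `max_days`, and `min_sim`, along with their default values, are hypothetical.

```python
from collections import Counter
from datetime import date
from math import sqrt


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def align(source_docs, target_docs, max_days=1, min_sim=0.3):
    """Pair each source document with its most similar target document.

    Each document is a tuple (doc_id, publication_date, terms).
    Only targets published within max_days of the source are considered,
    and a pair is kept only if its similarity is at least min_sim.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    pairs = []
    for s_id, s_date, s_terms in source_docs:
        s_vec = Counter(s_terms)
        best_id, best_sim = None, 0.0
        for t_id, t_date, t_terms in target_docs:
            # Date filter: skip targets published too far from the source.
            if abs((s_date - t_date).days) > max_days:
                continue
            sim = cosine(s_vec, Counter(t_terms))
            if sim > best_sim:
                best_id, best_sim = t_id, sim
        if best_id is not None and best_sim >= min_sim:
            pairs.append((s_id, best_id, best_sim))
    return pairs
```

Calling `align(persian_docs, english_docs)` returns a list of `(source_id, target_id, similarity)` tuples. Widening `max_days` or lowering `min_sim` trades alignment precision for coverage, which is the kind of alternative the abstract says was compared.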
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Baradaran Hashemi, H., Shakery, A., Faili, H. (2010). Creating a Persian-English Comparable Corpus. In: Agosti, M., Ferro, N., Peters, C., de Rijke, M., Smeaton, A. (eds) Multilingual and Multimodal Information Access Evaluation. CLEF 2010. Lecture Notes in Computer Science, vol 6360. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15998-5_5
DOI: https://doi.org/10.1007/978-3-642-15998-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-15997-8
Online ISBN: 978-3-642-15998-5
eBook Packages: Computer Science (R0)