Building a multi-domain comparable corpus using a learning to rank method†

RAZIEH RAHIMI; AZADEH SHAKERY; JAVID DADASHKARIMI; MOZHDEH ARIANNEZHAD; MOSTAFA DEHGHANI; HOSSEIN NASR ESFAHANI

doi:10.1017/S1351324916000164


Published online by Cambridge University Press: 15 June 2016


Affiliations:
RAZIEH RAHIMI: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
AZADEH SHAKERY: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran; School of Computer Science, Institute for Research in Fundamental Sciences (IPM), Tehran, Iran
JAVID DADASHKARIMI: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOZHDEH ARIANNEZHAD: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
MOSTAFA DEHGHANI: Institute for Logic, Language and Computation, University of Amsterdam, Amsterdam, The Netherlands; e-mail: dehghani@uva.nl
HOSSEIN NASR ESFAHANI: School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
e-mails (University of Tehran authors): razrahimi@ut.ac.ir, shakery@ut.ac.ir, dadashkarimi@ut.ac.ir, m.ariannezhad@ut.ac.ir, h_nasr@ut.ac.ir


Abstract

Comparable corpora are key translation resources for both languages and domains with limited linguistic resources. Existing approaches for building comparable corpora are mostly based on ranking candidate documents in the target language for each source document using a cross-lingual retrieval model. These approaches also exploit other evidence of document similarity, such as proper names and publication dates, to build more reliable alignments. However, the importance of each piece of evidence in the scores of candidate target documents is determined heuristically. In this paper, we employ a learning to rank method for ranking candidate target documents with respect to each source document. The ranking model is constructed by defining each piece of evidence for the similarity of bilingual documents as a feature whose weight is learned automatically. Learning feature weights can significantly improve the quality of alignments, because the reliability of features depends on the characteristics of both the source and target languages of a comparable corpus. We also propose a method to generate appropriate training data for the task of building comparable corpora. We employed the proposed learning-based approach to build a multi-domain English–Persian comparable corpus which covers twelve different domains obtained from the Open Directory Project. Experimental results show that the created alignments have high degrees of comparability. Comparison with existing approaches for building comparable corpora shows that our learning-based approach improves both the quality and coverage of alignments.
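
As a rough illustration of the idea of learning feature weights instead of setting them heuristically, the following minimal sketch trains a linear scoring function with perceptron-style pairwise updates. The abstract does not say which learning to rank algorithm, features, or data the authors used, so everything here is an assumption made for illustration: the feature names (`retrieval_score`, `proper_name_overlap`, `date_proximity`), the toy feature values, the document identifiers, and the pairwise perceptron itself.

```python
# Illustrative sketch only: a pairwise learning to rank toy, NOT the paper's method.
# Feature names, values, and document ids below are hypothetical.

FEATURES = ["retrieval_score", "proper_name_overlap", "date_proximity"]


def score(weights, features):
    """Linear score: weighted sum of the similarity features."""
    return sum(w * x for w, x in zip(weights, features))


def train_pairwise(groups, epochs=50, lr=0.1):
    """Learn weights so that, within each source document's candidate list,
    comparable targets (label 1) outscore non-comparable ones (label 0).

    groups: one list per source document of (feature_vector, label) pairs.
    """
    weights = [0.0] * len(FEATURES)
    for _ in range(epochs):
        for candidates in groups:
            positives = [f for f, y in candidates if y == 1]
            negatives = [f for f, y in candidates if y == 0]
            for pos in positives:
                for neg in negatives:
                    diff = [p - n for p, n in zip(pos, neg)]
                    # Update only when the pair is tied or ordered wrongly.
                    if score(weights, diff) <= 0:
                        weights = [w + lr * d for w, d in zip(weights, diff)]
    return weights


def rank(weights, candidates):
    """Sort (doc_id, feature_vector) pairs by descending learned score."""
    return sorted(candidates, key=lambda c: score(weights, c[1]), reverse=True)


# Toy training data: two source documents, each with labelled target candidates.
TRAIN_GROUPS = [
    [([0.90, 0.8, 1.0], 1), ([0.95, 0.1, 0.2], 0), ([0.30, 0.2, 0.1], 0)],
    [([0.60, 0.7, 0.9], 1), ([0.70, 0.0, 0.3], 0)],
]

if __name__ == "__main__":
    learned = train_pairwise(TRAIN_GROUPS)
    print(dict(zip(FEATURES, learned)))
    new_candidates = [
        ("target_doc_a", [0.9, 0.1, 0.1]),
        ("target_doc_b", [0.7, 0.9, 0.8]),
    ]
    for doc_id, feats in rank(learned, new_candidates):
        print(doc_id, round(score(learned, feats), 4))
```

In this toy setting the retrieval score alone would put a wrong candidate first for both training sources, so the learned weights end up relying on the other two features. This is only meant to show why fixed, heuristic weights can be suboptimal; it says nothing about the weights or results reported in the paper.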

Type: Articles
Information: Natural Language Engineering, Volume 22, Issue 4: Machine Translation Using Comparable Corpora, July 2016, pp. 627–653

DOI: https://doi.org/10.1017/S1351324916000164
Copyright: © Cambridge University Press 2016


Footnotes

† This research was supported in part by a grant from the Institute for Research in Fundamental Sciences (No. CS1393-4-43).

