A cost-effective lexical acquisition process for large-scale thesaurus translation

Language Resources and Evaluation

Abstract

Thesauri and controlled vocabularies facilitate access to digital collections by explicitly representing the underlying principles of organization. Translation of such resources into multiple languages is an important component of providing multilingual access. However, the specificity of vocabulary terms in most thesauri precludes fully automatic translation using general-domain lexical resources. In this paper, we present an efficient process for leveraging human translations to construct domain-specific lexical resources. This process is illustrated on a thesaurus of 56,000 concepts used to catalog a large archive of oral histories. We elicited human translations on a small subset of concepts, induced a probabilistic phrase dictionary from these translations, and used the resulting resource to automatically translate the rest of the thesaurus. Two separate evaluations demonstrate the acceptability of the automatic translations and the cost-effectiveness of our approach.

Notes

  1. Even after a term is assumed to be translated, some keyword phrases containing that term will also contain other terms of high translation value that have not yet been translated. In some cases, the summed translation value of the untranslated terms will be high enough to warrant adding the keyword phrase to the prioritized list, despite the already translated term.
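
The prioritization described in this note can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the greedy selection loop, the threshold, and the per-term translation values are all assumptions made for illustration, since this page does not state how translation value is computed. The sketch only shows how a keyword phrase can still earn a place on the prioritized list when one of its terms is already translated.

```python
def residual_value(phrase, values, translated):
    """Sum the (hypothetical) translation values of the terms in `phrase`
    that are not yet in the `translated` set."""
    return sum(values.get(term, 0)
               for term in set(phrase.split())
               if term not in translated)


def prioritize(phrases, values, threshold):
    """Greedily build a prioritized list of keyword phrases.

    At each step, pick the phrase whose untranslated terms carry the highest
    summed value; stop once no remaining phrase reaches `threshold`.
    Terms of a selected phrase are then assumed to be translated.
    """
    translated = set()
    remaining = list(phrases)
    selected = []
    while remaining:
        best = max(remaining,
                   key=lambda p: residual_value(p, values, translated))
        if residual_value(best, values, translated) < threshold:
            break
        selected.append(best)
        remaining.remove(best)
        translated.update(best.split())
    return selected


# Hypothetical per-term translation values and keyword phrases.
values = {"family": 5, "life": 6, "business": 4, "school": 1, "records": 1}
phrases = ["family life", "family business", "school records"]

ranked = prioritize(phrases, values, threshold=3)
print(ranked)  # ['family life', 'family business']
```

In this toy run, "family business" is added even though "family" is already covered by "family life", because the untranslated term "business" alone exceeds the assumed threshold; "school records" is left for automatic translation.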

Acknowledgements

Our thanks to Doug Oard for helpful discussions; to our Czech informants; and to Soumya Bhat for her programming efforts. This work was supported in part by NSF IIS Award 0122466 and NSF CISE RI Award EIA0130422. Additional support came from MSMT CR grants #1P05ME786, #LC536, and #MSM0021620838, and from Grant Agency of the Czech Republic grant #GA405/06/0589. The first author would like to thank Esther and Kiri for their kind support.

Author information

Corresponding author

Correspondence to Jimmy Lin.

About this article

Cite this article

Lin, J., Murray, G.C., Dorr, B.J. et al. A cost-effective lexical acquisition process for large-scale thesaurus translation. Lang Resources & Evaluation 43, 27–40 (2009). https://doi.org/10.1007/s10579-008-9074-8

  • DOI: https://doi.org/10.1007/s10579-008-9074-8
