Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning

Machine Translation

Abstract

In machine translation, collocation dictionaries are important for selecting accurate target words. However, if the dictionary is too large, it can decrease the efficiency of translation. This paper presents a method to develop a compact collocation dictionary for transitive verb–object pairs in English–Korean machine translation without losing translation accuracy. We use WordNet to calculate the semantic distance between words, and k-nearest neighbor learning to select the translations. The entries in the dictionary are minimized to balance the trade-off between translation accuracy and time. We have performed several experiments on a selected set of verbs extracted from a raw corpus of over 3 million words. The results show that in real-time translation environments the size of a collocation dictionary can be reduced up to 40% of its original size without significant decrease in its accuracy.
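
The idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a tiny hand-made hypernym tree in place of WordNet, edge-counting path length as the semantic distance, majority voting for k-nearest neighbor selection, and a simple greedy pruning rule. The names `PARENT`, `ENTRIES`, `distance`, `translate`, and `prune` are hypothetical. The example entries are for the verb "wear", whose Korean translation depends on the object (romanized here as `ipda` for clothes, `sseuda` for headwear, `sinda` for footwear).

```python
from collections import Counter

# Hypothetical toy hypernym tree (child -> parent); a stand-in for WordNet.
PARENT = {
    "clothing": "entity",
    "garment": "clothing", "headwear": "clothing", "footwear": "clothing",
    "shirt": "garment", "coat": "garment", "jacket": "garment",
    "hat": "headwear", "cap": "headwear", "helmet": "headwear",
    "shoe": "footwear", "boot": "footwear", "sandal": "footwear",
}

# Illustrative collocation entries for "wear": (object noun, Korean verb).
ENTRIES = [
    ("shirt", "ipda"), ("coat", "ipda"),
    ("hat", "sseuda"), ("cap", "sseuda"),
    ("shoe", "sinda"), ("boot", "sinda"),
]


def path_to_root(word):
    """Return the chain of nodes from `word` up to the root of the tree."""
    path = [word]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def distance(a, b):
    """Count the edges between a and b via their lowest common ancestor."""
    depth_in_b = {node: i for i, node in enumerate(path_to_root(b))}
    for steps_a, node in enumerate(path_to_root(a)):
        if node in depth_in_b:
            return steps_a + depth_in_b[node]
    raise ValueError(f"no common ancestor for {a!r} and {b!r}")


def translate(obj, entries, k=1):
    """Pick the translation by majority vote among the k nearest entries."""
    neighbors = sorted(entries, key=lambda e: distance(obj, e[0]))[:k]
    votes = Counter(translation for _, translation in neighbors)
    return votes.most_common(1)[0][0]


def prune(entries, k=1):
    """Greedy pruning (an assumed heuristic): drop an entry whenever the
    remaining entries still give it the same translation."""
    kept = list(entries)
    for entry in entries:
        rest = [e for e in kept if e != entry]
        if rest and translate(entry[0], rest, k) == entry[1]:
            kept = rest
    return kept


if __name__ == "__main__":
    compact = prune(ENTRIES)
    print(len(ENTRIES), "entries reduced to", len(compact))
    for noun in ("jacket", "helmet", "sandal"):
        print(noun, "->", translate(noun, compact))
```

In this toy setting the pruned dictionary keeps one representative object per translation, and unseen objects such as "jacket" still receive the same translation as before because they remain closest to that representative in the tree. A real system would need a distance defined over WordNet senses and an accuracy check on held-out data before removing entries.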

About this article

Cite this article

Kim, Y., Zhang, BT. & Kim, Y.T. Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning. Machine Translation 16, 89–108 (2001). https://doi.org/10.1023/A:1014540107013

  • DOI: https://doi.org/10.1023/A:1014540107013
