Abstract
In machine translation, collocation dictionaries are important for selecting accurate target words. However, a dictionary that is too large can reduce translation efficiency. This paper presents a method for building a compact collocation dictionary for transitive verb–object pairs in English–Korean machine translation without losing translation accuracy. We use WordNet to calculate the semantic distance between words, and k-nearest-neighbor learning to select the translations. The number of dictionary entries is minimized to balance the trade-off between translation accuracy and translation time. We performed several experiments on a selected set of verbs extracted from a raw corpus of over 3 million words. The results show that, in real-time translation environments, the size of a collocation dictionary can be reduced up to 40% of its original size without a significant decrease in accuracy.
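The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hypernym tree below is a small hand-written stand-in for WordNet, the edge-counting distance is one common choice rather than the measure used in the paper, and the dictionary entries for the verb "wear" (with romanized Korean translations) are hypothetical examples. The idea shown is that an object noun missing from the compact dictionary can still be translated by voting among its k semantically nearest stored entries.

```python
from collections import Counter

# Hypothetical toy hypernym tree (child -> parent), standing in for WordNet.
HYPERNYM = {
    "shirt": "garment",
    "coat": "garment",
    "sweater": "garment",
    "garment": "clothing",
    "hat": "headdress",
    "helmet": "headdress",
    "cap": "headdress",
    "headdress": "clothing",
    "shoe": "footwear",
    "sandal": "footwear",
    "boot": "footwear",
    "footwear": "clothing",
    "clothing": "artifact",
}


def path_to_root(word):
    """Return the chain of nodes from `word` up to the root of the toy tree."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def distance(a, b):
    """Count the edges between a and b through their lowest common ancestor."""
    path_a, path_b = path_to_root(a), path_to_root(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in depth_in_b:
            return i + depth_in_b[node]
    # No common ancestor in the toy tree: treat the words as far apart.
    return len(path_a) + len(path_b)


# Hypothetical compact dictionary for the verb "wear":
# (object noun, romanized Korean verb). Assumed entries for illustration only.
WEAR_ENTRIES = [
    ("shirt", "ipda"),
    ("coat", "ipda"),
    ("hat", "sseuda"),
    ("helmet", "sseuda"),
    ("shoe", "sinda"),
    ("sandal", "sinda"),
]


def select_translation(obj, entries, k=3):
    """Pick a translation by majority vote among the k nearest stored objects."""
    nearest = sorted(entries, key=lambda entry: distance(obj, entry[0]))[:k]
    votes = Counter(translation for _, translation in nearest)
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    for noun in ("sweater", "cap", "boot"):
        print(noun, "->", select_translation(noun, WEAR_ENTRIES))
```

In this sketch, none of "sweater", "cap", or "boot" is stored in the dictionary, yet each is resolved through its neighbors. Removing entries shrinks the dictionary but leaves fewer neighbors to vote, which is the accuracy versus size trade-off the abstract refers to.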
Cite this article
Kim, Y., Zhang, BT. & Kim, Y.T. Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning. Machine Translation 16, 89–108 (2001). https://doi.org/10.1023/A:1014540107013