Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning

Kim, Yuseop; Zhang, Byoung-Tak; Kim, Yung Taek

doi:10.1023/A:1014540107013

Published: June 2001

Machine Translation, Volume 16, pages 89–108 (2001)

Abstract

In machine translation, collocation dictionaries are important for selecting accurate target words. However, if the dictionary is too large, it can decrease the efficiency of translation. This paper presents a method to develop a compact collocation dictionary for transitive verb–object pairs in English–Korean machine translation without losing translation accuracy. We use WordNet to calculate the semantic distance between words, and k-nearest neighbor learning to select the translations. The entries in the dictionary are minimized to balance the trade-off between translation accuracy and time. We have performed several experiments on a selected set of verbs extracted from a raw corpus of over 3 million words. The results show that in real-time translation environments the size of a collocation dictionary can be reduced up to 40% of its original size without a significant decrease in its accuracy.
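
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the hypernym table below is a small hypothetical stand-in for WordNet, the edge-counting distance is only one possible semantic distance, and the collocation entries for the verb "take" (with romanized Korean verbs as labels) are invented for illustration. All function and variable names are assumptions.

```python
from collections import Counter

# Toy hypernym links (child -> parent). Hypothetical stand-in for WordNet,
# not real WordNet data.
HYPERNYM = {
    "coffee": "beverage",
    "tea": "beverage",
    "beverage": "substance",
    "medicine": "substance",
    "substance": "entity",
    "bus": "vehicle",
    "taxi": "vehicle",
    "train": "vehicle",
    "vehicle": "artifact",
    "artifact": "entity",
}

# Hypothetical collocation entries for the verb "take":
# (object noun, romanized Korean verb). Illustrative only.
TAKE_ENTRIES = [
    ("bus", "tada"),
    ("taxi", "tada"),
    ("medicine", "meokda"),
    ("tea", "masida"),
]


def path_to_root(word):
    """Return the chain of nodes from word up to the top of the toy hierarchy."""
    path = [word]
    while path[-1] in HYPERNYM:
        path.append(HYPERNYM[path[-1]])
    return path


def distance(a, b):
    """Count the edges between a and b through their lowest common ancestor."""
    path_a = path_to_root(a)
    path_b = path_to_root(b)
    depth_in_b = {node: i for i, node in enumerate(path_b)}
    for i, node in enumerate(path_a):
        if node in depth_in_b:
            return i + depth_in_b[node]
    # No shared ancestor: fall back to a value larger than any real path.
    return len(path_a) + len(path_b)


def select_translation(noun, entries=TAKE_ENTRIES, k=3):
    """Pick the verb translation by majority vote among the k nearest entries."""
    ranked = sorted(entries, key=lambda entry: distance(noun, entry[0]))
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


if __name__ == "__main__":
    print(select_translation("train", k=3))
    print(select_translation("coffee", k=1))
```

In this toy setup the object noun "train" is absent from the dictionary, yet its two nearest entries ("bus" and "taxi") both vote for "tada". This is the property that lets a compact dictionary stand in for a larger one: an entry can be dropped whenever its neighbors already yield the same translation.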

Author information

Authors and Affiliations

Department of Computer Engineering, Seoul National University, Seoul, 151-742, Korea
Yuseop Kim, Byoung-Tak Zhang & Yung Taek Kim

About this article

Cite this article

Kim, Y., Zhang, BT. & Kim, Y.T. Collocation Dictionary Optimization Using WordNet and k-Nearest Neighbor Learning. Machine Translation 16, 89–108 (2001). https://doi.org/10.1023/A:1014540107013

Issue Date: June 2001
DOI: https://doi.org/10.1023/A:1014540107013
