Abstract
This paper addresses the automatic acquisition of lexical knowledge for the rapid construction of engines for machine translation and embedded multilingual applications. We describe new techniques for large-scale construction of a Chinese–English verb lexicon, and we evaluate the coverage and effectiveness of the resulting lexicon. Building on an existing Chinese conceptual database, HowNet, and a large, semantically rich English verb database, we use thematic-role information to create links between Chinese concepts and English classes. We apply the metrics of recall and precision to evaluate the coverage and effectiveness of the linguistic resources. The results indicate that: (a) we are able to obtain reliable Chinese–English entries both with and without pre-existing semantic links between the two languages; (b) when pre-existing semantic links are available, merging them with our semantically rich English database yields a more robust lexical resource; and (c) in comparisons with manual lexicon creation, our automatic techniques achieve 62% precision, compared with a much lower 10% for arbitrary assignment of semantic links.
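As an illustration of the evaluation metrics named above, the following minimal sketch computes precision and recall for a set of automatically proposed links against a manually created reference set. It is not the authors' evaluation code: the function name `precision_recall`, the representation of a link as a (Chinese concept, English class) pair, and the toy identifiers are all assumptions made for this example.

```python
def precision_recall(predicted, gold):
    """Return (precision, recall) for two collections of links.

    Each link is assumed to be a (chinese_concept, english_class) pair;
    this representation is hypothetical, chosen only for illustration.
    """
    predicted, gold = set(predicted), set(gold)
    correct = predicted & gold  # links proposed automatically that also appear in the manual set
    precision = len(correct) / len(predicted) if predicted else 0.0
    recall = len(correct) / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical toy data: placeholder identifiers, not real HowNet concepts or verb classes.
gold = {
    ("concept_1", "class_A"),
    ("concept_2", "class_B"),
    ("concept_3", "class_C"),
    ("concept_4", "class_D"),
}
predicted = {
    ("concept_1", "class_A"),
    ("concept_2", "class_B"),
    ("concept_3", "class_X"),  # an incorrect link
}

p, r = precision_recall(predicted, gold)
print(f"precision={p:.2f} recall={r:.2f}")
```

In this toy case two of the three proposed links are correct, and two of the four reference links are recovered. Precision penalizes incorrect links, while recall penalizes reference links that were missed.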
Cite this article
Dorr, B.J., Levow, GA. & Lin, D. Construction of a Chinese–English Verb Lexicon for Machine Translation and Embedded Multilingual Applications. Machine Translation 17, 99–137 (2002). https://doi.org/10.1023/B:COAT.0000010116.83274.c3
DOI: https://doi.org/10.1023/B:COAT.0000010116.83274.c3