Constructing and utilizing wordnets using statistical methods

  • Original Paper
  • Language Resources and Evaluation

Abstract

Lexical databases following the wordnet paradigm capture information about words, word senses, and their relationships. A large number of existing tools and datasets are based on the original WordNet, so extending the landscape of resources aligned with WordNet leads to great potential for interoperability and to substantial synergies. Wordnets are being compiled for a considerable number of languages; however, most have yet to reach a comparable level of coverage. We propose a method for automatically producing such resources for new languages based on WordNet, and analyse the implications of this approach both from a linguistic perspective and with regard to natural language processing tasks. Our approach takes advantage of the original WordNet in conjunction with translation dictionaries. A small set of training associations is used to learn a statistical model for predicting associations between terms and senses. The associations are represented using a variety of scores that take into account structural properties as well as semantic relatedness and corpus frequency information. Although the resulting wordnets are imperfect in terms of their quality and coverage of language-specific phenomena, we show that they constitute a cheap and suitable alternative for many applications, both for monolingual tasks and for cross-lingual interoperability. Apart from analysing the resources directly, we conducted tests on semantic relatedness assessment and cross-lingual text classification with very promising results.
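The pipeline outlined in the abstract (translation dictionary lookup, scoring of term-sense pairs, and a model learned from a few labelled associations) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the sense ids, the Spanish-English dictionary entries, the two scores, and the training labels are hypothetical toy data, and a plain logistic regression stands in for the statistical model, which the abstract does not name.

```python
import math

# Hypothetical toy sense inventory: sense id -> English lemmas.
# The ids are made up and are not real WordNet identifiers.
SENSES = {
    "slope_beside_water": ["bank", "shore"],
    "financial_institution": ["bank"],
    "long_seat": ["bench"],
    "land_along_sea": ["shore", "coast"],
}

# Hypothetical Spanish-to-English translation dictionary.
TRANSLATIONS = {
    "orilla": ["bank", "shore"],
    "banco": ["bank", "bench"],
    "costa": ["coast", "shore"],
}


def senses_of(english_word):
    """All toy senses that list the English word as a lemma."""
    return [s for s, lemmas in SENSES.items() if english_word in lemmas]


def candidates(term):
    """Candidate senses for a foreign term, reached via its translations."""
    found = {s for t in TRANSLATIONS[term] for s in senses_of(t)}
    return sorted(found)


def features(term, sense):
    """Bias plus two illustrative scores for a (term, sense) pair."""
    translations = TRANSLATIONS[term]
    matching = [t for t in translations if t in SENSES[sense]]
    # Share of the term's translations that are lemmas of the sense.
    share = len(matching) / len(translations)
    # Best inverse polysemy among matching translations (1.0 = unambiguous).
    inv_polysemy = max((1 / len(senses_of(t)) for t in matching), default=0.0)
    return [1.0, share, inv_polysemy]


def predict(weights, term, sense):
    """Estimated probability that the term lexicalizes the sense."""
    z = sum(w * x for w, x in zip(weights, features(term, sense)))
    return 1 / (1 + math.exp(-z))


def train(examples, epochs=5000, lr=0.5):
    """Fit logistic regression weights by batch gradient descent."""
    weights = [0.0, 0.0, 0.0]
    for _ in range(epochs):
        grad = [0.0, 0.0, 0.0]
        for term, sense, label in examples:
            error = predict(weights, term, sense) - label
            for i, x in enumerate(features(term, sense)):
                grad[i] += error * x / len(examples)
        weights = [w - lr * g for w, g in zip(weights, grad)]
    return weights


# Hypothetical labelled associations (1 = term expresses the sense).
TRAINING = [
    ("orilla", "slope_beside_water", 1),
    ("orilla", "financial_institution", 0),
    ("costa", "land_along_sea", 1),
    ("costa", "slope_beside_water", 0),
]

weights = train(TRAINING)

# Score the candidate senses of a term that was not in the training set.
for sense in candidates("banco"):
    print(sense, round(predict(weights, "banco", sense), 3))
```

With only two scores and four training pairs, the sketch cannot separate every candidate for an unseen term. That is consistent with the abstract's point that a variety of scores, including semantic relatedness and corpus frequency information, feeds into the model.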

Author information

Corresponding author

Correspondence to Gerard de Melo.

About this article

Cite this article

de Melo, G., Weikum, G. Constructing and utilizing wordnets using statistical methods. Lang Resources & Evaluation 46, 287–311 (2012). https://doi.org/10.1007/s10579-012-9183-2

  • DOI: https://doi.org/10.1007/s10579-012-9183-2
