
The automatic generation of thesauri of related words for English, French, German, and Russian

International Journal of Speech Technology

Abstract

A method for the automatic extraction of words with similar meanings is presented. It is based on the analysis of word distribution in large monolingual text corpora. The method involves compiling matrices of word co-occurrences and reducing the dimensionality of the semantic space by means of a singular value decomposition. This reduces problems of data sparseness and achieves a generalization effect which considerably improves the results. The method is largely language-independent and has been applied to corpora of English, French, German, and Russian; the resulting thesauri are freely available. The English thesaurus has been evaluated by comparing it to experimental results obtained from test persons who were asked to judge word similarities. According to this evaluation, the machine-generated results come close to the performance of native speakers.
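
The pipeline outlined in the abstract (co-occurrence counting, dimensionality reduction by singular value decomposition, then comparison of the reduced word vectors) can be illustrated with a minimal sketch. This is not the author's implementation. The toy corpus, the window size of 2, the use of raw counts without any association weighting, the number of retained dimensions, and the cosine measure are all assumptions chosen for illustration, and the function names are hypothetical.

```python
import numpy as np


def cooccurrence_matrix(sentences, window=2):
    """Count how often each word pair occurs within `window` tokens (assumed size)."""
    vocab = sorted({w for s in sentences for w in s})
    index = {w: i for i, w in enumerate(vocab)}
    counts = np.zeros((len(vocab), len(vocab)), dtype=float)
    for s in sentences:
        for i, w in enumerate(s):
            lo, hi = max(0, i - window), min(len(s), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[index[w], index[s[j]]] += 1.0
    return vocab, counts


def reduce_dimensions(counts, k):
    """Project each word onto the k strongest singular directions of the matrix."""
    u, s, _ = np.linalg.svd(counts, full_matrices=False)
    k = min(k, len(s))
    return u[:, :k] * s[:k]  # one k-dimensional row vector per word


def most_similar(word, vocab, vectors, top_n=3):
    """Rank all other words by cosine similarity to `word` (assumed measure)."""
    norms = np.linalg.norm(vectors, axis=1)
    norms[norms == 0.0] = 1.0  # avoid division by zero for empty rows
    unit = vectors / norms[:, None]
    sims = unit @ unit[vocab.index(word)]
    order = np.argsort(-sims)
    return [(vocab[i], float(sims[i])) for i in order if vocab[i] != word][:top_n]


# Toy corpus, purely illustrative; the paper works with large monolingual corpora.
corpus = [
    "the cat drinks milk".split(),
    "the dog drinks water".split(),
    "the cat chases the mouse".split(),
    "the dog chases the ball".split(),
]

vocab, counts = cooccurrence_matrix(corpus, window=2)
vectors = reduce_dimensions(counts, k=3)
neighbours = most_similar("cat", vocab, vectors, top_n=3)
print(neighbours)
```

On a corpus this small the ranking carries little meaning. The sketch only shows the sequence of steps; the generalization effect described above would be expected to appear only with corpora of realistic size.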




Corresponding author

Correspondence to Reinhard Rapp.


About this article

Cite this article

Rapp, R. The automatic generation of thesauri of related words for English, French, German, and Russian. Int J Speech Technol 11, 147 (2008). https://doi.org/10.1007/s10772-009-9043-7


  • DOI: https://doi.org/10.1007/s10772-009-9043-7
