Computing Semantic Similarity Using Large Static Corpora

  • Conference paper
SOFSEM 2013: Theory and Practice of Computer Science (SOFSEM 2013)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7741)

Abstract

Measuring the semantic similarity of words is of crucial importance in Natural Language Processing. Although there are many different approaches to this task, there is still room for improvement. In contrast to many other methods that use web search engines or large lexical databases, we developed methods that rely solely on large static corpora. They create a binary or numerical feature vector for each word, making use of statistical information obtained from the corpora. These vectors contain features based on context words or grammatical relations extracted from the corpora, and they employ diverse weighting schemes. After the feature vectors are created, word similarity is calculated using various vector similarity measures. Besides the individual methods, their combinations were also tested. Evaluated on both the Miller-Charles dataset and the TOEFL synonym questions, the methods achieve results competitive with recent methods.
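
The pipeline described in the abstract (context-word features, a weighting scheme, a vector similarity measure) can be illustrated with a minimal sketch. The abstract does not say which window size, weighting scheme, or similarity measure performed best, so the choices below are assumptions made purely for illustration: a symmetric window of two tokens, positive pointwise mutual information (PPMI) weighting, and cosine similarity. All function names are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count context words within `window` tokens on either side of each word.

    `sentences` is an iterable of token lists. The window size is an
    assumption; the abstract does not specify one.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def ppmi_vectors(counts):
    """Turn raw co-occurrence counts into sparse PPMI-weighted vectors.

    PPMI is one possible weighting scheme, chosen here as an assumption.
    Features with non-positive PMI are dropped from the sparse vector.
    """
    total = sum(sum(ctx.values()) for ctx in counts.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}
    context_totals = Counter()
    for ctx in counts.values():
        context_totals.update(ctx)

    vectors = {}
    for word, ctx in counts.items():
        vec = {}
        for context, n in ctx.items():
            pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
            if pmi > 0:
                vec[context] = pmi
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def most_similar(target, candidates, vectors):
    """Pick the candidate whose vector is closest to the target's vector.

    This mirrors the shape of a multiple-choice synonym question.
    """
    target_vec = vectors.get(target, {})
    return max(candidates, key=lambda c: cosine(target_vec, vectors.get(c, {})))
```

In this sketch, a multiple-choice synonym question is answered by calling `most_similar` with the question word and its candidate answers, while a word-pair dataset would be scored by calling `cosine` on each pair and correlating the scores with the human ratings. A binary feature vector, which the abstract also mentions, corresponds to replacing each weight with 1.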


Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Dobó, A., Csirik, J. (2013). Computing Semantic Similarity Using Large Static Corpora. In: van Emde Boas, P., Groen, F.C.A., Italiano, G.F., Nawrocki, J., Sack, H. (eds) SOFSEM 2013: Theory and Practice of Computer Science. SOFSEM 2013. Lecture Notes in Computer Science, vol 7741. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35843-2_42

  • DOI: https://doi.org/10.1007/978-3-642-35843-2_42

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-35842-5

  • Online ISBN: 978-3-642-35843-2

  • eBook Packages: Computer Science, Computer Science (R0)
