Computing Semantic Similarity Using Large Static Corpora

Dobó, András; Csirik, János

doi:10.1007/978-3-642-35843-2_42

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7741)

Included in the following conference series:

International Conference on Current Trends in Theory and Practice of Computer Science

Abstract

Measuring the semantic similarity of words is of crucial importance in Natural Language Processing. Although many different approaches to this task exist, there is still room for improvement. In contrast to many other methods, which use web search engines or large lexical databases, we developed methods that rely solely on large static corpora. They create a binary or numerical feature vector for each word using statistical information obtained from the corpora. These vectors contain features based on context words or grammatical relations extracted from the corpora, and they employ diverse weighting schemes. After the feature vectors are created, word similarity is calculated using various vector similarity measures. Besides the individual methods, their combinations were also tested. Evaluated on both the Miller-Charles dataset and the TOEFL synonym questions, they achieve results competitive with recent methods.
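The pipeline outlined in the abstract (context-word features, a weighting scheme, then a vector similarity measure) can be illustrated with a minimal sketch. The abstract does not say which weighting schemes or similarity measures the authors used, so the choices here are assumptions made purely for illustration: a symmetric context window, positive pointwise mutual information (PPMI) as the weighting, and cosine as the similarity measure. The function names and the toy corpus are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def cooccurrence_counts(sentences, window=2):
    """Count context words within `window` positions of each target word."""
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def ppmi_vectors(counts):
    """Weight raw counts with positive PMI (an assumed weighting scheme)."""
    total = sum(sum(ctx.values()) for ctx in counts.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}
    context_totals = Counter()
    for ctx in counts.values():
        context_totals.update(ctx)
    vectors = {}
    for word, ctx in counts.items():
        vec = {}
        for context, n in ctx.items():
            pmi = math.log2(n * total / (word_totals[word] * context_totals[context]))
            if pmi > 0:  # keep only positive associations
                vec[context] = pmi
        vectors[word] = vec
    return vectors


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus; a real system would use a large static corpus.
toy_corpus = [
    ["the", "car", "drives", "on", "the", "road"],
    ["the", "automobile", "drives", "on", "the", "road"],
    ["the", "banana", "grows", "on", "the", "tree"],
]
vectors = ppmi_vectors(cooccurrence_counts(toy_corpus))
print(cosine(vectors["car"], vectors["automobile"]))
print(cosine(vectors["car"], vectors["banana"]))
```

In this toy corpus, "car" and "automobile" occur in identical contexts, so they score higher than "car" and "banana". A multiple-choice synonym question could be answered in the same spirit by picking the candidate with the highest similarity to the target word, though whether the paper does exactly this is not stated in the abstract.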

Author information

Authors and Affiliations

Institute of Informatics, University of Szeged, Szeged, Hungary
András Dobó & János Csirik

Editor information

Editors and Affiliations

Department of Mathematics and Computer Science, University of Amsterdam, Plantage Muidergracht 24, 1018 TV, Amsterdam, The Netherlands
Peter van Emde Boas
Informatics Institute, Intelligent Systems Lab Amsterdam, University of Amsterdam, Science Park 904, 1098 XH, Amsterdam, The Netherlands
Frans C. A. Groen
Department of Civil Engineering and Computer Science, University of Rome Tor Vergata, Via del Politecnico 1, 00133, Rome, Italy
Giuseppe F. Italiano
Institute of Computing Science, Poznan University of Technology, ul. Piotrowo 2, 60-965, Poznan, Poland
Jerzy Nawrocki
Hasso-Plattner-Institute for Software Systems Engineering, Prof.-Dr.-Helmert-Str. 2-3, 14482, Potsdam, Germany
Harald Sack

About this paper

Cite this paper

Dobó, A., Csirik, J. (2013). Computing Semantic Similarity Using Large Static Corpora. In: van Emde Boas, P., Groen, F.C.A., Italiano, G.F., Nawrocki, J., Sack, H. (eds) SOFSEM 2013: Theory and Practice of Computer Science. SOFSEM 2013. Lecture Notes in Computer Science, vol 7741. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35843-2_42

DOI: https://doi.org/10.1007/978-3-642-35843-2_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35842-5
Online ISBN: 978-3-642-35843-2
eBook Packages: Computer Science, Computer Science (R0)
