Abstract
Measuring the semantic similarity of words is of crucial importance in Natural Language Processing. Although many different approaches to this task exist, there is still room for improvement. In contrast to many other methods, which use web search engines or large lexical databases, we developed methods that rely solely on large static corpora. Using statistical information obtained from the corpora, they create a binary or numerical feature vector for each word. These vectors contain features based on context words or grammatical relations extracted from the corpora, and they employ diverse weighting schemes. After the feature vectors are created, word similarity is calculated using various vector similarity measures. Besides the individual methods, their combinations were also tested. Evaluated on both the Miller-Charles dataset and the TOEFL synonym questions, they achieve results competitive with recent methods.
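The pipeline outlined in the abstract (feature vectors from context words, a weighting scheme, then a vector similarity measure) can be illustrated with a minimal sketch. The abstract does not say which window size, weighting, or similarity measure the authors used, so everything specific here is an assumption chosen for illustration: a symmetric context window of two tokens, positive pointwise mutual information (PPMI) as the weighting, and cosine similarity as the measure. The function names `context_vectors`, `ppmi_weight`, and `cosine` are hypothetical and do not come from the paper.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=2):
    """Count, for each word, the words seen within `window` tokens of it.

    `sentences` is an iterable of token lists. The window size is an
    illustrative assumption, not a setting taken from the paper.
    """
    counts = defaultdict(Counter)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[word][tokens[j]] += 1
    return counts


def ppmi_weight(counts):
    """Replace raw co-occurrence counts with positive PMI weights.

    PPMI is one possible weighting scheme (assumed here); features with
    non-positive PMI are dropped from the sparse vector.
    """
    total = sum(sum(ctx.values()) for ctx in counts.values())
    word_totals = {w: sum(ctx.values()) for w, ctx in counts.items()}
    ctx_totals = Counter()
    for ctx in counts.values():
        ctx_totals.update(ctx)
    weighted = {}
    for w, ctx in counts.items():
        vec = {}
        for c, n in ctx.items():
            pmi = math.log2(n * total / (word_totals[w] * ctx_totals[c]))
            if pmi > 0:
                vec[c] = pmi
        weighted[w] = vec
    return weighted


def cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(x * v[k] for k, x in u.items() if k in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy corpus; a real experiment would use a large static corpus.
    corpus = [
        ["the", "car", "drove", "down", "the", "road"],
        ["the", "automobile", "drove", "down", "the", "street"],
        ["the", "cat", "sat", "on", "the", "mat"],
    ]
    vectors = ppmi_weight(context_vectors(corpus))
    print(cosine(vectors["car"], vectors["automobile"]))
    print(cosine(vectors["car"], vectors["cat"]))
```

A TOEFL-style synonym question could then be answered by computing `cosine` between the vector of the question word and the vector of each candidate and picking the highest-scoring candidate. Storing vectors as sparse dicts keeps the sketch dependency-free; a corpus-scale implementation would more likely use sparse matrices.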
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Dobó, A., Csirik, J. (2013). Computing Semantic Similarity Using Large Static Corpora. In: van Emde Boas, P., Groen, F.C.A., Italiano, G.F., Nawrocki, J., Sack, H. (eds) SOFSEM 2013: Theory and Practice of Computer Science. SOFSEM 2013. Lecture Notes in Computer Science, vol 7741. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-35843-2_42
DOI: https://doi.org/10.1007/978-3-642-35843-2_42
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-35842-5
Online ISBN: 978-3-642-35843-2
eBook Packages: Computer Science, Computer Science (R0)