Elsevier

Knowledge-Based Systems

Volume 50, September 2013, Pages 260-278
Computing semantic relatedness using Wikipedia features

https://doi.org/10.1016/j.knosys.2013.06.015

Abstract

Measuring semantic relatedness is a critical task in many domains such as psychology, biology, linguistics, cognitive science and artificial intelligence. In this paper, we propose a novel system for computing semantic relatedness between words. Recent approaches have exploited Wikipedia as a large semantic resource and have shown good performance. We therefore use Wikipedia features (articles, categories, the Wikipedia category graph and redirection) in a system that combines this semantic information in its different components. The approach is preceded by a pre-processing step that provides, for each category in the Wikipedia category graph, a semantic description vector containing the weights of the stems extracted from the articles assigned to that category. Next, for each candidate word, we collect its set of categories using an algorithm for category extraction from the Wikipedia category graph. Then, we compute the degree of semantic relatedness using existing vector similarity metrics (Dice, Overlap and Cosine) and a newly proposed metric that performs as well as the cosine formula. The basic system is followed by a set of modules that exploit Wikipedia features to quantify the semantic relatedness between words as accurately as possible. We evaluate our measure on two tasks: comparison with human judgments using five datasets, and a specific application, solving word choice problems. The resulting system shows good performance and sometimes outperforms the ESA (Explicit Semantic Analysis) and TSA (Temporal Semantic Analysis) approaches.

Introduction

Semantic Relatedness (SR) is a necessary pre-processing step for many Natural Language Processing (NLP) tasks, such as Word Sense Disambiguation (WSD) [21], [15]. Moreover, SR is one of the major issues in Information Retrieval (IR) [10], [2], [13], [56], [60], especially in tasks such as semantic indexing [51]. A powerful semantic relatedness measure can influence a Semantic Information Retrieval (SIR) system. Such measures are already used in some information retrieval systems that support retrieval through the Semantic Similarity Retrieval Model (SSRM) [47]. Research in semantic technologies has had a major impact by enabling and improving a wide range of web-based applications, such as search [8] and category discovery of videos [44].

Semantic relatedness measures typically use linguistic knowledge resources like WordNet [9], whose construction is very expensive and time-consuming. So far, the insufficient coverage of these linguistic resources has been a major impediment to using semantic relatedness measures in large-scale natural language processing applications. Rapidly growing, collaboratively constructed resources like Wikipedia have the potential to be used as a new kind of semantic resource because of their increasing size and significant coverage. Wikipedia has recently been widely recognized as an enabling knowledge base for a variety of intelligent systems and is useful for defining semantic relatedness [11], [25], [29], [35], [43], [50].

This work exploits Wikipedia features for measuring semantic relatedness between words. The rest of the paper is organized as follows. Section 2 gives a detailed overview of the state of the art in computing semantic relatedness using Wikipedia except the temporal semantic analysis which uses the archive of The New York Times spanning 150 years. Section 3 describes the semantic relatedness system that must be preceded by a pre-processing step to generate the category semantic depiction used to extract categories assigned to the candidate couple words. Section 4 presents the evaluating semantic relatedness measures and different benchmarks formed by human judgments. Section 5 details our system modules exploiting different Wikipedia features. In this section the theoretical aspect is presented in parallel with the experimental side to show the contribution of each enhancement. As for Section 6, it is a synthesis section to summarize the performance of the result system on the different datasets. Moreover, we discuss within the same section our system in comparison with the existing SR measures. Section 7 includes the evaluation of the effectiveness of our proposed method in solving the word choice problem task. Concluding remarks and some future directions of our work are described in Section 8.

Section snippets

Related works

Several studies have used Wikipedia as a semantic resource for computing the semantic relatedness between words or concepts. In this section, we present some of the main approaches.

Our semantic relatedness computing system

Our system is preceded by a pre-processing stage that provides a semantic depiction for each category. This stage is composed of several steps, which are explained in the next paragraph.
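As a rough illustration of what such a per-category depiction could look like, the sketch below builds a sparse vector of stem weights from the texts of the articles assigned to one category. The snippet does not give the paper's tokenisation, stemmer or weighting scheme, so everything here is an assumption: `toy_stem` is a hypothetical, deliberately crude suffix stripper standing in for a real stemmer, and the weight is a plain relative frequency.

```python
import re
from collections import Counter


def toy_stem(token):
    """Hypothetical stand-in for a real stemmer: strips a few English suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def category_vector(article_texts):
    """Map each stem to its relative frequency over a category's article texts.

    Assumption: relative frequency is used as the weight; the paper's actual
    weighting is not shown in this snippet. No stop-word removal is done.
    """
    counts = Counter()
    for text in article_texts:
        tokens = re.findall(r"[a-z]+", text.lower())
        counts.update(toy_stem(t) for t in tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {stem: n / total for stem, n in counts.items()}


# Toy input: two very short "articles" assigned to the same category.
vec = category_vector(["Minds and brains", "The mind"])
```

In this toy run, "minds" and "mind" collapse to the same stem, so that stem receives the largest weight in `vec`.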

Evaluating semantic relatedness measures

To evaluate semantic relatedness measures, we tested our approach against human judgments and on the task of solving word choice problems. Some researchers have used other methods for application-specific evaluation, e.g. word sense disambiguation [31] or malapropism detection [3]. In [14], Gurevych and Strube evaluated a set of WordNet-based semantic similarity measures for the task of dialog summarization, and did not find any significant differences in their performance. In [13], Gurevych et al.

Our semantic relatedness computing system modules using Wikipedia features

Fig. 6 shows the design of the Semantic Relatedness (SR) approach. To make our measure easier to understand, we have drawn the figure to give an overall explanation of the SR measure.

Our system is thus composed of five connected modules:

The basic system: It extracts the category sets Cat1 and Cat2 for the words w1 and w2, respectively. Next, for each pair (c1, c2) ∈ Cat1 × Cat2, we compute the similarity between the two vectors VC1θ and VC2θ using
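The pairwise comparison described in the basic system can be sketched as follows, using the three existing metrics named in the abstract (Dice, Overlap and Cosine). This is a minimal sketch under stated assumptions: vectors are represented as sparse dictionaries from stem to weight, the Dice and Overlap formulas are the common weighted-vector generalisations (the paper's exact formulations and its newly proposed metric are not shown in this snippet), and the names `pairwise_similarities` and `vectors` are hypothetical. How the pairwise scores are aggregated into a single relatedness value is not stated here, so the sketch stops at the table of pair scores.

```python
import math
from itertools import product


def dot(u, v):
    """Dot product of two sparse vectors stored as {stem: weight} dicts."""
    if len(u) > len(v):
        u, v = v, u  # iterate over the smaller dict
    return sum(w * v[k] for k, w in u.items() if k in v)


def norm_sq(u):
    """Squared Euclidean norm of a sparse vector."""
    return sum(w * w for w in u.values())


def cosine(u, v):
    denom = math.sqrt(norm_sq(u) * norm_sq(v))
    return dot(u, v) / denom if denom else 0.0


def dice(u, v):
    # Assumed weighted form: 2 * u.v / (|u|^2 + |v|^2)
    denom = norm_sq(u) + norm_sq(v)
    return 2 * dot(u, v) / denom if denom else 0.0


def overlap(u, v):
    # Assumed weighted form: u.v / min(|u|^2, |v|^2)
    denom = min(norm_sq(u), norm_sq(v))
    return dot(u, v) / denom if denom else 0.0


def pairwise_similarities(cat1, cat2, vectors, metric=cosine):
    """Score every pair (c1, c2) in Cat1 x Cat2 with the chosen metric."""
    return {
        (c1, c2): metric(vectors[c1], vectors[c2])
        for c1, c2 in product(cat1, cat2)
    }
```

Any of the three functions can be passed as `metric`; all three return 0.0 when either vector is empty, which avoids a division by zero for categories with no extracted stems.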

Application of the SR measure on other datasets

In order to validate the performance of our system, we decided to apply our measure to three other benchmarks (RG-65, MC-30 and YP-130) already mentioned in Section 4.2.

Presentation

This approach to the evaluation of semantic relatedness measures relies on word choice problems (Jarmasz and Szpakowicz [19]; Turney [46]; Zesch and Gurevych [54]). A word choice problem consists of a target word and four candidate words or phrases. The objective is to find the one that is most closely related to the target. An example problem is given below. There is always only one correct candidate, ‘(a)’ in this case:

psychology

(a) the mind
(b) nymphs
(c) horror movies
(d) evolving societies
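Given any relatedness function, a word choice problem of this kind can be solved by scoring each candidate against the target and picking the highest score. The sketch below illustrates this under stated assumptions: `solve_word_choice` and `toy_relatedness` are hypothetical names, the scores in the toy table are made-up placeholders rather than results from the paper, and returning `None` on a tie is an assumed policy, since the snippet does not say how ties are handled.

```python
def solve_word_choice(target, candidates, relatedness):
    """Return the label of the candidate most related to the target.

    candidates: dict mapping a label such as "a" to a word or phrase.
    relatedness: any function (target, candidate) -> float.
    Returns None if the top score is shared (assumed tie policy).
    """
    scores = {
        label: relatedness(target, text) for label, text in candidates.items()
    }
    best = max(scores.values())
    winners = [label for label, score in scores.items() if score == best]
    return winners[0] if len(winners) == 1 else None


# Placeholder scores for illustration only; not taken from the paper.
TOY_SCORES = {
    ("psychology", "the mind"): 0.9,
    ("psychology", "nymphs"): 0.1,
    ("psychology", "horror movies"): 0.2,
    ("psychology", "evolving societies"): 0.3,
}


def toy_relatedness(target, candidate):
    """Hypothetical stand-in for a real semantic relatedness measure."""
    return TOY_SCORES.get((target, candidate), 0.0)


candidates = {
    "a": "the mind",
    "b": "nymphs",
    "c": "horror movies",
    "d": "evolving societies",
}
answer = solve_word_choice("psychology", candidates, toy_relatedness)
```

With these placeholder scores, `answer` is the label `"a"`, matching the example problem above. In practice `toy_relatedness` would be replaced by the relatedness measure under evaluation.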

Conclusion and future work

This paper proposes a new semantic relatedness system based on the Wikipedia knowledge base. It is designed to exploit the WCG (Wikipedia category graph) as a high-level representation of concepts. Moreover, it provides for each category a CSD (Category Semantic Depiction) vector built from its assigned articles. These CSDs are used to compute semantic relatedness with our novel method for measuring similarity between vectors. This method gives an important weight to any common stem. It shows a competitive performance

References (61)

  • L.R. Dice, Measures of the amount of ecologic association between species, Ecology (1945)
  • O. Egozi, E. Gabrilovich, S. Markovitch, Concept-based feature generation and selection for information retrieval, in:...
  • O. Egozi et al., Concept-based information retrieval using explicit semantic analysis, ACM Transactions on Information Systems (2011)
  • L. Finkelstein et al., Placing search in context: the concept revisited, ACM Transactions on Information Systems (2002)
  • E. Gabrilovich et al., Computing semantic relatedness using Wikipedia-based explicit semantic analysis
  • I. Gurevych, Using the structure of a conceptual network in computing semantic relatedness
  • I. Gurevych, C. Müller, T. Zesch, What to be? – electronic career guidance based on semantic relatedness, in:...
  • I. Gurevych, M. Strube, Semantic similarity applied to spoken dialogue summarization, in: Proceedings of the 20th...
  • X. Han, J. Zhao, Structural semantic relatedness: a knowledge based method to named entity disambiguation, in: ACL ‘10:...
  • T.H. Haveliwala, Topic-sensitive pagerank: a context-sensitive ranking algorithm for web search, IEEE Transactions on Knowledge and Data Engineering (2003)
  • T. Hughes, D. Ramage, Lexical semantic relatedness with random graph walks, in: EMNLP-CoNLL, 2007, pp....
  • P. Jaccard, Distribution de la flore alpine dans le bassin des Dranses et dans quelques régions voisines,...
  • M. Jarmasz, S. Szpakowicz, Roget’s Thesaurus and Semantic Similarity, RANLP, 2003, pp....
  • J.J. Jiang, D.W. Conrath, Semantic similarity based on corpus statistics and lexical taxonomy, in: International...
  • C. Leacock et al., Combining Local Context and WordNet Similarity for Word Sense Identification (1998)
  • M. Lesk, Automatic sense disambiguation using machine readable dictionaries: how to tell a pine cone from an ice cream...
  • D. Lin, An information-theoretic definition of similarity
  • G.A. Miller et al., Contextual correlates of semantic similarity, Language and Cognitive Processes (1991)
  • D. Milne, Computing semantic relatedness using Wikipedia link structure, in: Proc. of NZCSRSC’07,...