Computing semantic relatedness using Wikipedia features
Introduction
Semantic Relatedness (SR) is a necessary pre-processing step for many Natural Language Processing (NLP) tasks, such as Word Sense Disambiguation (WSD) [21], [15]. Moreover, SR is one of the major issues in Information Retrieval (IR) [10], [2], [13], [56], [60], especially in tasks such as semantic indexing [51]. A powerful semantic relatedness measure can influence a Semantic Information Retrieval (SIR) system. Such measures are already used in some information retrieval systems that support retrieval through the Semantic Similarity Retrieval Model (SSRM) [47]. Research in semantic technologies has had a major impact by enabling and improving a wide range of web-based applications, such as search [8] and category discovery of videos [44].
Semantic relatedness measures typically use linguistic knowledge resources such as WordNet [9], whose construction is very expensive and time-consuming. So far, the insufficient coverage of these linguistic resources has been a major impediment to using semantic relatedness measures in large-scale natural language processing applications. Rapidly growing, collaboratively constructed resources like Wikipedia have the potential to be used as a new kind of semantic resource due to their increasing size and significant coverage. Wikipedia has recently been widely recognized as an enabling knowledge base for a variety of intelligent systems, and it is useful for defining semantic relatedness [11], [25], [29], [35], [43], [50].
This work exploits Wikipedia features for measuring semantic relatedness between words. The rest of the paper is organized as follows. Section 2 gives a detailed overview of the state of the art in computing semantic relatedness using Wikipedia; the one exception is temporal semantic analysis, which uses the archive of The New York Times spanning 150 years. Section 3 describes the semantic relatedness system, which must be preceded by a pre-processing step that generates the category semantic depiction used to extract the categories assigned to the candidate word pair. Section 4 presents the evaluation of semantic relatedness measures and the different benchmarks built from human judgments. Section 5 details our system modules, which exploit different Wikipedia features. In that section, the theoretical aspect is presented in parallel with the experimental side to show the contribution of each enhancement. Section 6 is a synthesis that summarizes the performance of the resulting system on the different datasets; it also compares our system with existing SR measures. Section 7 evaluates the effectiveness of our proposed method on the word choice problem task. Concluding remarks and some future directions of our work are given in Section 8.
Related works
Several studies have used Wikipedia as a semantic resource for computing the semantic relatedness between words or concepts. In this section, we present the main approaches.
Our semantic relatedness computing system
Our system is preceded by a pre-processing phase that provides a semantic depiction for each category. This phase is composed of several steps, which are explained in the next paragraph.
Evaluating semantic relatedness measures
To evaluate semantic relatedness measures, we tested our approach against human judgments and on the word choice problem task. Other researchers have used application-specific evaluation, e.g. word sense disambiguation [31] or malapropism detection [3]. In [14], Gurevych and Strube evaluated a set of WordNet-based semantic similarity measures on the task of dialog summarization and did not find any significant differences in their performance. In [13], Gurevych et al.
Our semantic relatedness computing system modules using Wikipedia features
Fig. 6 shows the design of the Semantic Relatedness (SR) approach. To make our measure easier to understand, we drew the figure to give an overall explanation of the SR measure.
As the figure shows, our system is necessarily composed of five connected modules:
The basic system: It extracts the category sets Cat1 and Cat2 for the words w1 and w2, respectively. Next, for each pair (c1, c2) ∈ Cat1 × Cat2, we compute the similarity between the two vectors and using
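The pairing step of the basic system can be sketched as follows. This is a minimal illustration, not the authors' implementation: the text is cut off before it names the vector similarity, so cosine similarity is used here as a stand-in, and keeping the maximum over all category pairs is likewise an assumption. The names `cosine`, `basic_relatedness`, and `category_vectors`, as well as the toy stems and weights, are hypothetical.

```python
import math
from itertools import product


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (stand-in measure)."""
    dot = sum(weight * v[term] for term, weight in u.items() if term in v)
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def basic_relatedness(cat1, cat2, category_vectors, similarity=cosine):
    """Score every pair (c1, c2) in Cat1 x Cat2 and keep the best score (assumed aggregation)."""
    scores = [
        similarity(category_vectors[c1], category_vectors[c2])
        for c1, c2 in product(cat1, cat2)
    ]
    return max(scores, default=0.0)


# Toy category vectors: made-up stems and weights, for illustration only.
category_vectors = {
    "Felines": {"cat": 2.0, "anim": 1.0},
    "Pets": {"cat": 1.0, "dog": 1.0, "anim": 1.0},
    "Vehicles": {"car": 2.0, "engin": 1.0},
}

score_related = basic_relatedness(["Felines"], ["Pets", "Vehicles"], category_vectors)
score_unrelated = basic_relatedness(["Felines"], ["Vehicles"], category_vectors)
```

Passing the similarity as a parameter keeps the pairing logic separate from the vector measure, so a different measure can be swapped in without touching the loop over Cat1 × Cat2.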
Application of the SR measure on other datasets
To validate the performance of our system, we applied our measure to three other benchmarks (RG-65, MC-30 and YP-130), already cited in Section 4.2.
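Benchmarks built from human judgments are commonly scored by correlating the system's relatedness values with the human ratings for the same word pairs. The sketch below assumes Spearman rank correlation as the statistic; the text above does not state which correlation is used, and the ratings and scores are made-up toy values, not benchmark data. The function names are hypothetical.

```python
import math


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Toy data: one human rating and one system score per word pair (made up).
human_ratings = [3.9, 0.4, 2.1, 1.0]
system_scores = [0.8, 0.1, 0.3, 0.5]
rho = spearman(human_ratings, system_scores)
```

Because only the ranks matter, the system scores and the human ratings do not need to be on the same scale.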
Presentation
This approach to the evaluation of semantic relatedness measures relies on word choice problems (Jarmasz and Szpakowicz [19]; Turney [46]; Zesch and Gurevych [54]). A word choice problem consists of a target word and four candidate words or phrases. The objective is to find the one that is most closely related to the target. An example problem is given below. There is always only one correct candidate, ‘(a)’ in this case:
psychology
(a) the mind
(b) nymphs
(c) horror movies
(d) evolving societies
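Solving a word choice problem with a relatedness measure amounts to scoring each candidate against the target and picking the highest score. The sketch below uses the example problem above; `solve_word_choice` and `toy_relatedness` are hypothetical names, and the scores in `TOY_SCORES` are invented placeholders standing in for a real SR measure.

```python
def solve_word_choice(target, candidates, relatedness):
    """Return the label of the candidate that is most related to the target."""
    return max(candidates, key=lambda label: relatedness(target, candidates[label]))


# Made-up relatedness scores, used only to make the example runnable.
TOY_SCORES = {
    ("psychology", "the mind"): 0.9,
    ("psychology", "nymphs"): 0.1,
    ("psychology", "horror movies"): 0.3,
    ("psychology", "evolving societies"): 0.4,
}


def toy_relatedness(word, phrase):
    """Look up a placeholder score; unknown pairs count as unrelated."""
    return TOY_SCORES.get((word, phrase), 0.0)


candidates = {
    "a": "the mind",
    "b": "nymphs",
    "c": "horror movies",
    "d": "evolving societies",
}
answer = solve_word_choice("psychology", candidates, toy_relatedness)
```

A measure is then evaluated by the share of problems for which the selected label matches the single correct candidate.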
Conclusion and future work
This paper proposes a new semantic relatedness system based on the Wikipedia knowledge base. It is designed to exploit the WCG as a high-level concept representation. Moreover, it provides each category with a CSD (Category Semantic Depiction) vector built from its assigned articles. These CSDs are used to compute semantic relatedness with our novel method for measuring similarity between vectors, which gives significant weight to any common stem. It shows a competitive performance
References (61)
- et al., Term-weighting approaches in automatic text retrieval, Information Processing & Management (1988)
- Concept similarity in formal concept analysis: an information content approach, Knowledge-Based Systems (2008)
- A semantic similarity metric combining features and intrinsic information content, Data & Knowledge Engineering (2009)
- et al., Ontology-based information content computation, Knowledge-Based Systems (2011)
- et al., A knowledge-based question answering system for B2C ecommerce, Knowledge-Based Systems (2008)
- E. Agirre, A. Soroa, Personalizing pagerank for word sense disambiguation, in: Proceedings of the 12th Conference of...
- et al., Evaluating a conceptual indexing method by utilizing wordnet
- et al., Evaluating wordnet-based measures of semantic distance, Computational Linguistics (2006)
- P. Cimiano, A. Schultz, S. Sizov, P. Sorg, S. Staab, Explicit versus latent concept models for cross-language...
- et al., Indexing by latent semantic analysis, Journal of the American Society of Information Science (1990)
- Measures of the amount of ecologic association between species, Ecology
- Concept-based information retrieval using explicit semantic analysis, ACM Transactions on Information Systems
- Placing search in context: the concept revisited, ACM Transactions on Information Systems
- Computing semantic relatedness using Wikipedia-based explicit semantic analysis
- Using the structure of a conceptual network in computing semantic relatedness
- Topic-sensitive pagerank: a context-sensitive ranking algorithm for web search, IEEE Transactions on Knowledge and Data Engineering
- Combining Local Context and WordNet Similarity for Word Sense Identification
- An information-theoretic definition of similarity
- Contextual correlates of semantic similarity, Language and Cognitive Processes