Wikipedia-based information content and semantic similarity computation

https://doi.org/10.1016/j.ipm.2016.09.001

Highlights

  • Novel methods for Information Content (IC) computation are proposed.

  • The presented IC computation methods focus on concepts drawn from the Wikipedia category structure.

  • Several approaches to semantic similarity measurement for concepts are provided.

Abstract

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It enables a better understanding of a concept's semantics. In the past, several approaches to computing the IC of a concept have been proposed. However, existing methods suffer from limitations such as reliance on corpus availability, manual tagging, or predefined ontologies, and a fit restricted to non-dynamic domains. Wikipedia provides a very large, domain-independent encyclopedic repository and semantic network for computing the IC of concepts, with broader coverage than usual ontologies. In this paper, we propose novel methods for computing the IC of a concept that address the shortcomings of existing approaches. The presented methods focus on the IC computation of a concept (i.e., a Wikipedia category) drawn from the Wikipedia category structure. We also propose several new IC-based measures to compute the semantic similarity between concepts. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, supports our intuitions with respect to human judgments. Overall, several methods proposed in this paper correlate well with human judgments and constitute effective ways of determining IC values for concepts and the semantic similarity between concepts.

Introduction

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It states the amount of information provided by the concept when it appears in a context. The basic idea is that general and abstract entities carry less IC when found in a discourse than more concrete and specialized ones. A proper quantification of the IC of concepts improves text understanding by enabling assessment of the degree of semantic generality or concreteness of the words referring to those concepts (Sanchez, Batet, & Isern, 2011). Informally, IC is defined as a measure of the informativeness of concepts and is computed by counting the occurrences of words in large corpora (Pirro, 2009). That is, IC measures the amount of information provided by a given term based on its probability of appearance in a corpus. Up to the present, research on IC has been very active, and many results have been achieved in both theoretical and application aspects. In particular, IC has been applied to the computation of semantic similarity, which acts as a fundamental principle by which humans organize and classify objects (Batet et al., 2011, Buggenhout and Ceusters, 2005, Formica, 2008, Jiang and Conrath, 1997, Lin, 1998, Pirro, 2009, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013, Sanchez and Batet, 2011, Sanchez et al., 2011).
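
For orientation, the classical corpus-based formulation (Resnik, 1995) restates this informally described quantity as the negative log of the concept's probability of occurrence:

```latex
% Classical corpus-based IC (Resnik, 1995): p(c) is the probability of
% encountering an instance of concept c in the corpus.
IC(c) = -\log p(c)
```

Under this formulation, frequent (general) concepts yield low IC, while rare (specific) concepts yield high IC.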

Making judgments about the semantic similarity of different concepts is a routine yet deceptively complex task. To perform it, people draw on an immense amount of background knowledge about the concepts. As a result, any attempt to compute semantic similarity automatically must also consult external sources of knowledge. Usually, these resources are search engines (Bollegala et al., 2007, Cilibrasi and Vitanyi, 2007, Martinez-Gil and Aldana-Montes, 2013), topical directories such as the Open Directory Project (Maguitman, Menczer, Erdinc, Roinestad, & Vespignani, 2006), well-defined semantic networks such as WordNet (Ahsaee et al., 2014, Liu et al., 2012), or more domain-dependent ontologies such as the Gene Ontology (Couto et al., 2007, Mathur and Dinakarpandian, 2012) and the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011). In fact, several works on semantic similarity measures using external sources of knowledge have been developed in the past years. According to the concrete knowledge sources exploited and the way in which they are used, different families of methods can be identified (Sanchez and Batet, 2013, Sanchez et al., 2011). These families are (Martinez-Gil, 2014, Petrakis et al., 2006): (1) edge-counting measures, which take into account the length of the path linking the concepts (or terms) and their position in a given dictionary (or taxonomy, ontology) (Li, Bandar, & McLean, 2003); (2) feature-based measures, which assess the similarity between concepts (or terms) as a function of their properties or their relationships to other similar concepts (or terms) (Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012); (3) information content measures, which assess the difference in the information content of two concepts (or terms) as a function of their probability of occurrence in a text corpus (or an ontology) (Buggenhout and Ceusters, 2005, Lin, 1998, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013, Sanchez et al., 2010); (4) hybrid measures, which combine all of the above (Batet et al., 2013, Pirro, 2009).
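
To make the first family concrete, here is a minimal sketch of an edge-counting (Rada-style) measure on a toy taxonomy; the taxonomy and concept names are our illustrative assumptions, not an ontology used in the paper:

```python
# Minimal sketch of an edge-counting (Rada-style) similarity measure.
# The toy taxonomy and concept names below are illustrative assumptions,
# not an ontology used in the paper.
import networkx as nx

# A small is-a hierarchy: entity > object > animal > {dog, cat}.
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("entity", "object"),
    ("object", "animal"),
    ("animal", "dog"),
    ("animal", "cat"),
])

def path_similarity(c1: str, c2: str) -> float:
    """Shorter path between two concepts => higher similarity."""
    length = nx.shortest_path_length(taxonomy, c1, c2)
    return 1.0 / (1.0 + length)

print(path_similarity("dog", "cat"))  # path dog-animal-cat has length 2 => 1/3
```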

Information theoretic approaches (i.e., IC-based approaches) assess the similarity between two concepts as a function of the IC that both concepts have in common in a given ontology. In the past, IC was typically computed from concept distributions in tagged textual corpora (Jiang and Conrath, 1997, Lin, 1998, Resnik, 1995). However, this introduces a dependency on corpus availability and manual tagging that hampers accuracy and applicability due to data sparseness (Sanchez et al., 2010). To overcome this problem, in recent years several researchers have proposed ways to infer the IC of concepts intrinsically from the knowledge structure modeled in an ontology (Sanchez and Batet, 2013, Sanchez and Batet, 2011, Sanchez et al., 2011). From a domain-independent point of view, these approaches provide accurate results when relying on large and general-purpose knowledge sources such as the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011) and tagged corpora such as SemCor (Fellbaum, 1998a, b) or the Brown Corpus (Francis & Kucera, 1982). However, these ontology-based IC computation methods still have limitations. That intrinsic IC-based measures rely only on ontological knowledge is a drawback, because they depend entirely on the degree of coverage and detail of the single input ontology (Sanchez & Batet, 2013). In particular, with the emergence of social networks and instant messaging systems (Martinez-Gil and Aldana-Montes, 2013, Retzer et al., 2012), many (sets of) concepts or terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) are not included in domain ontologies. Therefore, IC computation based on these kinds of knowledge resources (i.e., domain ontologies) cannot be used in such tasks. On the other hand, ontology-based IC computation presupposes the existence of one or more predefined domain ontologies. Such ontologies are established by expert panels in the given domains. Clearly, the construction of these domain ontologies is time-consuming and error-prone, and maintaining them also requires considerable expert effort. Thus, ontology-based IC computation methods are also limited in scope and scalability. These limitations motivate the new techniques presented in this paper, which compute the IC of a concept from a new kind of information source: a wide-coverage online encyclopedia, namely Wikipedia (Hovy et al., 2013, Medelyan et al., 2009). Wikipedia was launched in 2001 with the goal of building a free encyclopedia in all languages. Today it is the largest, most widely used, and fastest growing encyclopedia in existence (Medelyan et al., 2009).
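
As a concrete reference point for the intrinsic approaches mentioned above, one well-known formulation (due to Seco et al., stated here for orientation; the works cited above refine this idea) derives IC from the number of hyponyms a concept subsumes:

```latex
% One well-known intrinsic IC formulation (Seco et al.), where hypo(c) is
% the number of hyponyms of concept c and max_nodes is the total number of
% concepts in the taxonomy.
IC_{\mathrm{intrinsic}}(c) \;=\; 1 - \frac{\log\bigl(\mathrm{hypo}(c) + 1\bigr)}{\log(\mathit{max\_nodes})}
```

Under this formulation, leaf concepts obtain the maximal IC of 1 and the taxonomy root obtains 0, without requiring any corpus statistics.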

The purpose of this paper is to present several new methods for computing the IC of a concept and the similarity between two concepts, addressing the shortcomings of existing approaches to both tasks. The paper focuses on the IC computation of a concept and the similarity computation between two concepts drawn from the Wikipedia category structure. In other words, we approach the problems of IC computation and concept similarity computation from a novel perspective by making use of Wikipedia; accordingly, the terms Wikipedia category, category, concept, and word are used interchangeably in this paper. We utilize the Wikipedia category structure as the knowledge source. The Wikipedia category structure is a very complex network: compared with traditional taxonomy structures, it is a graph. Faced with such a complex structure, how do we compute the semantic similarity between concepts (categories)? Since the Wikipedia category structure is a graph, we may naturally assess the semantic similarity between concepts by extending traditional information theoretic (i.e., IC-based) approaches. The first thing we need to do is compute the IC value of a concept (category) in a graph. Because the Wikipedia category structure is so complex, we cannot know in advance which IC computation method is most appropriate; therefore, guided by the characteristics of the category structure, we present several IC computation approaches that extend traditional methods. Based on these IC computation approaches, we then need an approach for semantic similarity computation. How should it be defined? We may generalize existing approaches to a similarity measure for Wikipedia categories. Note that in traditional IC-based methods, the key issue is to find the LCS (Least Common Subsumer) of two concepts. However, since the Wikipedia category structure is a graph, we need to extend the LCS to a GCS (Good Common Subsumer). Because we do not know which underlying similarity measure is suitable for GCS-based similarity computation in the Wikipedia category structure, we present several similarity computation methods that extend existing underlying similarity measures. Each measure is based on a different IC computation method or underlying similarity measure. Thus, the hypothesis we want to test is which IC computation method and which underlying similarity measure are suitable for the similarity computation of concepts in the Wikipedia category structure.
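
As a rough illustration of the LCS-to-GCS generalization sketched above, the following assumes (our assumption, for illustration) that the category structure is a directed acyclic graph and that a "good" common subsumer can be taken as the common ancestor with maximal IC; the paper's precise GCS definition may differ:

```python
# Hedged sketch of generalizing the Least Common Subsumer (LCS) of a tree
# to a "Good Common Subsumer" (GCS) in a graph-shaped category structure.
# Assumption (ours, for illustration): the GCS is the common ancestor with
# maximal IC; the paper's precise GCS definition may differ.
from typing import Callable, Dict, List, Optional, Set

def ancestors(concept: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of a concept in a DAG, given a child -> parents map."""
    seen: Set[str] = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def good_common_subsumer(c1: str, c2: str,
                         parents: Dict[str, List[str]],
                         ic: Callable[[str], float]) -> Optional[str]:
    """Among all common ancestors, choose the most informative (highest IC)."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max(common, key=ic, default=None)
```

In a tree, this reduces to the usual LCS; in a graph, where two categories may share many ancestors along different paths, the IC criterion selects a single representative subsumer.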

The remainder of this paper is organized as follows. In the next section, we briefly review some background on IC computation and Wikipedia. Section 3 presents some novel methods for Wikipedia-based IC computation in detail. In Section 4, we describe our approaches to semantic similarity measurement using Wikipedia. Section 5 presents the evaluation of our approaches in detail. Finally, in Section 6, we draw our conclusions and present some perspectives for future research.

Section snippets

Background

For completeness of presentation and convenience of subsequent discussions, in the current section we will briefly recall some basic notions of IC computation and Wikipedia. See especially (Lin, 1998, Medelyan et al., 2009, Ponzetto and Strube, 2007, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013) for further details.

Novel methods for Wikipedia-based IC computation

To solve the shortcomings of existing approaches for IC computation (see Section 1 for details), in this section we propose some novel methods for IC computation using Wikipedia.

The basic unit of information in Wikipedia is the article. Each article describes a single concept, and there is a single article for each concept. For example, Fig. 1 shows a typical article in Wikipedia, entitled Big data. Generally speaking, articles are assigned one or more categories in Wikipedia. For example,
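
The snippet cuts off before the paper's formulas; as a hedged illustration, one plausible Wikipedia-based analogue of intrinsic IC treats the number of categories a category subsumes as a specificity signal (the category names and formula variant below are our assumptions, not necessarily those of the paper):

```python
# Hedged illustration only: a plausible Wikipedia-based analogue of
# intrinsic IC that treats a category's descendant count as a specificity
# signal. Category names and the formula variant are our assumptions,
# not necessarily those of the paper.
import math
from typing import Dict, List, Set

subcats: Dict[str, List[str]] = {  # parent category -> subcategories
    "Science": ["Computer science", "Physics"],
    "Computer science": ["Big data", "Algorithms"],
}

def descendants(cat: str) -> Set[str]:
    """All subcategories reachable below a category."""
    seen: Set[str] = set()
    stack = list(subcats.get(cat, []))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subcats.get(c, []))
    return seen

def wikipedia_ic(cat: str, total_categories: int) -> float:
    """Fewer descendants => more specific category => higher IC."""
    return 1.0 - math.log(len(descendants(cat)) + 1) / math.log(total_categories)

print(wikipedia_ic("Big data", total_categories=5))  # leaf category => IC 1.0
```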

Semantic similarity measurement for concepts

In this section we propose some approaches to semantic similarity measurement for concepts (i.e., Wikipedia categories). That is, we give similarity measures based on the IC computation methods provided in Section 3.

Regarding IC-based similarity measurement, the key point when comparing a pair of concepts is to retrieve their LCS (Least Common Subsumer). In a taxonomy, the LCS is the most specific taxonomical ancestor common to both concepts. The more specific the subsumer is (the higher its IC), the more
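
For reference, the classical IC-based measures from the cited literature (Resnik, 1995; Lin, 1998; Jiang & Conrath, 1997), which a graph setting can generalize by substituting the GCS for the LCS, are:

```latex
% Classical IC-based measures (Resnik, 1995; Lin, 1998; Jiang & Conrath,
% 1997); a graph setting can generalize these by substituting GCS for LCS.
\begin{align*}
  \mathrm{sim}_{\mathrm{res}}(c_1, c_2) &= IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr) \\
  \mathrm{sim}_{\mathrm{lin}}(c_1, c_2) &= \frac{2 \cdot IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr)}{IC(c_1) + IC(c_2)} \\
  \mathrm{dist}_{\mathrm{jc}}(c_1, c_2) &= IC(c_1) + IC(c_2) - 2 \cdot IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr)
\end{align*}
```

Lin's measure is bounded in [0, 1], which makes it particularly convenient for correlation-based evaluation against human ratings.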

Evaluation

In this section we discuss the evaluation of our IC computation methods and similarity measures for concepts. Section 5.1 introduces the benchmarks. Section 5.2 gives our experimental results. Lastly, in Section 5.3, we discuss and analyze these results.
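
Although the snippet does not show the protocol in full, evaluation in this literature standardly correlates a measure's scores with human similarity ratings on a benchmark; a minimal sketch (the rating values below are placeholders, not the paper's data):

```python
# Minimal sketch of the standard evaluation methodology: correlate the
# measure's scores with human similarity ratings over a benchmark.
# The rating values below are placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

human_ratings = [3.92, 3.84, 0.42, 1.18]   # e.g., averaged judge scores
machine_scores = [0.91, 0.88, 0.10, 0.25]  # similarity measure outputs

r, _ = pearsonr(human_ratings, machine_scores)
rho, _ = spearmanr(human_ratings, machine_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```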

Conclusion

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It states the amount of information provided by the concept when it appears in a context. Accurate quantification of the IC of concepts permits the estimation of their semantic similarity as a function of their shared information. In the past, several approaches to computing the IC of a concept have been proposed. However, existing approaches suffer from limitations such as reliance on corpus availability,

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments, as well as Professors Jim Jansen, Fabio Crestani, and Hideo Joho for helpful suggestions that greatly improved the exposition of the paper. The work described in this paper is supported by the National Natural Science Foundation of China under Grant No. 61272066; the Program for New Century Excellent Talents in University in China under Grant No. NCET-12-0644; the Natural Science Foundation of Guangdong

References (49)

  • G.A. Miller et al., Contextual correlates of semantic similarity, Language and Cognitive Processes (1991)

  • T. Pedersen et al., Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics (2007)

  • E.G.M. Petrakis et al., X-Similarity: Computing semantic similarity between concepts from different ontologies, Journal of Digital Information Management (2006)

  • S.P. Ponzetto et al., Knowledge derived from Wikipedia for computing semantic relatedness, Journal of Artificial Intelligence Research (2007)

  • D. Sanchez et al., Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective, Journal of Biomedical Informatics (2011)

  • D. Sanchez et al., Ontology-based information content computation, Knowledge-Based Systems (2011)

  • D. Sanchez et al., Ontology-based semantic similarity: A new feature-based approach, Expert Systems with Applications (2012)

  • D. Sanchez et al., Ontology-driven web-based semantic similarity, Journal of Intelligent Information Systems (2010)

  • M. Yazdani et al., Computing text semantic relatedness using the contents and links of a hypertext encyclopedia, Artificial Intelligence (2013)

  • Z. Zhou et al., A new model of information content for semantic similarity in WordNet

  • M.G. Ahsaee et al., Semantic similarity assessment of words using weighted WordNet, International Journal of Machine Learning and Cybernetics (2014)

  • M. Batet et al., Semantic similarity estimation from multiple ontologies, Applied Intelligence (2013)

  • A. Blank, Words and concepts in time: Towards diachronic cognitive onomasiology

  • D. Bollegala et al., Measuring semantic similarity between words using web search engines