Wikipedia-based information content and semantic similarity computation

https://doi.org/10.1016/j.ipm.2016.09.001

Highlights

  • Novel methods for Information Content (IC) computation are proposed.

  • The presented IC computation methods focus on concepts drawn from the Wikipedia category structure.

  • Several approaches to semantic similarity measurement for concepts are provided.

Abstract

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It enables a better understanding of a concept's semantics. In the past, several approaches to computing the IC of a concept have been proposed. However, existing methods suffer from limitations such as reliance on corpus availability, manual tagging, or predefined ontologies, and a fit restricted to non-dynamic domains. Wikipedia provides a very large, domain-independent encyclopedic repository and semantic network for computing the IC of concepts, with broader coverage than usual ontologies. In this paper, we propose novel methods for computing the IC of a concept that address the shortcomings of existing approaches. The presented methods focus on the IC computation of a concept (i.e., a Wikipedia category) drawn from the Wikipedia category structure. We also propose several new IC-based measures to compute the semantic similarity between concepts. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, supports our intuitions with respect to human judgments. Overall, several methods proposed in this paper correlate well with human judgments and constitute effective ways of determining IC values for concepts and the semantic similarity between concepts.

Introduction

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It states the amount of information provided by the concept when it appears in a context. The basic idea is that general and abstract entities carry less IC when found in a discourse than more concrete and specialized ones. A proper quantification of the IC of concepts improves text understanding by enabling assessment of the degree of semantic generality or concreteness of the words referring to those concepts (Sanchez, Batet, & Isern, 2011). Informally, IC is defined as a measure of the informativeness of concepts and is computed by counting the occurrences of words in large corpora (Pirro, 2009). That is, IC measures the amount of information provided by a given term based on its probability of appearance in a corpus. Up to the present, research on IC has been very active, and many results have been achieved in both theoretical and application aspects. In particular, IC has been applied to the computation of semantic similarity, which acts as a fundamental principle by which humans organize and classify objects (Batet et al., 2011, Buggenhout and Ceusters, 2005, Formica, 2008, Jiang and Conrath, 1997, Lin, 1998, Pirro, 2009, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013, Sanchez and Batet, 2011, Sanchez et al., 2011).
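
For orientation, the classical corpus-based formulation (Resnik, 1995) restates this informally described quantity as the negative log of the concept's probability of occurrence:

```latex
% Classical corpus-based IC (Resnik, 1995): p(c) is the probability of
% encountering an instance of concept c in the corpus.
IC(c) = -\log p(c)
```

Under this formulation, frequent (general) concepts yield low IC, while rare (specific) concepts yield high IC.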

Making judgments about the semantic similarity of different concepts is a routine yet deceptively complex task. To perform it, people draw on an immense amount of background knowledge about the concepts. As a result, any attempt to compute semantic similarity automatically must also consult external sources of knowledge. Usually, these resources are search engines (Bollegala et al., 2007, Cilibrasi and Vitanyi, 2007, Martinez-Gil and Aldana-Montes, 2013), topical directories such as the Open Directory Project (Maguitman, Menczer, Erdinc, Roinestad, & Vespignani, 2006), well-defined semantic networks such as WordNet (Ahsaee et al., 2014, Liu et al., 2012), or more domain-dependent ontologies such as the Gene Ontology (Couto et al., 2007, Mathur and Dinakarpandian, 2012) and the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011). In fact, several works on semantic similarity measures using external sources of knowledge have been developed in the past years. According to the concrete knowledge sources exploited and the way in which they are used, different families of methods can be identified (Sanchez and Batet, 2013, Sanchez et al., 2011). These families are (Martinez-Gil, 2014, Petrakis et al., 2006): (1) edge-counting measures, which take into account the length of the path linking the concepts (or terms) and their position in a given dictionary (or taxonomy, ontology) (Li, Bandar, & McLean, 2003); (2) feature-based measures, which assess the similarity between concepts (or terms) as a function of their properties or their relationships to other similar concepts (or terms) (Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012); (3) information content measures, which assess the difference in the information content of two concepts (or terms) as a function of their probability of occurrence in a text corpus (or an ontology) (Buggenhout and Ceusters, 2005, Lin, 1998, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013, Sanchez et al., 2010); (4) hybrid measures, which combine all of the above (Batet et al., 2013, Pirro, 2009).
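
To make the first family concrete, here is a minimal sketch of an edge-counting (Rada-style) measure on a toy taxonomy; the taxonomy and concept names are our illustrative assumptions, not an ontology used in the paper:

```python
# Minimal sketch of an edge-counting (Rada-style) similarity measure.
# The toy taxonomy and concept names below are illustrative assumptions,
# not an ontology used in the paper.
import networkx as nx

# A small is-a hierarchy: entity > object > animal > {dog, cat}.
taxonomy = nx.Graph()
taxonomy.add_edges_from([
    ("entity", "object"),
    ("object", "animal"),
    ("animal", "dog"),
    ("animal", "cat"),
])

def path_similarity(c1: str, c2: str) -> float:
    """Shorter path between two concepts => higher similarity."""
    length = nx.shortest_path_length(taxonomy, c1, c2)
    return 1.0 / (1.0 + length)

print(path_similarity("dog", "cat"))  # path dog-animal-cat has length 2 => 1/3
```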

Information theoretic approaches (i.e., IC-based approaches) assess the similarity between two concepts as a function of the IC that both concepts have in common in a given ontology. In the past, IC was typically computed from concept distributions in tagged textual corpora (Jiang and Conrath, 1997, Lin, 1998, Resnik, 1995). However, this introduces a dependency on corpus availability and manual tagging that hampers accuracy and applicability due to data sparseness (Sanchez et al., 2010). To overcome this problem, in recent years several researchers have proposed ways to infer the IC of concepts intrinsically from the knowledge structure modeled in an ontology (Sanchez and Batet, 2013, Sanchez and Batet, 2011, Sanchez et al., 2011). From a domain-independent point of view, these approaches provide accurate results when relying on large and general-purpose knowledge sources such as the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011) and tagged corpora such as SemCor (Fellbaum, 1998a, b) or the Brown Corpus (Francis & Kucera, 1982). However, these ontology-based IC computation methods still have limitations. That intrinsic IC-based measures rely only on ontological knowledge is a drawback, because they depend entirely on the degree of coverage and detail of the single input ontology (Sanchez & Batet, 2013). In particular, with the emergence of social networks and instant messaging systems (Martinez-Gil and Aldana-Montes, 2013, Retzer et al., 2012), many (sets of) concepts or terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) are not included in domain ontologies. Therefore, IC computation based on these kinds of knowledge resources (i.e., domain ontologies) cannot be used in such tasks. On the other hand, ontology-based IC computation presupposes the existence of one or more predefined domain ontologies. Such ontologies are established by expert panels in the given domains. Clearly, the construction of these domain ontologies is time-consuming and error-prone, and maintaining them also requires considerable expert effort. Thus, ontology-based IC computation methods are also limited in scope and scalability. These limitations motivate the new techniques presented in this paper, which compute the IC of a concept from a new kind of information source: a wide-coverage online encyclopedia, namely Wikipedia (Hovy et al., 2013, Medelyan et al., 2009). Wikipedia was launched in 2001 with the goal of building a free encyclopedia in all languages. Today it is the largest, most widely used, and fastest growing encyclopedia in existence (Medelyan et al., 2009).
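
As a concrete reference point for the intrinsic approaches mentioned above, one well-known formulation (due to Seco et al., stated here for orientation; the works cited above refine this idea) derives IC from the number of hyponyms a concept subsumes:

```latex
% One well-known intrinsic IC formulation (Seco et al.), where hypo(c) is
% the number of hyponyms of concept c and max_nodes is the total number of
% concepts in the taxonomy.
IC_{\mathrm{intrinsic}}(c) \;=\; 1 - \frac{\log\bigl(\mathrm{hypo}(c) + 1\bigr)}{\log(\mathit{max\_nodes})}
```

Under this formulation, leaf concepts obtain the maximal IC of 1 and the taxonomy root obtains 0, without requiring any corpus statistics.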

The purpose of this paper is to present several new methods for computing the IC of a concept and the similarity between two concepts, addressing the shortcomings of existing approaches to both tasks. The paper focuses on the IC computation of a concept and the similarity computation between two concepts drawn from the Wikipedia category structure. In other words, we approach the problems of IC computation and concept similarity computation from a novel perspective by making use of Wikipedia; accordingly, the terms Wikipedia category, category, concept, and word are used interchangeably in this paper. We utilize the Wikipedia category structure as the knowledge source. The Wikipedia category structure is a very complex network: compared with traditional taxonomy structures, it is a graph. Faced with such a complex structure, how do we compute the semantic similarity between concepts (categories)? Since the Wikipedia category structure is a graph, we may naturally assess the semantic similarity between concepts by extending traditional information theoretic (i.e., IC-based) approaches. The first thing we need to do is compute the IC value of a concept (category) in a graph. Because the Wikipedia category structure is so complex, we cannot know in advance which IC computation method is most appropriate; therefore, guided by the characteristics of the category structure, we present several IC computation approaches that extend traditional methods. Based on these IC computation approaches, we then need an approach for semantic similarity computation. How should it be defined? We may generalize existing approaches to a similarity measure for Wikipedia categories. Note that in traditional IC-based methods, the key issue is to find the LCS (Least Common Subsumer) of two concepts. However, since the Wikipedia category structure is a graph, we need to extend the LCS to a GCS (Good Common Subsumer). Because we do not know which underlying similarity measure is suitable for GCS-based similarity computation in the Wikipedia category structure, we present several similarity computation methods that extend existing underlying similarity measures. Each measure is based on a different IC computation method or underlying similarity measure. Thus, the hypothesis we want to test is which IC computation method and which underlying similarity measure are suitable for the similarity computation of concepts in the Wikipedia category structure.
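
As a rough illustration of the LCS-to-GCS generalization sketched above, the following assumes (our assumption, for illustration) that the category structure is a directed acyclic graph and that a "good" common subsumer can be taken as the common ancestor with maximal IC; the paper's precise GCS definition may differ:

```python
# Hedged sketch of generalizing the Least Common Subsumer (LCS) of a tree
# to a "Good Common Subsumer" (GCS) in a graph-shaped category structure.
# Assumption (ours, for illustration): the GCS is the common ancestor with
# maximal IC; the paper's precise GCS definition may differ.
from typing import Callable, Dict, List, Optional, Set

def ancestors(concept: str, parents: Dict[str, List[str]]) -> Set[str]:
    """All ancestors of a concept in a DAG, given a child -> parents map."""
    seen: Set[str] = set()
    stack = list(parents.get(concept, []))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, []))
    return seen

def good_common_subsumer(c1: str, c2: str,
                         parents: Dict[str, List[str]],
                         ic: Callable[[str], float]) -> Optional[str]:
    """Among all common ancestors, choose the most informative (highest IC)."""
    common = ancestors(c1, parents) & ancestors(c2, parents)
    return max(common, key=ic, default=None)
```

In a tree, this reduces to the usual LCS; in a graph, where two categories may share many ancestors along different paths, the IC criterion selects a single representative subsumer.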

The remainder of this paper is organized as follows. In the next section, we briefly review some background on IC computation and Wikipedia. Section 3 presents some novel methods for Wikipedia-based IC computation in detail. In Section 4, we describe our approaches to semantic similarity measurement using Wikipedia. Section 5 presents the evaluation of our approaches in detail. Finally, in Section 6, we draw our conclusions and present some perspectives for future research.

Section snippets

Background

For completeness of presentation and convenience of subsequent discussions, in the current section we will briefly recall some basic notions of IC computation and Wikipedia. See especially (Lin, 1998, Medelyan et al., 2009, Ponzetto and Strube, 2007, Resnik, 1999, Resnik, 1995, Sanchez and Batet, 2013) for further details.

Novel methods for Wikipedia-based IC computation

To solve the shortcomings of existing approaches for IC computation (see Section 1 for details), in this section we propose some novel methods for IC computation using Wikipedia.

The basic unit of information in Wikipedia is the article. Each article describes a single concept, and there is a single article for each concept. For example, Fig. 1 shows a typical article in Wikipedia, entitled Big data. Generally speaking, articles are assigned one or more categories in Wikipedia. For example,
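
The snippet cuts off before the paper's formulas; as a hedged illustration, one plausible Wikipedia-based analogue of intrinsic IC treats the number of categories a category subsumes as a specificity signal (the category names and formula variant below are our assumptions, not necessarily those of the paper):

```python
# Hedged illustration only: a plausible Wikipedia-based analogue of
# intrinsic IC that treats a category's descendant count as a specificity
# signal. Category names and the formula variant are our assumptions,
# not necessarily those of the paper.
import math
from typing import Dict, List, Set

subcats: Dict[str, List[str]] = {  # parent category -> subcategories
    "Science": ["Computer science", "Physics"],
    "Computer science": ["Big data", "Algorithms"],
}

def descendants(cat: str) -> Set[str]:
    """All subcategories reachable below a category."""
    seen: Set[str] = set()
    stack = list(subcats.get(cat, []))
    while stack:
        c = stack.pop()
        if c not in seen:
            seen.add(c)
            stack.extend(subcats.get(c, []))
    return seen

def wikipedia_ic(cat: str, total_categories: int) -> float:
    """Fewer descendants => more specific category => higher IC."""
    return 1.0 - math.log(len(descendants(cat)) + 1) / math.log(total_categories)

print(wikipedia_ic("Big data", total_categories=5))  # leaf category => IC 1.0
```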

Semantic similarity measurement for concepts

In this section we propose some approaches to semantic similarity measurement for concepts (i.e., Wikipedia categories). That is, we give similarity measures based on the IC computation methods provided in Section 3.

Regarding IC-based similarity measurement, the key point when comparing a pair of concepts is to retrieve their LCS (Least Common Subsumer). In a taxonomy, the LCS is the most specific taxonomical ancestor common to both concepts. The more specific the subsumer is (the higher its IC), the more
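
For reference, the classical IC-based measures from the cited literature (Resnik, 1995; Lin, 1998; Jiang & Conrath, 1997), which a graph setting can generalize by substituting the GCS for the LCS, are:

```latex
% Classical IC-based measures (Resnik, 1995; Lin, 1998; Jiang & Conrath,
% 1997); a graph setting can generalize these by substituting GCS for LCS.
\begin{align*}
  \mathrm{sim}_{\mathrm{res}}(c_1, c_2) &= IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr) \\
  \mathrm{sim}_{\mathrm{lin}}(c_1, c_2) &= \frac{2 \cdot IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr)}{IC(c_1) + IC(c_2)} \\
  \mathrm{dist}_{\mathrm{jc}}(c_1, c_2) &= IC(c_1) + IC(c_2) - 2 \cdot IC\bigl(\mathrm{LCS}(c_1, c_2)\bigr)
\end{align*}
```

Lin's measure is bounded in [0, 1], which makes it particularly convenient for correlation-based evaluation against human ratings.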

Evaluation

In this section we discuss the evaluation of our IC computation methods and similarity measures for concepts. Section 5.1 introduces the benchmarks. Section 5.2 gives our experimental results. Lastly, in Section 5.3, we discuss and analyze these results.
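
Although the snippet does not show the protocol in full, evaluation in this literature standardly correlates a measure's scores with human similarity ratings on a benchmark; a minimal sketch (the rating values below are placeholders, not the paper's data):

```python
# Minimal sketch of the standard evaluation methodology: correlate the
# measure's scores with human similarity ratings over a benchmark.
# The rating values below are placeholders, not data from the paper.
from scipy.stats import pearsonr, spearmanr

human_ratings = [3.92, 3.84, 0.42, 1.18]   # e.g., averaged judge scores
machine_scores = [0.91, 0.88, 0.10, 0.25]  # similarity measure outputs

r, _ = pearsonr(human_ratings, machine_scores)
rho, _ = spearmanr(human_ratings, machine_scores)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}")
```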

Conclusion

The Information Content (IC) of a concept is a fundamental dimension in computational linguistics. It states the amount of information provided by the concept when it appears in a context. Accurate quantification of the IC of concepts permits the estimation of their semantic similarity as a function of their shared information. In the past, several approaches to computing the IC of a concept have been proposed. However, existing approaches suffer from limitations such as reliance on corpus availability,

Acknowledgments

The authors would like to thank the anonymous referees for their valuable comments, as well as Professors Jim Jansen, Fabio Crestani, and Hideo Joho for helpful suggestions that greatly improved the exposition of the paper. The work described in this paper is supported by the National Natural Science Foundation of China under Grant No. 61272066; the Program for New Century Excellent Talents in University in China under Grant No. NCET-12-0644; the Natural Science Foundation of Guangdong

References (49)

  • G.A. Miller et al., Contextual correlates of semantic similarity, Language and Cognitive Processes (1991)

  • T. Pedersen et al., Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics (2007)

  • E.G.M. Petrakis et al., X-Similarity: Computing semantic similarity between concepts from different ontologies, Journal of Digital Information Management (2006)

  • S.P. Ponzetto et al., Knowledge derived from Wikipedia for computing semantic relatedness, Journal of Artificial Intelligence Research (2007)

  • D. Sanchez et al., Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective, Journal of Biomedical Informatics (2011)

  • D. Sanchez et al., Ontology-based information content computation, Knowledge-Based Systems (2011)

  • D. Sanchez et al., Ontology-based semantic similarity: A new feature-based approach, Expert Systems with Applications (2012)

  • D. Sanchez et al., Ontology-driven web-based semantic similarity, Journal of Intelligent Information Systems (2010)

  • M. Yazdani et al., Computing text semantic relatedness using the contents and links of a hypertext encyclopedia, Artificial Intelligence (2013)

  • Z. Zhou et al., A new model of information content for semantic similarity in WordNet

  • M.G. Ahsaee et al., Semantic similarity assessment of words using weighted WordNet, International Journal of Machine Learning and Cybernetics (2014)

  • M. Batet et al., Semantic similarity estimation from multiple ontologies, Applied Intelligence (2013)

  • A. Blank, Words and concepts in time: Towards diachronic cognitive onomasiology

  • D. Bollegala et al., Measuring semantic similarity between words using web search engines