Feature-based approaches to semantic similarity assessment of concepts using Wikipedia

https://doi.org/10.1016/j.ipm.2015.01.001

Highlights

  • A formal representation of Wikipedia concepts is presented.

  • A framework for feature-based similarity is proposed.

  • Novel feature-based approaches to semantic similarity measurement are presented.

  • Results show that several proposed methods correlate well with human judgements.

Abstract

Semantic similarity assessment between concepts is an important task in many language-related applications. In the past, several approaches have been proposed that assess similarity by evaluating the knowledge modeled in one or more ontologies. However, existing measures suffer from limitations such as reliance on predefined ontologies and an inability to handle dynamic domains. Wikipedia provides a very large, domain-independent encyclopedic repository and semantic network for computing semantic similarity of concepts, with broader coverage than typical ontologies. In this paper, we propose several novel feature-based similarity assessment methods that depend entirely on Wikipedia and avoid most of the limitations and drawbacks described above. To implement feature-based similarity assessment using Wikipedia, we first present a formal representation of Wikipedia concepts. We then give a framework for feature-based similarity built on this formal representation. Lastly, we investigate several feature-based semantic similarity measures resulting from instantiations of the framework. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, confirms the intuitions with respect to human judgements. Overall, several methods proposed in this paper correlate well with human judgements and constitute effective ways of determining similarity between Wikipedia concepts.

Introduction

Semantic similarity between concepts is becoming a common problem for many applications of Computational Linguistics and Artificial Intelligence such as natural language processing, knowledge acquisition, information retrieval, and word sense disambiguation (Budanitsky and Hirst, 2006, Liu et al., 2012, Sanchez et al., 2012). Proper assessment of concept similarity improves the understanding of textual resources and increases the accuracy of knowledge-based applications. The notion of similarity rests on identifying concepts that share common "characteristics". Semantic similarity is understood as the degree of taxonomic proximity between concepts (or terms, words): it states how taxonomically near two concepts are, because they share some aspects of their meaning. Technically, similarity measures compute a numerical score that quantifies this proximity as a function of the semantic evidence observed in one or several knowledge sources (Sanchez & Batet, 2013). In fact, assessing semantic similarity between concepts (or terms, words) has been and continues to be widely studied, and is a central and common issue in many research areas such as Psychology, Linguistics, Cognitive Science, Biomedicine, and Artificial Intelligence (Liu et al., 2012, Pirro, 2009). Roughly speaking, semantic similarity measurement relates to computing the similarity between concepts (or words, terms, short text expressions) that have the same meaning or related information but are not lexicographically similar (Li et al., 2003, Martinez-Gil and Aldana-Montes, 2013). Up to the present, research on semantic similarity has been very active and many results have been achieved (Martinez-Gil, 2014).

Making judgments about the semantic similarity of different concepts is a routine yet deceptively complex task. To perform it, people draw on an immense amount of background knowledge about the concepts. Any attempt to compute semantic similarity automatically must also consult external sources of knowledge. Most of the work dealing with semantic similarity measures has been developed using taxonomies and more general ontologies, which provide a formal and machine-readable way to express a shared conceptualization by means of a unified terminology and semantic inter-relations from which semantic similarity can be assessed (Batet et al., 2013, Budanitsky and Hirst, 2006, Couto et al., 2007, Cross et al., 2013, Liu et al., 2012, Rodriguez and Egenhofer, 2003, Sanchez and Batet, 2013, Sanchez et al., 2010, Sanchez et al., 2012). According to the theoretical principles and the way in which ontologies are analyzed to estimate similarity, different families of methods can also be identified (Sanchez & Batet, 2013). 
These families are (Martinez-Gil, 2014, Petrakis et al., 2006): (1) edge-counting measures, which take into account the length of the path linking the concepts (or terms) and the position of the concepts in a given dictionary, taxonomy, or ontology (Leacock and Chodorow, 1998, Li et al., 2003, Rada et al., 1989); (2) information content measures, which measure the difference in the information content of the two concepts (or terms) as a function of their probability of occurrence in a text corpus or an ontology (Buggenhout and Ceusters, 2005, Lin, 1998, Resnik, 1995, Resnik, 1999, Sanchez and Batet, 2013, Sanchez et al., 2010); (3) feature-based measures, which measure the similarity between concepts (or terms) as a function of their properties or of their relationships to other similar concepts (Banerjee and Pedersen, 2003, Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012); (4) hybrid measures, which combine all of the above (Batet et al., 2013, Pirro, 2009, Schickel-Zuber and Faltings, 2007).

In a nutshell, edge-counting measures base the similarity assessment on the number of taxonomical links in the minimum path separating two concepts in a given ontology (Li et al., 2003, Rada et al., 1989). Their main advantage is simplicity: they rely only on the graph model of an input ontology, and their evaluation has a low computational cost. Because of this simplicity, however, these approaches offer limited accuracy, since ontologies model a large amount of taxonomical knowledge that is not considered during the evaluation of the minimum path (Batet et al., 2011, Sanchez and Batet, 2013). From another perspective, the main assumption of edge-counting measures is that an edge represents the same semantic distance anywhere in the structure of the graph (or path), which does not hold in practice, as some sections of the graph may be finely classified and others only coarsely defined (Mathur & Dinakarpandian, 2012).
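As an illustration of the edge-counting family, the following sketch computes the minimum path length between two concepts in a small hypothetical is-a taxonomy and derives a Leacock-Chodorow-style score from it. The taxonomy, its depth, and all concept names are invented for illustration; this is not the method proposed in this paper.

```python
from collections import deque
from math import log

# A toy is-a taxonomy (hypothetical, for illustration only).
TAXONOMY = {
    "entity": ["vehicle", "substance"],
    "vehicle": ["car", "bicycle"],
    "substance": ["gasoline"],
}

def _edges(tax):
    """Build an undirected adjacency map from parent -> children lists."""
    adj = {}
    for parent, children in tax.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def path_length(tax, a, b):
    """Number of edges on the shortest path between concepts a and b (BFS)."""
    adj = _edges(tax)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # concepts are disconnected

def leacock_chodorow(tax, a, b, depth):
    """Leacock-Chodorow similarity: -log(length / (2 * depth))."""
    length = path_length(tax, a, b)
    return -log(length / (2.0 * depth)) if length else None
```

For example, "car" and "bicycle" are two edges apart (via "vehicle"), while "car" and "gasoline" are four edges apart (via the root), so the former pair receives the higher score. The example also makes the uniform-distance assumption criticized above visible: every edge contributes equally regardless of how finely its region of the taxonomy is modeled.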

To address some of the limitations of edge-counting measures, Resnik (1995) proposed complementing the taxonomical structure of an ontology with the information distribution of concepts evaluated in input corpora (Sanchez et al., 2012). Information Content (IC) based approaches assess the similarity between concepts as a function of the information content the two concepts have in common in a given ontology. In the past, IC was typically computed from concept distributions in tagged textual corpora (Jiang and Conrath, 1997, Lin, 1998, Resnik, 1995). However, this introduces a dependency on corpus availability and manual tagging that hampers accuracy and applicability due to data sparseness (Sanchez et al., 2010). To overcome this problem, several researchers have in recent years proposed ways to infer the IC of concepts intrinsically, from the knowledge structure modeled in an ontology (Sanchez and Batet, 2011, Sanchez and Batet, 2013, Sanchez et al., 2011). However, the fact that intrinsic IC-based measures rely only on ontological knowledge is also a drawback, because they depend entirely on the degree of coverage and detail of the single input ontology (Sanchez & Batet, 2013).
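To make the IC family concrete, the sketch below implements Lin's (1998) measure, sim(a, b) = 2·IC(lcs(a, b)) / (IC(a) + IC(b)), where IC(c) = -log p(c) and the least common subsumer (lcs) is the deepest shared ancestor. The corpus counts and the toy taxonomy are invented for illustration; they stand in for the tagged-corpus distributions the paragraph describes.

```python
from math import log

# Hypothetical corpus frequencies for a toy taxonomy (illustration only).
# Each count includes the occurrences of all subsumed concepts, so the
# root "entity" subsumes every occurrence.
FREQ = {"entity": 1000, "vehicle": 400, "car": 150, "bicycle": 50, "gasoline": 80}
TOTAL = FREQ["entity"]

PARENT = {"vehicle": "entity", "car": "vehicle", "bicycle": "vehicle",
          "gasoline": "entity", "entity": None}

def ic(concept):
    """Information content: -log p(c), with p estimated from corpus counts."""
    return -log(FREQ[concept] / TOTAL)

def ancestors(c):
    """Path from a concept up to the root, including the concept itself."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def lcs(a, b):
    """Least common subsumer: the first ancestor of b also subsuming a."""
    anc_a = set(ancestors(a))
    for c in ancestors(b):
        if c in anc_a:
            return c
    return "entity"

def lin(a, b):
    """Lin (1998): sim = 2 * IC(lcs) / (IC(a) + IC(b))."""
    return 2.0 * ic(lcs(a, b)) / (ic(a) + ic(b))
```

Under these invented counts, lin("car", "bicycle") is positive (their lcs "vehicle" carries information), while lin("car", "gasoline") is 0 because their only shared subsumer is the root, whose IC is zero. The example also exposes the dependency the text criticizes: the scores change entirely with the corpus counts.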

To overcome the limitation of edge-counting measures that taxonomical links in an ontology do not necessarily represent uniform distances, feature-based measures consider the degree of overlap between sets of ontological features (Sanchez et al., 2012). As a result, they are more general and could potentially be applied in cross-ontology similarity estimation settings (i.e., when the concept pair belongs to two different ontologies), a situation in which edge-counting methods cannot be directly applied (Petrakis et al., 2006, Sanchez et al., 2012). Thus, in contrast to edge-counting measures, which are based on the notion of minimum path distance, feature-based approaches estimate similarity between concepts as a weighted sum of their common and non-common features. By features, authors usually mean the taxonomic and non-taxonomic information modeled in an ontology, in addition to concept descriptions (e.g., glosses) retrieved from dictionaries (Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Tversky, 1977). Owing to the additional semantic evidence considered during the assessment, they potentially improve on edge-counting approaches (Sanchez & Batet, 2013).

There are still some limitations in the above-mentioned ontology-based similarity measures (Batet et al., 2011, Batet et al., 2013, Liu et al., 2012, Sanchez and Batet, 2011, Sanchez and Batet, 2013, Sanchez et al., 2011, Sanchez et al., 2012). In fact, similarity estimation is based on the extraction of semantic evidence from one or several knowledge sources: the more available the background knowledge and the better its structure, the more accurate the estimation can potentially be. Usually, these knowledge sources are well-defined semantic networks such as WordNet (Ahsaee et al., 2014, Fellbaum, 1998, Liu et al., 2012) or more domain-dependent ontologies such as the Gene Ontology (Couto et al., 2007, Mathur and Dinakarpandian, 2012) and the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011). On the one hand, the prerequisite of ontology-based similarity measures is the existence of one or several predefined domain ontologies. Such ontologies are established by a panel of experts in the given domains. Clearly, constructing domain ontologies is time-consuming and error-prone, and maintaining them also requires a lot of expert effort. Thus, ontology-based similarity measures are limited in scope and scalability. On the other hand, while WordNet (or a domain ontology) represents a well-structured taxonomy organized in a meaningful way, questions arise about the need for larger coverage.
In particular, with the emergence of social networks and instant messaging systems (Martinez-Gil and Aldana-Montes, 2013, Retzer et al., 2012), many concepts and terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) are not included in WordNet or domain ontologies; indeed, Web users can now publish whatever they want to share with the rest of the world using wikis, blogs, and online communities. Similarity measures based on these kinds of knowledge resources (i.e., WordNet or domain ontologies) therefore cannot be used in such tasks. These limitations motivate the new techniques presented in this paper, which infer semantic similarity from a new kind of information source: a wide-coverage online encyclopedia, namely Wikipedia (Hovy et al., 2013, Medelyan et al., 2009). Several researchers (Ponzetto and Strube, 2007, Gabrilovich and Markovitch, 2007, Gabrilovich and Markovitch, 2009, Zesch et al., 2008, Taieb et al., 2012, Yazdani and Popescu-Belis, 2013) have worked on measuring semantic relatedness of concepts (or terms, words) using Wikipedia. It is important to note that semantic similarity and semantic relatedness are two distinct notions. Semantic relatedness is the more general notion, while similarity is a special case of relatedness tied to the likeness of the concepts (see Budanitsky & Hirst, 2006 for a discussion of the two). For example, antonyms (e.g., "increase" vs. "decrease") are related but not similar. Or, following Resnik (1995), "car" and "bicycle" are more similar than "car" and "gasoline", though the latter pair may seem more related in the world (Yazdani & Popescu-Belis, 2013). In this paper hyperlinks between Wikipedia articles (i.e., Wikipedia concepts) are not considered, so in the following we use the term semantic similarity.

The purpose of this paper is to present several new feature-based similarity measures that address the shortcomings of existing approaches to semantic similarity. The paper focuses on semantic similarity between concepts (or words, terms) drawn from Wikipedia (Wikipedia concepts, i.e., the titles of Wikipedia articles). In other words, we approach the problem of feature-based semantic similarity between Wikipedia concepts from a novel perspective. Thus, the terms Wikipedia concept, concept, and word are used interchangeably; we prefer concept, but some readers may read them as terms or words instead. Since Wikipedia is a rich encyclopedia (and corpus, thesaurus, and network structure) that covers almost every imaginable topic, the Wikipedia-based similarity measures presented in this paper can process many of the terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) that Web users publish through social platforms such as wikis, and can be applied in dynamic domains.

The remainder of this paper is organized as follows. In the next section, we briefly review some background on feature-based similarity and Wikipedia. Section 3 presents in detail some feature-based approaches to similarity assessment using Wikipedia. This includes the formal representation of Wikipedia concepts, a framework for feature-based similarity, and several feature-based measures resulting from instantiations of the framework. Section 4 is devoted to the evaluation of our approaches. Finally, in Section 5, we draw our conclusion and present some perspectives for future research.


Background

For completeness of presentation and convenience of subsequent discussions, in the current section we will briefly recall some basic notions of feature based similarity and Wikipedia. See especially (Medelyan et al., 2009, Petrakis et al., 2006, Ponzetto and Strube, 2007, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012) for further details.

Feature-based similarity using Wikipedia

In this section we propose some feature based approaches to semantic similarity measures of concepts using Wikipedia. In order to illustrate these feature based approaches, we first present the notion of formal representation of Wikipedia concepts. We then give a framework for feature based similarity based on the formal representation of Wikipedia concepts. Lastly, we investigate several feature based approaches to semantic similarity measures resulting from instantiations of the framework.
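The section snippet above does not show the concrete feature sets used, so the following is only a hypothetical sketch of what an instantiation of such a framework could look like: a Wikipedia concept represented by its title plus several feature sets extractable from its article (redirects as synonyms, first-paragraph gloss words, category labels), with similarity computed as a weighted combination of per-feature-set overlaps. All field names, the Jaccard overlap choice, and the weights are assumptions for illustration, not the paper's actual representation or measure.

```python
from dataclasses import dataclass, field

@dataclass
class WikiConcept:
    """Hypothetical formal representation of a Wikipedia concept:
    the article title plus feature sets extractable from the article."""
    title: str
    synonyms: set = field(default_factory=set)    # e.g., redirect titles
    gloss: set = field(default_factory=set)       # e.g., first-paragraph words
    categories: set = field(default_factory=set)  # e.g., category labels

def jaccard(a, b):
    """Set overlap in [0, 1]; 0 for two empty sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity(c1, c2, weights=(0.3, 0.3, 0.4)):
    """One possible instantiation of a feature-based framework:
    a weighted sum of per-feature-set overlaps (weights are illustrative)."""
    pairs = [(c1.synonyms, c2.synonyms),
             (c1.gloss, c2.gloss),
             (c1.categories, c2.categories)]
    return sum(w * jaccard(a, b) for w, (a, b) in zip(weights, pairs))

# Invented example concepts.
car = WikiConcept("Car", {"automobile", "motorcar"},
                  {"wheeled", "motor", "vehicle"}, {"Vehicles"})
bike = WikiConcept("Bicycle", {"bike"},
                   {"wheeled", "pedal", "vehicle"}, {"Vehicles", "Cycling"})
```

Different choices of feature sets, overlap function, and weights yield different measures, which is what "instantiations of the framework" amounts to in this sketch.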

Evaluation

To validate the effectiveness of the semantic similarity measures proposed here and to evaluate them, in this section we use real-world datasets to compute semantic similarity between Wikipedia concepts. Concretely, we use Wikipedia as the data resource of our experiments. As for the evaluation environment, the version of Wikipedia used to obtain our measures was released on 4 June 2013, and we use JWPL (Java Wikipedia Library), Java with the Java™ 2 SDK and

Conclusion

The final goal of computerized similarity measures is to accurately mimic human judgements about semantic similarity. At present, similarity measures are used in many different areas such as natural language processing, ontology mapping, and Web search. In this paper, some limitations of the existing feature-based measures are identified, such as reliance on one or more predefined domain ontologies and restriction to static (i.e., non-dynamic) domains. To

Acknowledgements

The authors would like to thank the anonymous referees for their valuable comments, as well as the Editor-in-Chief (Professor Fabio Crestani) and Associate Editor (Professor Mark Sanderson) for helpful suggestions that greatly improved the exposition of the paper. The work described in this paper is supported by the National Natural Science Foundation of China under Grant Nos. 61272066 and 61272067; the Program for New Century Excellent Talents in University in China under Grant No. NCET-12-0644;

References (60)

  • H. Liu et al., Concept vector for semantic similarity and relatedness based on WordNet structure, Journal of Systems and Software (2012)
  • S. Mathur et al., Finding disease similarity based on implicit semantic similarity, Journal of Biomedical Informatics (2012)
  • O. Medelyan et al., Mining meaning from Wikipedia, International Journal of Human–Computer Studies (2009)
  • J. Nothman et al., Learning multilingual named entity recognition from Wikipedia, Artificial Intelligence (2013)
  • J. Oliva et al., SyMSS: A syntax-based measure for short-text semantic similarity, Data & Knowledge Engineering (2011)
  • T. Pedersen et al., Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics (2007)
  • G. Pirro, A semantic similarity metric combining features and intrinsic information content, Data & Knowledge Engineering (2009)
  • D. Sanchez et al., Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective, Journal of Biomedical Informatics (2011)
  • D. Sanchez et al., A semantic similarity method based on information content exploiting multiple ontologies, Expert Systems with Applications (2013)
  • D. Sanchez et al., Ontology-based information content computation, Knowledge-Based Systems (2011)
  • D. Sanchez et al., Ontology-based semantic similarity: A new feature-based approach, Expert Systems with Applications (2012)
  • P. Sorg et al., Exploiting Wikipedia for cross-lingual and multilingual information retrieval, Data & Knowledge Engineering (2012)
  • M. Yazdani et al., Computing text semantic relatedness using the contents and links of a hypertext encyclopedia, Artificial Intelligence (2013)
  • M.G. Ahsaee et al., Semantic similarity assessment of words using weighted WordNet, International Journal of Machine Learning and Cybernetics (2014)
  • S. Banerjee et al., Extended gloss overlaps as a measure of semantic relatedness
  • M. Batet et al., Semantic similarity estimation from multiple ontologies, Applied Intelligence (2013)
  • A. Budanitsky et al., Evaluating WordNet-based measures of lexical semantic relatedness, Computational Linguistics (2006)
  • C. Fellbaum, WordNet: An electronic lexical database (1998)
  • L. Finkelstein et al., Placing search in context: The concept revisited, ACM Transactions on Information Systems (2002)
  • A. Formica et al., Concept similarity in SymOntos: An enterprise ontology management tool, The Computer Journal (2002)