Feature-based approaches to semantic similarity assessment of concepts using Wikipedia

https://doi.org/10.1016/j.ipm.2015.01.001

Highlights

  • A formal representation of Wikipedia concepts is presented.

  • A framework for feature-based similarity is proposed.

  • Novel feature-based approaches to semantic similarity measurement are presented.

  • Results show that several proposed methods correlate well with human judgements.

Abstract

Semantic similarity assessment between concepts is an important task in many language-related applications. In the past, several approaches have been proposed that assess similarity by evaluating the knowledge modeled in one or more ontologies. However, existing measures suffer from limitations such as reliance on predefined ontologies and an inability to handle dynamic domains. Wikipedia provides a very large, domain-independent encyclopedic repository and semantic network for computing semantic similarity of concepts, with broader coverage than typical ontologies. In this paper, we propose several novel feature-based similarity assessment methods that depend entirely on Wikipedia and avoid most of the limitations and drawbacks described above. To implement feature-based similarity assessment using Wikipedia, we first present a formal representation of Wikipedia concepts. We then give a framework for feature-based similarity built on this formal representation. Lastly, we investigate several feature-based semantic similarity measures resulting from instantiations of the framework. The evaluation, based on several widely used benchmarks and a benchmark we developed ourselves, confirms the intuitions with respect to human judgements. Overall, several methods proposed in this paper correlate well with human judgements and constitute effective ways of determining similarity between Wikipedia concepts.

Introduction

Semantic similarity between concepts is becoming a common problem for many applications of Computational Linguistics and Artificial Intelligence such as natural language processing, knowledge acquisition, information retrieval, and word sense disambiguation (Budanitsky and Hirst, 2006, Liu et al., 2012, Sanchez et al., 2012). Proper assessment of concept similarity improves the understanding of textual resources and increases the accuracy of knowledge-based applications. The notion of similarity rests on identifying concepts that share common "characteristics". Semantic similarity is understood as the degree of taxonomic proximity between concepts (or terms, words): it states how taxonomically near two concepts are, because they share some aspects of their meaning. Technically, similarity measures compute a numerical score that quantifies this proximity as a function of the semantic evidence observed in one or several knowledge sources (Sanchez & Batet, 2013). In fact, assessing semantic similarity between concepts (or terms, words) has been and continues to be widely studied, and is a central and common issue in many research areas such as Psychology, Linguistics, Cognitive Science, Biomedicine, and Artificial Intelligence (Liu et al., 2012, Pirro, 2009). Roughly speaking, semantic similarity measurement relates to computing the similarity between concepts (or words, terms, short text expressions) that have the same meaning or related information but are not lexicographically similar (Li et al., 2003, Martinez-Gil and Aldana-Montes, 2013). Up to the present, research on semantic similarity has been very active and many results have been achieved (Martinez-Gil, 2014).

Making judgments about the semantic similarity of different concepts is a routine yet deceptively complex task. To perform it, people draw on an immense amount of background knowledge about the concepts. Any attempt to compute semantic similarity automatically must also consult external sources of knowledge. Most of the work dealing with semantic similarity measures has been developed using taxonomies and more general ontologies, which provide a formal and machine-readable way to express a shared conceptualization by means of a unified terminology and semantic inter-relations from which semantic similarity can be assessed (Batet et al., 2013, Budanitsky and Hirst, 2006, Couto et al., 2007, Cross et al., 2013, Liu et al., 2012, Rodriguez and Egenhofer, 2003, Sanchez and Batet, 2013, Sanchez et al., 2010, Sanchez et al., 2012). According to the theoretical principles and the way in which ontologies are analyzed to estimate similarity, different families of methods can also be identified (Sanchez & Batet, 2013). 
These families are (Martinez-Gil, 2014, Petrakis et al., 2006): (1) edge-counting measures, which take into account the length of the path linking the concepts (or terms) and the position of the concepts in a given dictionary, taxonomy, or ontology (Leacock and Chodorow, 1998, Li et al., 2003, Rada et al., 1989); (2) information content measures, which measure the difference in the information content of the two concepts (or terms) as a function of their probability of occurrence in a text corpus or an ontology (Buggenhout and Ceusters, 2005, Lin, 1998, Resnik, 1995, Resnik, 1999, Sanchez and Batet, 2013, Sanchez et al., 2010); (3) feature-based measures, which measure the similarity between concepts (or terms) as a function of their properties or of their relationships to other similar concepts (Banerjee and Pedersen, 2003, Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012); (4) hybrid measures, which combine all of the above (Batet et al., 2013, Pirro, 2009, Schickel-Zuber and Faltings, 2007).

In a nutshell, edge-counting measures base the similarity assessment on the number of taxonomical links in the minimum path separating two concepts in a given ontology (Li et al., 2003, Rada et al., 1989). Their main advantage is simplicity: they rely only on the graph model of an input ontology, and their evaluation has a low computational cost. Because of this simplicity, however, these approaches offer limited accuracy, since ontologies model a large amount of taxonomical knowledge that is not considered during the evaluation of the minimum path (Batet et al., 2011, Sanchez and Batet, 2013). From another perspective, the main assumption of edge-counting measures is that an edge represents the same semantic distance anywhere in the structure of the graph (or path), which does not hold in practice, as some sections of the graph may be finely classified and others only coarsely defined (Mathur & Dinakarpandian, 2012).
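As an illustration of the edge-counting family, the following sketch computes the minimum path length between two concepts in a small hypothetical is-a taxonomy and derives a Leacock-Chodorow-style score from it. The taxonomy, its depth, and all concept names are invented for illustration; this is not the method proposed in this paper.

```python
from collections import deque
from math import log

# A toy is-a taxonomy (hypothetical, for illustration only).
TAXONOMY = {
    "entity": ["vehicle", "substance"],
    "vehicle": ["car", "bicycle"],
    "substance": ["gasoline"],
}

def _edges(tax):
    """Build an undirected adjacency map from parent -> children lists."""
    adj = {}
    for parent, children in tax.items():
        for child in children:
            adj.setdefault(parent, set()).add(child)
            adj.setdefault(child, set()).add(parent)
    return adj

def path_length(tax, a, b):
    """Number of edges on the shortest path between concepts a and b (BFS)."""
    adj = _edges(tax)
    seen, frontier = {a}, deque([(a, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if node == b:
            return dist
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return None  # concepts are disconnected

def leacock_chodorow(tax, a, b, depth):
    """Leacock-Chodorow similarity: -log(length / (2 * depth))."""
    length = path_length(tax, a, b)
    return -log(length / (2.0 * depth)) if length else None
```

For example, "car" and "bicycle" are two edges apart (via "vehicle"), while "car" and "gasoline" are four edges apart (via the root), so the former pair receives the higher score. The example also makes the uniform-distance assumption criticized above visible: every edge contributes equally regardless of how finely its region of the taxonomy is modeled.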

To address some of the limitations of edge-counting measures, Resnik (1995) proposed complementing the taxonomical structure of an ontology with the information distribution of concepts evaluated in input corpora (Sanchez et al., 2012). Information Content (IC) based approaches assess the similarity between concepts as a function of the information content the two concepts have in common in a given ontology. In the past, IC was typically computed from concept distributions in tagged textual corpora (Jiang and Conrath, 1997, Lin, 1998, Resnik, 1995). However, this introduces a dependency on corpus availability and manual tagging that hampers accuracy and applicability due to data sparseness (Sanchez et al., 2010). To overcome this problem, several researchers have in recent years proposed ways to infer the IC of concepts intrinsically, from the knowledge structure modeled in an ontology (Sanchez and Batet, 2011, Sanchez and Batet, 2013, Sanchez et al., 2011). However, the fact that intrinsic IC-based measures rely only on ontological knowledge is also a drawback, because they depend entirely on the degree of coverage and detail of the single input ontology (Sanchez & Batet, 2013).
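To make the IC family concrete, the sketch below implements Lin's (1998) measure, sim(a, b) = 2·IC(lcs(a, b)) / (IC(a) + IC(b)), where IC(c) = -log p(c) and the least common subsumer (lcs) is the deepest shared ancestor. The corpus counts and the toy taxonomy are invented for illustration; they stand in for the tagged-corpus distributions the paragraph describes.

```python
from math import log

# Hypothetical corpus frequencies for a toy taxonomy (illustration only).
# Each count includes the occurrences of all subsumed concepts, so the
# root "entity" subsumes every occurrence.
FREQ = {"entity": 1000, "vehicle": 400, "car": 150, "bicycle": 50, "gasoline": 80}
TOTAL = FREQ["entity"]

PARENT = {"vehicle": "entity", "car": "vehicle", "bicycle": "vehicle",
          "gasoline": "entity", "entity": None}

def ic(concept):
    """Information content: -log p(c), with p estimated from corpus counts."""
    return -log(FREQ[concept] / TOTAL)

def ancestors(c):
    """Path from a concept up to the root, including the concept itself."""
    out = []
    while c is not None:
        out.append(c)
        c = PARENT[c]
    return out

def lcs(a, b):
    """Least common subsumer: the first ancestor of b also subsuming a."""
    anc_a = set(ancestors(a))
    for c in ancestors(b):
        if c in anc_a:
            return c
    return "entity"

def lin(a, b):
    """Lin (1998): sim = 2 * IC(lcs) / (IC(a) + IC(b))."""
    return 2.0 * ic(lcs(a, b)) / (ic(a) + ic(b))
```

Under these invented counts, lin("car", "bicycle") is positive (their lcs "vehicle" carries information), while lin("car", "gasoline") is 0 because their only shared subsumer is the root, whose IC is zero. The example also exposes the dependency the text criticizes: the scores change entirely with the corpus counts.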

To overcome the limitation of edge-counting measures that taxonomical links in an ontology do not necessarily represent uniform distances, feature-based measures consider the degree of overlap between sets of ontological features (Sanchez et al., 2012). As a result, they are more general and could potentially be applied in cross-ontology similarity estimation settings (i.e., when the concept pair belongs to two different ontologies), a situation in which edge-counting methods cannot be directly applied (Petrakis et al., 2006, Sanchez et al., 2012). Thus, in contrast to edge-counting measures, which are based on the notion of minimum path distance, feature-based approaches estimate similarity between concepts as a weighted sum of their common and non-common features. By features, authors usually mean the taxonomic and non-taxonomic information modeled in an ontology, in addition to concept descriptions (e.g., glosses) retrieved from dictionaries (Petrakis et al., 2006, Rodriguez and Egenhofer, 2003, Tversky, 1977). Owing to the additional semantic evidence considered during the assessment, they potentially improve on edge-counting approaches (Sanchez & Batet, 2013).

There are still some limitations in the above-mentioned ontology-based similarity measures (Batet et al., 2011, Batet et al., 2013, Liu et al., 2012, Sanchez and Batet, 2011, Sanchez and Batet, 2013, Sanchez et al., 2011, Sanchez et al., 2012). In fact, similarity estimation is based on the extraction of semantic evidence from one or several knowledge sources: the more available the background knowledge and the better its structure, the more accurate the estimation can potentially be. Usually, these knowledge sources are well-defined semantic networks such as WordNet (Ahsaee et al., 2014, Fellbaum, 1998, Liu et al., 2012) or more domain-dependent ontologies such as the Gene Ontology (Couto et al., 2007, Mathur and Dinakarpandian, 2012) and the biomedical ontologies MeSH or SNOMED CT (Batet et al., 2013, Pedersen et al., 2007, Sanchez and Batet, 2011). On the one hand, the prerequisite of ontology-based similarity measures is the existence of one or several predefined domain ontologies. Such ontologies are established by a panel of experts in the given domains. Clearly, constructing domain ontologies is time-consuming and error-prone, and maintaining them also requires a lot of expert effort. Thus, ontology-based similarity measures are limited in scope and scalability. On the other hand, while WordNet (or a domain ontology) represents a well-structured taxonomy organized in a meaningful way, questions arise about the need for larger coverage.
In particular, with the emergence of social networks and instant messaging systems (Martinez-Gil and Aldana-Montes, 2013, Retzer et al., 2012), many concepts and terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) are not included in WordNet or domain ontologies; indeed, Web users can now publish whatever they want to share with the rest of the world using wikis, blogs, and online communities. Similarity measures based on these kinds of knowledge resources (i.e., WordNet or domain ontologies) therefore cannot be used in such tasks. These limitations motivate the new techniques presented in this paper, which infer semantic similarity from a new kind of information source: a wide-coverage online encyclopedia, namely Wikipedia (Hovy et al., 2013, Medelyan et al., 2009). Several researchers (Ponzetto and Strube, 2007, Gabrilovich and Markovitch, 2007, Gabrilovich and Markovitch, 2009, Zesch et al., 2008, Taieb et al., 2012, Yazdani and Popescu-Belis, 2013) have worked on measuring semantic relatedness of concepts (or terms, words) using Wikipedia. It is important to note that semantic similarity and semantic relatedness are two distinct notions. Semantic relatedness is the more general notion, while similarity is a special case of relatedness tied to the likeness of the concepts (see Budanitsky & Hirst, 2006 for a discussion of the two). For example, antonyms (e.g., "increase" vs. "decrease") are related but not similar. Or, following Resnik (1995), "car" and "bicycle" are more similar than "car" and "gasoline", though the latter pair may seem more related in the world (Yazdani & Popescu-Belis, 2013). In this paper hyperlinks between Wikipedia articles (i.e., Wikipedia concepts) are not considered, so in the following we use the term semantic similarity.

The purpose of this paper is to present several new feature-based similarity measures that address the shortcomings of existing approaches to semantic similarity. The paper focuses on semantic similarity between concepts (or words, terms) drawn from Wikipedia (Wikipedia concepts, i.e., the titles of Wikipedia articles). In other words, we approach the problem of feature-based semantic similarity between Wikipedia concepts from a novel perspective. Thus, the terms Wikipedia concept, concept, and word are used interchangeably; we prefer concept, but some readers may read them as terms or words instead. Since Wikipedia is a rich encyclopedia (and corpus, thesaurus, and network structure) that covers almost every imaginable topic, the Wikipedia-based similarity measures presented in this paper can process many of the terms (proper nouns, brands, acronyms, new words, conversational words, technical terms, and so on) that Web users publish through social platforms such as wikis, and can be applied in dynamic domains.

The remainder of this paper is organized as follows. In the next section, we briefly review some background on feature-based similarity and Wikipedia. Section 3 presents in detail some feature-based approaches to similarity assessment using Wikipedia. This includes the formal representation of Wikipedia concepts, a framework for feature-based similarity, and several feature-based measures resulting from instantiations of the framework. Section 4 is devoted to the evaluation of our approaches. Finally, in Section 5, we draw our conclusion and present some perspectives for future research.


Background

For completeness of presentation and convenience of subsequent discussions, in the current section we will briefly recall some basic notions of feature based similarity and Wikipedia. See especially (Medelyan et al., 2009, Petrakis et al., 2006, Ponzetto and Strube, 2007, Rodriguez and Egenhofer, 2003, Sanchez et al., 2012) for further details.

Feature-based similarity using Wikipedia

In this section we propose some feature based approaches to semantic similarity measures of concepts using Wikipedia. In order to illustrate these feature based approaches, we first present the notion of formal representation of Wikipedia concepts. We then give a framework for feature based similarity based on the formal representation of Wikipedia concepts. Lastly, we investigate several feature based approaches to semantic similarity measures resulting from instantiations of the framework.
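The section snippet above does not show the concrete feature sets used, so the following is only a hypothetical sketch of what an instantiation of such a framework could look like: a Wikipedia concept represented by its title plus several feature sets extractable from its article (redirects as synonyms, first-paragraph gloss words, category labels), with similarity computed as a weighted combination of per-feature-set overlaps. All field names, the Jaccard overlap choice, and the weights are assumptions for illustration, not the paper's actual representation or measure.

```python
from dataclasses import dataclass, field

@dataclass
class WikiConcept:
    """Hypothetical formal representation of a Wikipedia concept:
    the article title plus feature sets extractable from the article."""
    title: str
    synonyms: set = field(default_factory=set)    # e.g., redirect titles
    gloss: set = field(default_factory=set)       # e.g., first-paragraph words
    categories: set = field(default_factory=set)  # e.g., category labels

def jaccard(a, b):
    """Set overlap in [0, 1]; 0 for two empty sets."""
    return len(a & b) / len(a | b) if (a | b) else 0.0

def similarity(c1, c2, weights=(0.3, 0.3, 0.4)):
    """One possible instantiation of a feature-based framework:
    a weighted sum of per-feature-set overlaps (weights are illustrative)."""
    pairs = [(c1.synonyms, c2.synonyms),
             (c1.gloss, c2.gloss),
             (c1.categories, c2.categories)]
    return sum(w * jaccard(a, b) for w, (a, b) in zip(weights, pairs))

# Invented example concepts.
car = WikiConcept("Car", {"automobile", "motorcar"},
                  {"wheeled", "motor", "vehicle"}, {"Vehicles"})
bike = WikiConcept("Bicycle", {"bike"},
                   {"wheeled", "pedal", "vehicle"}, {"Vehicles", "Cycling"})
```

Different choices of feature sets, overlap function, and weights yield different measures, which is what "instantiations of the framework" amounts to in this sketch.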

Evaluation

To validate the effectiveness of the semantic similarity measures proposed here and to evaluate them, in this section we use real-world datasets to compute semantic similarity between Wikipedia concepts. Concretely, we use Wikipedia as the data resource of our experiments. As for the evaluation environment, the version of Wikipedia used to obtain our measures was released on 4 June 2013, and we use JWPL (Java Wikipedia Library), Java with the Java™ 2 SDK and

Conclusion

The final goal of computerized similarity measures is to accurately mimic human judgements about semantic similarity. At present, similarity measures are used in many different areas such as natural language processing, ontology mapping, and Web search. In this paper, some limitations of the existing feature-based measures are identified, such as reliance on one or more predefined domain ontologies and restriction to static (i.e., non-dynamic) domains. To

Acknowledgements

The authors would like to thank the anonymous referees for their valuable comments, as well as the Editor-in-Chief (Professor Fabio Crestani) and Associate Editor (Professor Mark Sanderson) for helpful suggestions that greatly improved the exposition of the paper. The work described in this paper is supported by the National Natural Science Foundation of China under Grant Nos. 61272066 and 61272067; the Program for New Century Excellent Talents in University in China under Grant No. NCET-12-0644;

References (60)

  • H. Liu et al., Concept vector for semantic similarity and relatedness based on WordNet structure, Journal of Systems and Software (2012)
  • S. Mathur et al., Finding disease similarity based on implicit semantic similarity, Journal of Biomedical Informatics (2012)
  • O. Medelyan et al., Mining meaning from Wikipedia, International Journal of Human–Computer Studies (2009)
  • J. Nothman et al., Learning multilingual named entity recognition from Wikipedia, Artificial Intelligence (2013)
  • J. Oliva et al., SyMSS: A syntax-based measure for short-text semantic similarity, Data & Knowledge Engineering (2011)
  • T. Pedersen et al., Measures of semantic similarity and relatedness in the biomedical domain, Journal of Biomedical Informatics (2007)
  • G. Pirro, A semantic similarity metric combining features and intrinsic information content, Data & Knowledge Engineering (2009)
  • D. Sanchez et al., Semantic similarity estimation in the biomedical domain: An ontology-based information-theoretic perspective, Journal of Biomedical Informatics (2011)
  • D. Sanchez et al., A semantic similarity method based on information content exploiting multiple ontologies, Expert Systems with Applications (2013)
  • D. Sanchez et al., Ontology-based information content computation, Knowledge-Based Systems (2011)
  • D. Sanchez et al., Ontology-based semantic similarity: A new feature-based approach, Expert Systems with Applications (2012)
  • P. Sorg et al., Exploiting Wikipedia for cross-lingual and multilingual information retrieval, Data & Knowledge Engineering (2012)
  • M. Yazdani et al., Computing text semantic relatedness using the contents and links of a hypertext encyclopedia, Artificial Intelligence (2013)
  • M.G. Ahsaee et al., Semantic similarity assessment of words using weighted WordNet, International Journal of Machine Learning and Cybernetics (2014)
  • S. Banerjee et al., Extended gloss overlaps as a measure of semantic relatedness
  • M. Batet et al., Semantic similarity estimation from multiple ontologies, Applied Intelligence (2013)
  • A. Budanitsky et al., Evaluating WordNet-based measures of lexical semantic relatedness, Computational Linguistics (2006)
  • C. Fellbaum, WordNet: An electronic lexical database (1998)
  • L. Finkelstein et al., Placing search in context: The concept revisited, ACM Transactions on Information Systems (2002)
  • A. Formica et al., Concept similarity in SymOntos: An enterprise ontology management tool, The Computer Journal (2002)