A novel family of IC-based similarity measures with a detailed experimental survey on WordNet
Introduction and positioning
Ontology-based similarity measures have found many applications in natural language processing (NLP), information retrieval (IR), and bioengineering. For example, in IR the aim is to retrieve resources that are semantically related to a user query, where both the resources and the query are defined as concept sets. In this context, word-to-word similarity measures can be extended to compute the distance between bags of concepts, or between weighted concepts and individuals, so they are a key component in estimating the closeness
Ontology-based similarity measures and IC models
The literature on ontology-based semantic similarity measures and distances is very extensive, so we focus only on the measures that are evaluated in this work. First, we review the non-IC-based similarity measures. The rest of the section then reviews the family of IC-based measures and models, in which our work is framed. For a broader survey on semantic similarity measures, we refer the reader to the recent book by Harispe et al. (2015). Further surveys can be
The new IC-based similarity measures
In this section, we introduce a new ontology-based distance, defined in Eq. (7), and three new IC-based similarity measures, defined in Eqs. (8), (11), and (12) below. These measures are based on two previously unexplored notions: (1) a generalization of the Jiang–Conrath distance, and (2) a non-linear normalization for converting ontology-based semantic distances into similarity measures.
The normalization function is based on the computation of the maximum distance on the taxonomy, and a
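To make the two notions above concrete, the following minimal Python sketch computes the classic Jiang–Conrath distance, d(c1, c2) = IC(c1) + IC(c2) - 2·IC(MICA(c1, c2)), on a toy taxonomy and then converts it into a similarity by dividing by the maximum distance on the taxonomy and applying a cosine. The taxonomy, the IC values, and the specific cosine-shaped normalization are assumptions made for illustration only; the sketch does not reproduce Eqs. (7), (8), (11), or (12), which are not shown in this excerpt.

```python
import math

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT = {
    "entity": None,
    "animal": "entity",
    "plant": "entity",
    "dog": "animal",
    "cat": "animal",
}

# Hypothetical IC values; in practice they would come from an
# intrinsic or corpus-based IC model.
IC = {"entity": 0.0, "animal": 1.0, "plant": 1.5, "dog": 3.0, "cat": 3.2}


def ancestors(concept):
    """Return the concept followed by all of its ancestors up to the root."""
    found = []
    while concept is not None:
        found.append(concept)
        concept = PARENT[concept]
    return found


def mica_ic(c1, c2):
    """IC of the most informative common ancestor (MICA) of c1 and c2."""
    common = set(ancestors(c1)) & set(ancestors(c2))
    return max(IC[c] for c in common)


def jc_distance(c1, c2):
    """Classic Jiang-Conrath distance: IC(c1) + IC(c2) - 2 * IC(MICA)."""
    return IC[c1] + IC[c2] - 2.0 * mica_ic(c1, c2)


def max_jc_distance():
    """Largest pairwise distance on the toy taxonomy (brute force)."""
    concepts = list(IC)
    return max(jc_distance(a, b) for a in concepts for b in concepts)


def cosine_normalized_similarity(c1, c2):
    """Assumed non-linear normalization: cos(pi/2 * d / d_max), in [0, 1]."""
    d_max = max_jc_distance()
    return math.cos(0.5 * math.pi * jc_distance(c1, c2) / d_max)
```

With this choice, identical concepts get similarity 1, the most distant pair on the taxonomy gets a similarity of approximately 0, and the mapping in between is non-linear in the distance. The brute-force search for the maximum distance is only practical for a toy example; on a taxonomy the size of WordNet it would need to be computed more carefully.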
Evaluation
The goals of the experimental work described in this section are as follows: (1) the experimental evaluation and comparison of the new IC-based similarity measures with most of the similarity measures and the intrinsic and corpus-based IC models reported in the literature, (2) the replication of previously reported methods and results, (3) an experimental study of the influence of the WordNet version on the similarity measures, (4) a comparison between intrinsic and corpus-based IC models, (5) a study
Discussion
Our new IC-based similarity measures, called coswJ&C and cosJ&C, obtain the highest correlation values on the classic RG65 dataset across all the WordNet versions. The first intrinsic Hadj Taieb et al. similarity measure obtains the highest correlation values in the dataset and all the WordNet versions. Finally, the Hadj Taieb et al. measure and the Meng et al. (2012) IC-based similarity measure obtain the best overall scores when the correlation values are averaged over all the datasets and
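The kind of comparison discussed here can be sketched as follows: for each dataset, the scores produced by a similarity measure are correlated with the human judgments, and the per-dataset correlations are then averaged. The sketch assumes the Pearson coefficient, which is one common choice and is not confirmed by this excerpt; the dataset names and all numbers are hypothetical placeholders, not values from RG65 or from the reported experiments.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical placeholder data, NOT taken from any real benchmark:
# for each dataset, (human judgments, scores of one similarity measure).
datasets = {
    "dataset_a": ([3.9, 3.1, 0.4, 1.2], [0.95, 0.80, 0.10, 0.35]),
    "dataset_b": ([3.5, 0.9, 2.2], [0.70, 0.30, 0.60]),
}

# One correlation value per dataset, then the average over all datasets.
per_dataset = {
    name: pearson(human, scores) for name, (human, scores) in datasets.items()
}
average = sum(per_dataset.values()) / len(per_dataset)
```

Repeating this for every measure, IC model, and WordNet version yields a table of correlation values from which per-dataset and averaged rankings can be read off.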
Conclusions and future work
First, we have introduced one IC-based semantic distance and three new IC-based similarity measures based on a generalization and normalization of the classic Jiang–Conrath distance, which outperform the state-of-the-art methods on the RG65 dataset. Second, we have introduced an up-to-date experimental survey, whose aim is the uniform comparison of the most recent and relevant similarity measures on WordNet, especially the families of IC-based similarity measures and intrinsic IC models. In addition,
Acknowledgements
Although we decided to develop our own software library to implement all the IC-based models and measures evaluated in this work, we would like to express our gratitude to Sébastien Harispe, who provided us with the source code of the SML library and offered his full support. Jian-Bo Gao, David Sánchez, Montserrat Batet and Giuseppe Pirró kindly answered all our questions and clarified certain issues on their methods and experimental results so that we could replicate them in our platform. Mohamed Hadj Taieb kindly
References (64)
- et al.
An ontology-based measure to compute semantic similarity in biomedicine
J. Biomed. Inform.
(2011)
- et al.
A SNOMED supported ontological vector model for subclinical disorder detection using EHR similarity
Eng. Appl. Artif. Intell.
(2011)
- et al.
Unifying ontological similarity measures: a theoretical and empirical investigation
Int. J. Approx. Reason.
(2013)
- et al.
A WordNet-based semantic similarity measurement combining edge-counting and information content theory
Eng. Appl. Artif. Intell.
(2015)
- et al.
Ontology-based approach for measuring semantic similarity
Eng. Appl. Artif. Intell.
(2014)
- et al.
A framework for unifying ontology-based semantic similarity measures: a study in the biomedical domain
J. Biomed. Inform.
(2014)
- A semantic similarity metric combining features and intrinsic information content
Data Knowl. Eng.
(2009)
- et al.
Ontology-based information content computation
Knowl.-Based Syst.
(2011)
- et al.
Ontology-based semantic similarity: a new feature-based approach
Expert Syst. Appl.
(2012)
- et al.
Semantic variance: an intuitive measure for ontology accuracy evaluation
Eng. Appl. Artif. Intell.
(2015)