A novel family of IC-based similarity measures with a detailed experimental survey on WordNet

https://doi.org/10.1016/j.engappai.2015.09.006

Abstract

This paper introduces a novel family of ontology-based similarity measures based on Information Content (IC) theory, together with a detailed state of the art, a large experimental survey of ontology-based similarity measures on WordNet, and a new comparison between intrinsic and corpus-based IC models. Our experiments rely on our own implementation of a large set of similarity measures and of intrinsic and corpus-based IC models, which are evaluated on two known datasets and three different WordNet versions. The new measures are called weighted Jiang–Conrath distance (wJ&Cdist) and similarity (wJ&Csim), cosine-normalized Jiang–Conrath similarity (cosJ&Csim) and cosine-normalized weighted Jiang–Conrath similarity (coswJ&Csim). Two of our similarity measures outperform the state-of-the-art measures on the RG65 dataset, and one of them obtains the third-best overall score across all the datasets and evaluated WordNet versions. The cosine-normalized similarity measures are a non-linear normalization of the classic Jiang–Conrath (J&C) distance and the new wJ&C distance. The wJ&C distance, in turn, is a generalization of the classic J&C distance, based on the length of the shortest path between concepts within an IC-based weighted graph. Our measures are based on two notions not previously considered: (1) a generalization of the classic J&C distance to any type of taxonomy, based on an IC-based weighted graph derived from the conditional probabilities between child and parent concepts, and (2) a non-linear normalization function that converts the ontology-based semantic distances into similarity functions. Finally, the corpus-based IC models based on the Resnik method obtain results that rival those of the state-of-the-art intrinsic IC models when they are used with some unexplored WordNet-based frequency files. This finding allows us to reconsider some previous conclusions that intrinsic IC models outperform corpus-based ones.

Section snippets

Introduction and positioning

Ontology-based similarity measures have found many applications in natural language processing (NLP), information retrieval (IR), and bioengineering. For example, in IR the aim is to retrieve resources that are semantically related to a user query, with both defined as concept sets. In this context, the word-to-word similarity measures can be extended to compute the distance between bags of concepts, or weighted concepts and individuals; thus, they are a key component in estimating the closeness

Ontology-based similarity measures and IC models

The literature on ontology-based semantic similarity measures and distances is very extensive; thus, we focus only on the measures that are evaluated in this work. First, we review the non-IC-based similarity measures. The rest of the section is devoted to reviewing the family of IC-based measures and models, in which our work is framed. For a broader survey on semantic similarity measures, we refer the reader to the recent book by Harispe et al. (2015). Further surveys can be

The new IC-based similarity measures

In this section, we introduce a new ontology-based distance, defined in Eq. (7), and three new IC-based similarity measures, defined in Eqs. (8), (11), (12) below. These measures are based on two different unexplored notions: (1) a generalization of the Jiang–Conrath distance, and (2) a non-linear normalization for the conversion of ontology-based semantic distances into similarity measures.
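The first notion can be illustrated with a minimal sketch. It is not the paper's Eq. (7), which is not reproduced in this excerpt. It assumes a hypothetical toy tree taxonomy with made-up concept probabilities, takes each edge weight to be -log2 p(child | parent), and assumes p(child | parent) = p(child) / p(parent). Under those assumptions, the shortest-path length on the weighted graph coincides with the classic J&C distance IC(c1) + IC(c2) - 2 IC(LCS(c1, c2)). All names below (`PARENT`, `PROB`, `build_graph`, and so on) are illustrative only.

```python
import heapq
import math

# Hypothetical toy taxonomy (child -> parent) with made-up concept probabilities.
PARENT = {"animal": "entity", "plant": "entity", "dog": "animal", "cat": "animal"}
PROB = {"entity": 1.0, "animal": 0.5, "plant": 0.5, "dog": 0.25, "cat": 0.125}


def ic(concept):
    """Information content: IC(c) = -log2 p(c)."""
    return -math.log2(PROB[concept])


def build_graph():
    """Undirected graph whose edge weights are -log2 p(child | parent)."""
    graph = {concept: {} for concept in PROB}
    for child, parent in PARENT.items():
        # Assumption: p(child | parent) = p(child) / p(parent) in a tree taxonomy.
        weight = -math.log2(PROB[child] / PROB[parent])
        graph[child][parent] = weight
        graph[parent][child] = weight
    return graph


def shortest_path_length(graph, source, target):
    """Length of the shortest weighted path, via Dijkstra with a binary heap."""
    dist = {source: 0.0}
    queue = [(0.0, source)]
    while queue:
        d, node = heapq.heappop(queue)
        if node == target:
            return d
        if d > dist.get(node, math.inf):
            continue  # stale queue entry
        for neighbour, weight in graph[node].items():
            candidate = d + weight
            if candidate < dist.get(neighbour, math.inf):
                dist[neighbour] = candidate
                heapq.heappush(queue, (candidate, neighbour))
    return math.inf


def ancestors(concept):
    """The concept followed by its chain of ancestors up to the root."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def jc_distance(c1, c2):
    """Classic J&C distance: IC(c1) + IC(c2) - 2 * IC(lowest common subsumer)."""
    seen = set(ancestors(c1))
    lcs = next(a for a in ancestors(c2) if a in seen)
    return ic(c1) + ic(c2) - 2 * ic(lcs)


if __name__ == "__main__":
    graph = build_graph()
    for pair in [("dog", "cat"), ("dog", "plant")]:
        print(pair, shortest_path_length(graph, *pair), jc_distance(*pair))
```

On a tree the two quantities agree because the edge weights telescope to IC differences along the path through the lowest common subsumer. In a taxonomy with multiple inheritance the shortest path may take a different route, which is presumably where a graph-based generalization departs from the classic formula.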

The normalization function is based on the computation of the maximum distance on the taxonomy, and a

Evaluation

The goals of the experimental work described in this section are as follows: (1) the experimental evaluation and comparison of the new IC-based similarity measures with most of the similarity measures and the intrinsic and corpus-based IC models reported in the literature, (2) the replication of previously reported methods and results, (3) an experimental study into the influence of the WordNet version on the similarity measures, (4) a comparison between intrinsic and corpus-based IC models, (5) a study

Discussion

Our new IC-based similarity measures, called coswJ&C and cosJ&C, obtain the highest correlation values on the classic RG65 dataset across all the WordNet versions. The first intrinsic Hadj Taieb et al. similarity measure obtains the highest correlation values on the P&Sfull dataset across all the WordNet versions. Finally, the Hadj Taieb et al. measure and the Meng et al. (2012) IC-based similarity measure obtain the best overall scores when the correlation values are averaged over all the datasets and

Conclusions and future work

First, we have introduced one IC-based semantic distance and three new IC-based similarity measures based on a generalization and normalization of the classic Jiang–Conrath distance, which outperform the state-of-the-art methods on the RG65 dataset. Second, we have presented an up-to-date experimental survey, whose aim is the uniform comparison of the most recent and relevant similarity measures on WordNet, especially the families of IC-based similarity measures and intrinsic IC models. In addition,

Acknowledgements

Although we decided to develop our own software library to implement all the IC-based models and measures evaluated in this work, we would like to express our gratitude to Sébastien Harispe, who provided us with the source code of the SML library and offered his full support. Jian-Bo Gao, David Sánchez, Montserrat Batet and Giuseppe Pirró kindly answered all our questions, clarifying certain issues about their methods and experimental results so that we could replicate them in our platform. Mohamed Hadj Taieb kindly

References (64)

  • Yan, W., et al., 2014. An ontology-based approach for inventive problem solving. Eng. Appl. Artif. Intell.
  • Alvarez, M.A., Lim, S., 2007. A graph modeling of semantic similarity between words. In: Proceedings of the First IEEE...
  • Budanitsky, A., et al., 2006. Evaluating WordNet-based measures of lexical semantic relatedness. Comput. Linguist.
  • Chen, M., Chowdhury, R.A., Ramachandran, V., Roche, D.L., Tong, L., 2007. Priority queues and Dijkstra's algorithm....
  • Couto, F.M., et al., 2013. The next generation of similarity measures that fully explore the semantics in biomedical ontologies. J. Bioinform. Comput. Biol.
  • Cross, V., Hu, X., 2011. Using semantic similarity in ontology alignment. In: Proceedings of the Sixth International...
  • Fiorini, N., Ranwez, S., Montmain, J., Ranwez, V., 2015. USI: a fast and accurate approach for conceptual document...
  • Fokkens, A., Van Erp, M., Postma, M., Pedersen, T., Vossen, P., Freire, N., 2013. Offspring from reproduction problems:...
  • Gabrilovich, E., Markovitch, S., 2007. Computing semantic relatedness using wikipedia-based explicit semantic analysis....
  • Gan, M., et al., 2013. From ontology to semantic similarity: calculation of ontology-based semantic similarity. Sci. World J.
  • Garla, V.N., et al., 2012. Semantic similarity in the biomedical domain: an evaluation across knowledge sources. BMC Bioinform.
  • Hadj Taieb, M.A., et al., 2014. A new semantic relatedness measurement using WordNet features. Knowl. Inf. Syst.
  • Harispe, S., Ranwez, S., Janaqi, S., Montmain, J., 2014a. The semantic measures library: assessing semantic similarity...
  • Harispe, S., Ranwez, S., Janaqi, S., Montmain, J., 2015. Semantic similarity from natural language and ontology...
  • Harris, Z.S., 1981. Distributional structure. In: Hiż, H. (Ed.), Papers on Syntax. vol. 14 of Synthese Language...
  • Hirst, G., St-Onge, D., 1998. Lexical chains as representations of context for the detection and correction of...
  • Jiang, J.J., Conrath, D.W., 1997. Semantic similarity based on corpus statistics and lexical taxonomy. In: Proceedings...
  • Lastra-Díaz, J.J., 2014. Intrinsic semantic spaces for the representation of documents and semantic annotated data....
  • Lastra-Díaz, J.J., García-Serrano, A., 2014. System and method for the indexing and retrieval of semantically annotated...
  • Leacock, C., Chodorow, M., 1998. Combining local context and WordNet similarity for word sense identification. In:...
  • Li, P., Wang, H., Zhu, K.Q., Wang, Z., Hu, X.-G., Wu, X., 2015. A large probabilistic semantic network based approach...
  • Li, P., Wang, H., Zhu, K.Q., Wang, Z., Wu, X., 2013. Computing term similarity by large probabilistic is a knowledge....