Sentence similarity based on semantic kernels for intelligent text retrieval

Journal of Intelligent Information Systems

Abstract

We propose a new approach to computing semantic similarity between sentences. It is based on the semantic kernel, composed of the subject, verb, and object, which we assume summarizes the general meaning of each sentence. Using available linguistic resources such as the Stanford Parser, we extract many features from the semantic kernels and aggregate them by means of weights. The weights are learned by a supervised machine learning technique from a training data set provided by human experts as ground truth. Cross-validation shows good performance. With this sentence similarity measure, one can build an intelligent text retrieval engine that is more sensitive to semantic content and better suited to short texts than classical bag-of-words methods. An application is being developed for highlighting parts of speech in scientific articles.
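
The aggregation step described in the abstract can be illustrated with a minimal sketch. It assumes the (subject, verb, object) kernels have already been extracted by a parser, uses one string-overlap score per slot as a stand-in for the paper's features (which are not listed in the abstract), and takes hand-picked weights in place of the weights learned by supervised training. The names `Kernel`, `word_similarity`, and `kernel_similarity` are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher
from typing import Sequence


@dataclass(frozen=True)
class Kernel:
    """Hypothetical container for a sentence's (subject, verb, object) triple."""
    subject: str
    verb: str
    obj: str


def word_similarity(a: str, b: str) -> float:
    """Stand-in lexical similarity in [0, 1] based on character overlap.

    This is an assumption for the sketch, not one of the paper's features.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def kernel_similarity(k1: Kernel, k2: Kernel, weights: Sequence[float]) -> float:
    """Weighted average of the per-slot similarities of two kernels."""
    features = (
        word_similarity(k1.subject, k2.subject),
        word_similarity(k1.verb, k2.verb),
        word_similarity(k1.obj, k2.obj),
    )
    if len(weights) != len(features):
        raise ValueError("expected one weight per feature")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalize by the weight total so the score stays in [0, 1].
    return sum(w * f for w, f in zip(weights, features)) / total


# Hand-picked weights for illustration; the paper learns them from
# human-annotated sentence pairs.
example_weights = (1.0, 2.0, 1.0)
k_a = Kernel("researcher", "writes", "article")
k_b = Kernel("scientist", "writes", "paper")
print(kernel_similarity(k_a, k_b, example_weights))
```

In a fuller version, each slot would contribute several features (for example thesaurus-based or corpus-based word similarities), and the weight vector would be fitted by regression against expert similarity ratings rather than set by hand.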

Notes

  1. http://lsa.colorado.edu/

References

  • Breaux, H.J. (1968). A modification of Efroymson’s technique for stepwise regression analysis. Communications of the ACM, 11(8), 556–558.

  • Budanitsky, A., & Hirst, G. (2006). Evaluating WordNet-based measures of lexical semantic relatedness. Computational Linguistics, 32(1), 13–47.

  • Che, L.M., Wei, C.J., Cheng, H.T., Hui, C.H., & Chen, C.H. (2012). A sentence similarity metric based on semantic patterns. Advances in Information Sciences and Service Sciences, 4(1), 576–585.

  • Croft, D., Coupland, S., Shell, J., & Brown, S. (2013). A fast and efficient semantic short text similarity metric.

  • De Boni, M., & Manandhar, S. (2003). The use of sentence similarity as a semantic relevance metric for question answering. In New directions in question answering, papers from 2003 AAAI spring symposium (pp. 138–144). Stanford: Stanford University.

  • de Marneffe, M.-C., & Manning, C.D. (2008). The Stanford typed dependencies representation. In Coling 2008: Proceedings of the Workshop on Cross-Framework and Cross-Domain Parser Evaluation, CrossParser ’08 (pp. 1–8). Stroudsburg: Association for Computational Linguistics.

  • Hardin, J.W., & Hilbe, J. (2001). Generalized linear models and extensions. College Station: Stata Press.

  • Hatzivassiloglou, V., Klavans, J.L., & Eskin, E. (1999). Detecting text similarity over short passages: Exploring linguistic feature combinations via machine learning. In 1999 Joint SIGDAT conference on empirical methods in natural language processing and very large corpora (pp. 203–212).

  • Heidinger, V. (1984). Analyzing Syntax and Semantics: Workbook. Gallaudet University Press.

  • Hirst, G., & St-Onge, D. (1994). WORDNET: A lexical database for English. In Human language technology, proceedings of a workshop held at Plainsboro, New Jersey, USA, March 8-11.

  • Hirst, G., & St-Onge, D. (1998). Lexical chains as representations of context for the detection and correction of malapropisms. The MIT Press.

  • Islam, A., & Inkpen, D. (2008). Semantic text similarity using corpus-based word similarity and string similarity. ACM Transactions on Knowledge Discovery from Data, 2(2), 10:1–10:25.

  • Jurafsky, D., & Martin, J.H. (2000). Speech and Language Processing: An Introduction to Natural Language Processing, Computational Linguistics, and Speech Recognition, 1st edn. Upper Saddle River: Prentice Hall PTR.

  • Landauer, T.K., Foltz, P.W., & Laham, D. (1998). An introduction to latent semantic analysis. Discourse Processes, 25, 259–284.

  • Lehmann, J., Isele, R., Jakob, M., Jentzsch, A., Kontokostas, D., Mendes, P.N., Hellmann, S., Morsey, M., Kleef, P.V., Auer, S., & Bizer, C. (2015). DBpedia - A large-scale, multilingual knowledge base extracted from Wikipedia. Semantic Web, 6(2), 167–195.

  • Li, Y., McLean, D., Bandar, Z.A., O’Shea, J.D., & Crockett, K. (2006). Sentence similarity based on semantic nets and corpus statistics. IEEE Transactions on Knowledge and Data Engineering, 18(8), 1138–1150.

  • Oliva, J., Serrano, J.I., Dolores del Castillo, M., & Iglesias, A. (2011). SyMSS: A syntax-based measure for short-text semantic similarity. Data & Knowledge Engineering, 70(4), 390–405.

  • O’Shea, J., Bandar, Z., Crockett, K.A., & McLean, D. (2008). A comparative study of two short text semantic similarity measures. In Proceedings on Agent and multi-agent systems: Technologies and applications, second KES international symposium, KES-AMSTA 2008, Incheon, Korea, March 26-28, 2008 (pp. 172–181).

  • O’Shea, J., Bandar, Z., & Crockett, K. (2014). A new benchmark dataset with production methodology for short text semantic similarity algorithms. ACM Transactions on Speech and Language Processing, 10(4), 19:1–19:63.

  • Rakesh, P., Shivapratap, G., Divya, G., & Soman, K.P. (2009). Evaluation of SVD and NMF methods for latent semantic analysis. International Journal of Recent Trends in Engineering, 1(3).

  • Salton, G., Wong, A., & Yang, C. (1975). A vector space model for automatic indexing. Communications of the ACM, 18(11), 613–620.

  • Salton, G., & McGill, M. (1984). Introduction to Modern Information Retrieval: McGraw-Hill Book Company.

  • Spaeth, A., & Desmarais, M.C. (2013). Combining collaborative filtering and text similarity for expert profile recommendations in social websites. In Proceedings on User modeling, adaptation, and personalization - 21th international conference, UMAP 2013, Rome, Italy, June 10-14, 2013 (pp. 178–189).

  • Tsatsaronis, G., Varlamis, I., & Vazirgiannis, M. (2010). Text relatedness based on a word thesaurus. Journal of Artificial Intelligence Research, 37(1), 1–40.

  • Winkler, W.E. (1999). The state of record linkage and current research problems. Technical report, Statistical Research Division, U.S. Census Bureau.

Author information

Corresponding author

Correspondence to Samir Amir.

Appendix

Table 4 The benchmark used for the first experiment (O’Shea et al. 2008)
Table 5 The benchmark used for the second experiment (O’Shea et al. 2014)

About this article

Cite this article

Amir, S., Tanasescu, A. & Zighed, D.A. Sentence similarity based on semantic kernels for intelligent text retrieval. J Intell Inf Syst 48, 675–689 (2017). https://doi.org/10.1007/s10844-016-0434-3

  • DOI: https://doi.org/10.1007/s10844-016-0434-3
