
Interpretable semantic textual similarity of sentences using alignment of chunks with classification and regression

Published in Applied Intelligence.

Abstract

The proposed work focuses on establishing an interpretable Semantic Textual Similarity (iSTS) method for a pair of sentences, one that can clarify why two sentences are completely similar, partially similar, or differ in certain respects. The proposed interpretable approach is a pipeline of five modules that begins with pre-processing and chunking of the text. The chunks of the two sentences are then aligned using a one-to-multi (1:M) chunk aligner. Thereafter, Support Vector, Gaussian Naive Bayes and k-Nearest Neighbours classifiers are used to build a multiclass classification algorithm whose class labels define the alignment types. Finally, a multivariate regression algorithm is developed to score the semantic equivalence of each alignment on a scale from 0 to 5. The efficiency of the proposed method is verified on three different datasets and compared with other state-of-the-art iSTS methods, and the evaluation results show that the proposed method performs better. Most importantly, the modules of the proposed iSTS method are reused to develop a Textual Entailment (TE) method; it is found that combining chunk-level, alignment, and sentence-level features significantly improves the entailment results.
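
For illustration only, the sketch below shows how the final two stages described above (alignment-type classification and 0-to-5 similarity regression) could be wired together in scikit-learn. It is not the authors' implementation: the random feature matrices, the soft-voting combination of the three classifiers, and the choice of a linear regressor are assumptions made purely for the example, and the paper's own feature extraction from chunk alignments is not reproduced here.

    # Minimal sketch (not the authors' code): given numeric feature vectors for each
    # aligned chunk pair, predict an alignment type with a vote over the three
    # classifiers named in the abstract, then predict a 0-5 similarity score.
    # Feature extraction from chunk alignments is assumed to happen elsewhere.
    import numpy as np
    from sklearn.svm import SVC
    from sklearn.naive_bayes import GaussianNB
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.ensemble import VotingClassifier
    from sklearn.linear_model import LinearRegression

    rng = np.random.default_rng(0)
    X_train = rng.random((200, 10))          # placeholder alignment features
    y_type = rng.integers(0, 5, 200)         # placeholder alignment-type labels
    y_score = rng.uniform(0.0, 5.0, 200)     # placeholder 0-5 similarity scores

    # Multiclass alignment-type classifier built from SVM, Gaussian NB and k-NN.
    align_type_clf = VotingClassifier(
        estimators=[("svm", SVC(probability=True)),
                    ("gnb", GaussianNB()),
                    ("knn", KNeighborsClassifier(n_neighbors=5))],
        voting="soft",
    )
    align_type_clf.fit(X_train, y_type)

    # Multivariate regression for the semantic-equivalence score of an alignment.
    score_reg = LinearRegression().fit(X_train, y_score)

    X_new = rng.random((1, 10))              # features for one new chunk alignment
    predicted_type = align_type_clf.predict(X_new)[0]
    predicted_score = float(np.clip(score_reg.predict(X_new)[0], 0.0, 5.0))
    print(predicted_type, round(predicted_score, 2))

How the three classifiers are actually combined in the paper (and which regressor is used) is not stated in the abstract, so the voting ensemble and linear model above are simply convenient stand-ins.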


Notes

  1. http://ixa2.si.ehu.es/stswiki/index.php/Main_Page

  2. https://alt.qcri.org/semeval2015/task2/index.php?id=data-and-tools

  3. http://alt.qcri.org/semeval2016/task2/index.php?id=data-and-tools

  4. https://www.sketchengine.eu/penn-treebank-tagset/

  5. https://github.com/FerreroJeremy/monolingual-word-aligner

  6. http://paraphrase.org/#/download

  7. https://code.google.com/archive/p/word2vec/

  8. http://cocodataset.org/#download

  9. https://wordnet.princeton.edu/

  10. https://spacy.io/models

  11. https://www.csie.ntu.edu.tw/~cjlin/libsvm/

  12. http://alt.qcri.org/semeval2016/task2/index.php?id=data-and-tools

  13. https://fivethirtyeight.com/features/not-even-scientists-can-easily-explain-p-values/

  14. https://tac.nist.gov/data/RTE/index.html


Acknowledgments

The work presented here was carried out under Research Project Grant No. IFC/4130/DST-NRS/2018-19/IT25 (DST-CNRS targeted programme). The first author is also thankful to Google Colab for providing support for the experiments and acknowledges the Introduction to Machine Learning course of the SWAYAM NPTEL programme of the Govt. of India.

Author information


Corresponding author

Correspondence to Goutam Majumder.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Majumder, G., Pakray, P., Das, R. et al. Interpretable semantic textual similarity of sentences using alignment of chunks with classification and regression. Appl Intell 51, 7322–7349 (2021). https://doi.org/10.1007/s10489-020-02144-x

