Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-Grams

Vu, Hai Hieu; Villaneau, Jeanne; Saïd, Farida; Marteau, Pierre-François

doi:10.1007/978-3-319-10816-2_25

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8655)

Included in the following conference series:

International Conference on Text, Speech, and Dialogue

Abstract

We propose a sentence similarity measure that combines a knowledge-based measure, a lighter version of ESA (Explicit Semantic Analysis), with a distributional measure, Rouge. We applied this hybrid measure to two French domain-oriented corpora collected from the Web and compared its similarity scores with those of human judges. In both domains, ESA and Rouge perform better in combination than they do individually. Moreover, using the whole Wikipedia base in ESA proved unnecessary, since the best results were obtained with a small number of well-selected concepts.
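
To make the idea of a hybrid measure concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that each sentence already has a sparse ESA-style concept vector (a dict from concept name to weight), that the n-gram component is a ROUGE-N-style recall with the second sentence treated as the reference, and that the two scores are mixed linearly with a hypothetical weight `alpha`. The abstract does not state the actual combination formula, so all function names and parameters here are illustrative.

```python
from collections import Counter
from math import sqrt


def ngrams(tokens, n):
    """Return the multiset of word n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_overlap(tokens_a, tokens_b, n=2):
    """ROUGE-N-style recall: clipped n-gram matches over n-grams in tokens_b."""
    grams_a, grams_b = ngrams(tokens_a, n), ngrams(tokens_b, n)
    total = sum(grams_b.values())
    if total == 0:
        return 0.0
    # Counter intersection keeps the minimum count of each shared n-gram.
    matches = sum((grams_a & grams_b).values())
    return matches / total


def cosine(vec_a, vec_b):
    """Cosine similarity of two sparse concept vectors (concept -> weight)."""
    dot = sum(w * vec_b.get(c, 0.0) for c, w in vec_a.items())
    norm_a = sqrt(sum(w * w for w in vec_a.values()))
    norm_b = sqrt(sum(w * w for w in vec_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def hybrid_similarity(tokens_a, tokens_b, concepts_a, concepts_b, alpha=0.5, n=2):
    """Linear mix of concept-space cosine and n-gram overlap.

    alpha is a hypothetical mixing weight (assumption, not from the paper).
    """
    semantic = cosine(concepts_a, concepts_b)
    lexical = ngram_overlap(tokens_a, tokens_b, n)
    return alpha * semantic + (1 - alpha) * lexical
```

For example, `hybrid_similarity("the cat sat".split(), "the cat ran".split(), {"Cat": 1.0}, {"Cat": 2.0})` mixes a concept cosine of 1.0 with a bigram overlap of 0.5. In practice the weight would be tuned against human similarity judgments, as the evaluation described in the abstract suggests.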

Author information

Authors and Affiliations

IRISA, Université de Bretagne Sud (UBS), France
Hai Hieu Vu, Jeanne Villaneau & Pierre-François Marteau
LMBA, Université de Bretagne Sud, France
Farida Saïd

Editor information

Editors and Affiliations

Faculty of Informatics, Masaryk University, Botanická 6a, 60200, Brno, Czech Republic
Petr Sojka
Faculty of Informatics, Department of Information Technologies, Masaryk University, 602 00, Brno, Czech Republic
Aleš Horák, Ivan Kopeček & Karel Pala

About this paper

Cite this paper

Vu, H.H., Villaneau, J., Saïd, F., Marteau, P.-F. (2014). Sentence Similarity by Combining Explicit Semantic Analysis and Overlapping N-Grams. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech and Dialogue. TSD 2014. Lecture Notes in Computer Science, vol 8655. Springer, Cham. https://doi.org/10.1007/978-3-319-10816-2_25

DOI: https://doi.org/10.1007/978-3-319-10816-2_25
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10815-5
Online ISBN: 978-3-319-10816-2
eBook Packages: Computer Science, Computer Science (R0)
