Abstract
Estimating semantic similarity between texts is of vital importance in many areas of natural language processing, such as information retrieval, question answering, text reuse, and plagiarism detection.
Prevalent semantic similarity estimates based on word embeddings are sensitive to noise: many small individual term similarities can, in aggregate, exert a considerable influence on the overall estimate. In contrast, the methods proposed here exploit the spectrum of the product of embedding matrices, which makes them more robust than conventional methods.
We apply these estimates to two tasks: assigning people to the best-matching marketing target group, and finding the correct match between sentences from two independent translations of the same novel. The evaluation shows that our proposed method based on the spectral norm increases accuracy over several baseline methods in both scenarios.
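The abstract describes a similarity estimate built on the spectrum of the product of two word-embedding matrices. As the full derivation is not reproduced here, the following is a minimal sketch under the assumption that each text is represented as a matrix whose rows are word vectors, that the spectral norm (largest singular value) of the cross-product matrix serves as the raw score, and that a self-similarity normalization is applied so that a text compared with itself scores 1; the exact normalization used in the paper may differ.

```python
import numpy as np

def spectral_similarity(A, B):
    """Hypothetical sketch of a spectral-norm text similarity.

    A, B: embedding matrices of shape (n_words, dim), one row per word
    vector. The raw score is the spectral norm of the pairwise
    word-similarity matrix A @ B.T; normalizing by the self-similarity
    norms bounds the result in [0, 1] and yields 1 for identical texts.
    """
    raw = np.linalg.norm(A @ B.T, 2)        # largest singular value
    self_a = np.linalg.norm(A @ A.T, 2)     # = sigma_1(A)**2
    self_b = np.linalg.norm(B @ B.T, 2)     # = sigma_1(B)**2
    return raw / np.sqrt(self_a * self_b)
```

Because the score depends on the largest singular value rather than a sum over all term pairs, many small spurious term similarities contribute little to the result, which is the robustness property the abstract refers to.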
Notes
- 1. This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718_alignmentPurloinedLettertar.
Acknowledgement
We thank the Jaywalker GmbH and the Jaywalker Digital AG for their support of this publication, and especially for annotating the contest data with the best-fitting youth milieus.
A Example Contest Answer
The following snippet is an example user answer for the travel contest (contest 1):
- 1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen bevor die Touristenbusse kommen
- 2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
- 3. USA: Eine abgespaceste Woche am Burning Man Festival erleben
English translation:
- 1. Jordan: Riding through the desert and marveling at Petra at dawn, before the tourist buses arrive
- 2. Cook Island: Snorkeling with whale sharks and relaxing
- 3. USA: Experiencing a far-out week at the Burning Man Festival
Copyright information
© 2023 Springer Nature Switzerland AG
Cite this paper
vor der Brück, T., Pouly, M. (2023). Spectral Text Similarity Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0