Abstract
Estimating semantic similarity between texts is of vital importance in many areas of natural language processing, such as information retrieval, question answering, text reuse, and plagiarism detection.
Prevalent semantic similarity estimates based on word embeddings are sensitive to noise: many small individual term similarities can, in aggregate, exert a considerable influence on the overall estimate. In contrast, the methods proposed here exploit the spectrum of the product of embedding matrices, which makes them more robust than conventional methods.
We apply these estimates to two tasks: assigning people to the best-matching marketing target group, and finding the correct match between sentences from two independent translations of the same novel. The evaluation shows that our proposed method based on the spectral norm increases accuracy over several baseline methods in both scenarios.
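The abstract describes a similarity estimate built on the spectrum of the product of two word-embedding matrices. As the full derivation is not reproduced here, the following is a minimal sketch under the assumption that each text is represented as a matrix whose rows are word vectors, that the spectral norm (largest singular value) of the cross-product matrix serves as the raw score, and that a self-similarity normalization is applied so that a text compared with itself scores 1; the exact normalization used in the paper may differ.

```python
import numpy as np

def spectral_similarity(A, B):
    """Hypothetical sketch of a spectral-norm text similarity.

    A, B: embedding matrices of shape (n_words, dim), one row per word
    vector. The raw score is the spectral norm of the pairwise
    word-similarity matrix A @ B.T; normalizing by the self-similarity
    norms bounds the result in [0, 1] and yields 1 for identical texts.
    """
    raw = np.linalg.norm(A @ B.T, 2)        # largest singular value
    self_a = np.linalg.norm(A @ A.T, 2)     # = sigma_1(A)**2
    self_b = np.linalg.norm(B @ B.T, 2)     # = sigma_1(B)**2
    return raw / np.sqrt(self_a * self_b)
```

Because the score depends on the largest singular value rather than a sum over all term pairs, many small spurious term similarities contribute little to the result, which is the robustness property the abstract refers to.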
Notes
- 1. This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718_alignmentPurloinedLettertar.
Acknowledgement
We thank the Jaywalker GmbH and the Jaywalker Digital AG for their support of this publication, and especially for annotating the contest data with the best-fitting youth milieus.
A Example Contest Answer
The following snippet is an example user answer for the travel contest (contest 1):
- 1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen bevor die Touristenbusse kommen
- 2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen
- 3. USA: Eine abgespaceste Woche am Burning Man Festival erleben
English translation:
- 1. Jordan: Riding through the desert and marveling at Petra at dawn, before the tourist buses arrive
- 2. Cook Island: Snorkeling with whale sharks and relaxing
- 3. USA: Experiencing a far-out week at the Burning Man Festival
Copyright information
© 2023 Springer Nature Switzerland AG
Cite this paper
vor der Brück, T., Pouly, M. (2023). Spectral Text Similarity Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_7
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-24339-4
Online ISBN: 978-3-031-24340-0