Spectral Text Similarity Measures

  • Conference paper
Computational Linguistics and Intelligent Text Processing (CICLing 2019)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13452)


Abstract

Estimating semantic similarity between texts is of vital importance in many areas of natural language processing, such as information retrieval, question answering, text reuse, or plagiarism detection.

Prevalent semantic similarity estimates based on word embeddings are sensitive to noise: many small individual term similarities can, in aggregate, have a considerable influence on the total estimate. In contrast, the methods proposed here exploit the spectrum of the product of the embedding matrices, which leads to increased robustness compared with conventional methods.

We apply these estimates to two tasks: assigning people to the best-matching marketing target group, and finding the correct match between sentences belonging to two independent translations of the same novel. The evaluation revealed that our proposed method based on the spectral norm increased the accuracy compared to several baseline methods in both scenarios.
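As a rough illustration of the idea sketched above, the following Python snippet computes a spectral-norm-based similarity between two texts from their word-embedding matrices. The function names, the row normalisation, and the use of pre-trained word2vec/GloVe vectors are assumptions made for this sketch, not the authors' exact formulation.

    import numpy as np

    def embedding_matrix(tokens, word_vectors, dim=300):
        # Stack the pre-trained word vectors of a text into a (num_tokens x dim) matrix.
        # `word_vectors` is assumed to be a dict-like mapping token -> vector,
        # e.g. loaded from word2vec or GloVe embeddings (hypothetical setup).
        rows = [word_vectors[t] for t in tokens if t in word_vectors]
        return np.vstack(rows) if rows else np.zeros((1, dim))

    def spectral_similarity(tokens_a, tokens_b, word_vectors):
        # Similarity estimate via the spectral norm (largest singular value)
        # of the product of the two row-normalised embedding matrices.
        A = embedding_matrix(tokens_a, word_vectors)
        B = embedding_matrix(tokens_b, word_vectors)
        A = A / (np.linalg.norm(A, axis=1, keepdims=True) + 1e-12)
        B = B / (np.linalg.norm(B, axis=1, keepdims=True) + 1e-12)
        # Entries of A @ B.T are cosine similarities between word pairs;
        # ord=2 on a matrix yields its largest singular value.
        return np.linalg.norm(A @ B.T, ord=2)

Unlike averaging all pairwise cosine similarities, the largest singular value of the pairwise-similarity matrix is dominated by the strongest shared semantic directions, which illustrates why a spectrum-based aggregate can be less sensitive to many small, noisy term similarities.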


Notes

  1. This corpus can be obtained under the URL https://www.researchgate.net/publication/332072718_alignmentPurloinedLettertar.

  2. http://www.20min.ch.

  3. https://github.com/faraday/wikiprep-esa.


Acknowledgement

We thank Jaywalker GmbH and Jaywalker Digital AG for their support of this publication, and especially for annotating the contest data with the best-fitting youth milieus.

Author information

Correspondence to Tim vor der Brück or Marc Pouly.


A Example Contest Answer

The following snippet is an example user answer for the travel contest (contest 1):

  1. Jordanien: Ritt durch die Wüste und Petra im Morgengrauen bestaunen bevor die Touristenbusse kommen

  2. Cook Island: Schnorcheln mit Walhaien und die Seele baumeln lassen

  3. USA: Eine abgespaceste Woche am Burning Man Festival erleben

English translation:

  1. Jordan: Ride through the desert and marvel at Petra at dawn before the tourist buses arrive

  2. Cook Island: Snorkeling with whale sharks and relaxing

  3. USA: Experience an awesome week at the Burning Man Festival


Copyright information

© 2023 Springer Nature Switzerland AG

About this paper


Cite this paper

vor der Brück, T., Pouly, M. (2023). Spectral Text Similarity Measures. In: Gelbukh, A. (eds) Computational Linguistics and Intelligent Text Processing. CICLing 2019. Lecture Notes in Computer Science, vol 13452. Springer, Cham. https://doi.org/10.1007/978-3-031-24340-0_7


  • DOI: https://doi.org/10.1007/978-3-031-24340-0_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-24339-4

  • Online ISBN: 978-3-031-24340-0
