Similarity Measures for Short Segments of Text

Metzler, Donald; Dumais, Susan; Meek, Christopher

doi:10.1007/978-3-540-71496-5_5

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4425)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

Measuring the similarity between documents and queries has been extensively studied in information retrieval. However, there are a growing number of tasks that require computing the similarity between two very short segments of text. These tasks include query reformulation, sponsored search, and image retrieval. Standard text similarity measures perform poorly on such tasks because of data sparseness and the lack of context. In this work, we study this problem from an information retrieval perspective, focusing on text representations and similarity measures. We examine a range of similarity measures, including purely lexical measures, stemming, and language modeling-based measures. We formally evaluate and analyze the methods on a query-query similarity task using 363,822 queries from a web search log. Our analysis provides insights into the strengths and weaknesses of each method, including important tradeoffs between effectiveness and efficiency.
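
To make the sparseness problem concrete, here is a minimal sketch of two purely lexical measures applied to short queries. It is an illustration only, not the measures or the code evaluated in the paper: the whitespace tokenizer, the `toy_stem` suffix stripper (a crude stand-in for a real stemmer), and the example queries are all assumptions introduced here.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace (an assumed, deliberately naive tokenizer)."""
    return text.lower().split()


def toy_stem(token):
    """Strip a trailing 's' from longer tokens. Hypothetical stand-in for a real stemmer."""
    if len(token) > 3 and token.endswith("s"):
        return token[:-1]
    return token


def jaccard(a, b, stem=False):
    """Jaccard overlap between the term sets of two short texts."""
    terms_a, terms_b = tokenize(a), tokenize(b)
    if stem:
        terms_a = [toy_stem(t) for t in terms_a]
        terms_b = [toy_stem(t) for t in terms_b]
    set_a, set_b = set(terms_a), set(terms_b)
    if not set_a and not set_b:
        return 0.0
    return len(set_a & set_b) / len(set_a | set_b)


def cosine(a, b):
    """Cosine similarity between raw term-frequency vectors."""
    counts_a, counts_b = Counter(tokenize(a)), Counter(tokenize(b))
    dot = sum(counts_a[t] * counts_b[t] for t in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Related queries with no shared terms score zero under lexical matching.
    print(jaccard("nyc hotels", "new york city lodging"))
    # Stemming recovers a match that exact term overlap only partly sees.
    print(jaccard("cheap hotels", "cheap hotel"))
    print(jaccard("cheap hotels", "cheap hotel", stem=True))
    print(cosine("cheap hotels", "cheap hotel"))
```

With only two or three terms per text, a single vocabulary mismatch removes most or all of the overlap, which is the sparseness effect the abstract describes.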

Author information

Authors and Affiliations

University of Massachusetts, Amherst, MA
Donald Metzler
Microsoft Research, Redmond, WA
Susan Dumais & Christopher Meek

Editor information

Giambattista Amati, Claudio Carpineto, Giovanni Romano

Cite this paper

Metzler, D., Dumais, S., Meek, C. (2007). Similarity Measures for Short Segments of Text. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_5

DOI: https://doi.org/10.1007/978-3-540-71496-5_5
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science, Computer Science (R0)
