Short text similarity based on probabilistic topics

Quan, Xiaojun; Liu, Gang; Lu, Zhi; Ni, Xingliang; Wenyin, Liu

doi:10.1007/s10115-009-0250-y

Regular Paper
Published: 17 September 2009

Volume 25, pages 473–491 (2010)

Knowledge and Information Systems

Xiaojun Quan¹,
Gang Liu¹,
Zhi Lu¹,
Xingliang Ni¹,²,³ &
…
Liu Wenyin¹,³

Abstract

In this paper, we propose a new method for measuring the similarity between two short text snippets by comparing each of them with a set of probabilistic topics. Specifically, the method first finds the distinguishing terms between the two short text snippets and compares them with a series of probabilistic topics extracted by a Gibbs sampling algorithm. The relationship between the distinguishing terms of the two snippets can then be discovered by examining their probabilities under each topic. The similarity between the two snippets is calculated from their common terms and from the relationship between their distinguishing terms. Extensive experiments on paraphrasing and question categorization show that the proposed method calculates the similarity of short text snippets more accurately than other methods, including the pure TF-IDF measure.
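
As a rough illustration of the idea described in the abstract, the following Python sketch scores two snippets from their common terms and from how closely their distinguishing terms agree across topics. It is not the scoring formula of the paper, which is not given in this section: the toy `TOPICS` table, the whitespace tokenization, the cosine comparison, and the way the two parts are combined in `snippet_similarity` are all assumptions made for illustration.

```python
import math

# Toy values of P(term | topic). In the paper the topics are extracted by
# Gibbs sampling; these numbers are invented purely for illustration.
TOPICS = [
    {"car": 0.4, "automobile": 0.3, "engine": 0.3},
    {"bank": 0.5, "loan": 0.3, "money": 0.2},
]


def topic_vector(terms, topics):
    """Sum P(term | topic) over the given terms, one entry per topic."""
    return [sum(topic.get(term, 0.0) for term in terms) for topic in topics]


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def snippet_similarity(text_a, text_b, topics=TOPICS):
    """Score two snippets using common terms plus topic-level relatedness
    of their distinguishing terms (hypothetical combination, see lead-in)."""
    terms_a = set(text_a.lower().split())
    terms_b = set(text_b.lower().split())
    union = terms_a | terms_b
    if not union:
        return 0.0

    common = terms_a & terms_b
    only_a = terms_a - terms_b  # distinguishing terms of the first snippet
    only_b = terms_b - terms_a  # distinguishing terms of the second snippet

    # Relationship of the distinguishing terms: compare their probabilities
    # under each topic.
    relatedness = cosine(topic_vector(only_a, topics),
                         topic_vector(only_b, topics))

    # Assumed combination: common terms count fully, distinguishing terms
    # count in proportion to their topic-level relatedness.
    distinguishing = len(union) - len(common)
    return (len(common) + relatedness * distinguishing) / len(union)
```

With these toy topics, `snippet_similarity("fast car", "fast automobile")` scores higher than `snippet_similarity("fast car", "fast loan")`, because `car` and `automobile` are probable under the same topic while `car` and `loan` are not.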

Author information

Authors and Affiliations

Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
Xiaojun Quan, Gang Liu, Zhi Lu, Xingliang Ni & Liu Wenyin
Department of Computer Science and Technology, University of Science and Technology of China, Hefei, China
Xingliang Ni
Joint Research Lab of Excellence, CityU-USTC Advanced Research Institute, Suzhou, China
Xingliang Ni & Liu Wenyin

Corresponding author

Correspondence to Liu Wenyin.

About this article

Cite this article

Quan, X., Liu, G., Lu, Z. et al. Short text similarity based on probabilistic topics. Knowl Inf Syst 25, 473–491 (2010). https://doi.org/10.1007/s10115-009-0250-y

Received: 20 November 2008
Revised: 19 May 2009
Accepted: 21 August 2009
Published: 17 September 2009
Issue Date: December 2010
DOI: https://doi.org/10.1007/s10115-009-0250-y
