Abstract
In this paper, we propose a new method for measuring the similarity between two short text snippets by comparing each of them with a set of probabilistic topics. Specifically, the method first finds the distinguishing terms of the two snippets and compares them with a series of probabilistic topics extracted by a Gibbs sampling algorithm. The relationship between the distinguishing terms of the two snippets is then discovered by examining their probabilities under each topic. Finally, the similarity between the two snippets is calculated from their common terms and the relationship between their distinguishing terms. Extensive experiments on paraphrasing and question categorization show that the proposed method calculates the similarity of short text snippets more accurately than other methods, including the pure TF-IDF measure.
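The pipeline described in the abstract can be illustrated with a minimal sketch. This is not the authors' formula, which the abstract does not state. It assumes that distinguishing terms are the terms appearing in only one of the two snippets, that the relationship between two such terms is the cosine of their per-topic probability vectors, and that the final score is a normalized sum of common-term and related-term credit. The table `TOPIC_PROBS` and the functions `term_relation` and `snippet_similarity` are hypothetical names, and the probabilities are invented for illustration rather than produced by Gibbs sampling.

```python
import math

# Hypothetical P(term | topic) values for three topics. In the paper the
# topics are extracted by Gibbs sampling; these numbers are made up.
TOPIC_PROBS = {
    "car":    [0.30, 0.01, 0.01],
    "auto":   [0.25, 0.02, 0.01],
    "banana": [0.01, 0.40, 0.02],
}


def cosine(u, v):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def term_relation(t1, t2, topic_probs):
    """Assumed relation measure: cosine of the terms' topic-probability vectors.

    Terms missing from the topic table are treated as unrelated (0.0).
    """
    if t1 not in topic_probs or t2 not in topic_probs:
        return 0.0
    return cosine(topic_probs[t1], topic_probs[t2])


def snippet_similarity(a, b, topic_probs):
    """Assumed scoring: common terms count fully, distinguishing terms
    count by their best topic-based match in the other snippet."""
    terms_a = set(a.lower().split())
    terms_b = set(b.lower().split())
    common = terms_a & terms_b
    only_a = terms_a - terms_b  # distinguishing terms of snippet a
    only_b = terms_b - terms_a  # distinguishing terms of snippet b

    def best_match_credit(src, dst):
        # Credit each term in src with its strongest relation to any term in dst.
        return sum(
            max((term_relation(s, d, topic_probs) for d in dst), default=0.0)
            for s in src
        )

    score = (2 * len(common)
             + best_match_credit(only_a, only_b)
             + best_match_credit(only_b, only_a))
    total = len(terms_a) + len(terms_b)
    return score / total if total else 0.0


if __name__ == "__main__":
    print(snippet_similarity("buy a car", "buy a auto", TOPIC_PROBS))
    print(snippet_similarity("buy a car", "buy a banana", TOPIC_PROBS))
```

Under these assumptions, two snippets that differ only in "car" versus "auto" score higher than two that differ in "car" versus "banana", because the first pair of distinguishing terms concentrates its probability on the same topic. A pure term-overlap measure would score both pairs identically.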
Cite this article
Quan, X., Liu, G., Lu, Z. et al. Short text similarity based on probabilistic topics. Knowl Inf Syst 25, 473–491 (2010). https://doi.org/10.1007/s10115-009-0250-y
DOI: https://doi.org/10.1007/s10115-009-0250-y