Short text clustering by finding core terms

  • Regular Paper
Knowledge and Information Systems

Abstract

A new clustering strategy, TermCut, is presented for clustering short text snippets by finding core terms in the corpus. We model the collection of snippets as a graph in which each vertex represents one short text snippet and each weighted edge measures the relationship between the two snippets it connects. TermCut recursively selects a core term and bisects the graph so that the snippets in one part contain the term and the snippets in the other part do not. We apply the proposed method to different types of short text snippets, including questions and search results. Experimental results show that the proposed method outperforms state-of-the-art clustering algorithms on short text snippets.
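The recursive bisection described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not state how edges are weighted or how a core term is scored, so the Jaccard edge weight, the `cut_score` ratio, and the `min_size` and `max_score` stopping parameters below are all assumptions chosen only to make the idea concrete.

```python
from itertools import combinations


def jaccard(a, b):
    """Edge weight between two snippets: Jaccard overlap of their term sets (assumed)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cut_score(docs, with_term, without_term):
    """Hypothetical criterion: cross-part similarity over within-part similarity.

    Lower is better. This is a stand-in, not the criterion used in the paper.
    """
    cross = sum(jaccard(docs[i], docs[j]) for i in with_term for j in without_term)
    within = sum(
        jaccard(docs[i], docs[j])
        for part in (with_term, without_term)
        for i, j in combinations(part, 2)
    )
    return cross / (within + 1e-9)


def term_bisect(docs, ids, min_size=2, max_score=0.5):
    """Recursively split `ids` into snippets that contain a chosen term and those that do not."""
    best = None
    candidate_terms = sorted(set().union(*(docs[i] for i in ids)))
    for term in candidate_terms:
        with_term = [i for i in ids if term in docs[i]]
        without_term = [i for i in ids if term not in docs[i]]
        # Skip terms that would leave one side too small to be a cluster.
        if len(with_term) < min_size or len(without_term) < min_size:
            continue
        score = cut_score(docs, with_term, without_term)
        if best is None or score < best[0]:
            best = (score, with_term, without_term)
    # Stop when no term gives an acceptable bisection.
    if best is None or best[0] > max_score:
        return [sorted(ids)]
    _, with_term, without_term = best
    return (term_bisect(docs, with_term, min_size, max_score)
            + term_bisect(docs, without_term, min_size, max_score))


# Toy corpus: each snippet is already reduced to a set of terms.
snippets = [
    {"python", "list", "sort"},
    {"python", "dict", "sort"},
    {"python", "list", "loop"},
    {"hotel", "paris", "cheap"},
    {"hotel", "paris", "flight"},
    {"hotel", "london", "cheap"},
]
clusters = term_bisect(snippets, list(range(len(snippets))))
print(clusters)
```

On this toy corpus, a single term separates the programming snippets from the travel snippets, and no further split satisfies the size constraint, so the recursion stops after one bisection. A realistic setting would need tokenization, stop-word handling, and a stopping rule tuned to the data.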

Author information

Corresponding author

Correspondence to Liu Wenyin.

About this article

Cite this article

Ni, X., Quan, X., Lu, Z. et al. Short text clustering by finding core terms. Knowl Inf Syst 27, 345–365 (2011). https://doi.org/10.1007/s10115-010-0299-7

  • DOI: https://doi.org/10.1007/s10115-010-0299-7
