
Short text clustering based on Pitman-Yor process mixture model

  • Published in: Applied Intelligence

Abstract

To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require a maximum possible cluster number before inferring the real number of clusters. However, it is difficult to choose a proper value because the true number of clusters in a short-text corpus is unknown beforehand. Moreover, in DMM with a Dirichlet process prior, the cluster-size distribution decays exponentially as the number of clusters increases. Therefore, in this paper we propose a novel model based on the Pitman-Yor process to capture the power-law behavior of the cluster-size distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and nondiscriminative words are identified automatically to help enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
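The cluster-choice rule sketched in the abstract follows the standard Pitman-Yor "urn" scheme: an existing cluster k is chosen with weight proportional to (n_k − d) and a brand-new cluster with weight proportional to (θ + d·K), where d is the discount, θ the concentration, and K the number of active clusters. The following minimal sketch illustrates that generative rule; the parameter names `theta` and `d` and the function names are illustrative, not taken from the authors' released code.

```python
import random

def pyp_cluster_probs(counts, theta=1.0, d=0.5):
    """Seating probabilities under a Pitman-Yor process.

    counts[k] is the number of texts already in cluster k.
    Existing cluster k gets weight (n_k - d); a new cluster
    gets weight (theta + d * K), K = number of active clusters.
    The last entry of the returned list is the new-cluster probability.
    """
    n = sum(counts)
    K = len(counts)
    weights = [n_k - d for n_k in counts] + [theta + d * K]
    total = n + theta  # the weights above sum to exactly n + theta
    return [w / total for w in weights]

def sample_assignments(num_texts, theta=1.0, d=0.5, seed=0):
    """Sequentially assign texts to clusters by drawing from the PYP urn."""
    rng = random.Random(seed)
    counts = []
    for _ in range(num_texts):
        probs = pyp_cluster_probs(counts, theta, d)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1
    return counts
```

Setting d = 0 recovers the Dirichlet process (Chinese restaurant process), whose cluster-size distribution has an exponential tail; d > 0 yields the power-law tail that motivates the paper's model.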


(Figures 1–8 are available in the full article.)

Notes

  1. http://news.google.com

  2. http://trec.nist.gov/data/microblog.html

  3. http://trec.nist.gov/data/microblog.html

  4. http://www.nltk.org

  5. http://scikit-learn.org

  6. Our code is open-sourced at https://github.com/qiang2100/PYPM.

  7. http://psiexp.ss.uci.edu/research/programs_data/toolbox.htm


Acknowledgements

This research is partially supported by the Natural Science Foundation of Jiangsu Province of China under grants (BK20170513, BK20161338), the National Natural Science Foundation of China under grants (61703362, 61402203), the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China under grant 17KJB520045, and the Science and Technology Planning Project of Yangzhou of China under grant YZ2016238.

Author information


Corresponding author

Correspondence to Jipeng Qiang.


About this article


Cite this article

Qiang, J., Li, Y., Yuan, Y. et al. Short text clustering based on Pitman-Yor process mixture model. Appl Intell 48, 1802–1812 (2018). https://doi.org/10.1007/s10489-017-1055-4
