Abstract
To find the appropriate number of clusters in short text clustering, models based on the Dirichlet Multinomial Mixture (DMM) require an upper bound on the number of clusters before inferring the actual number. However, choosing a proper bound is difficult because the true number of clusters in short texts is unknown beforehand. Moreover, under a Dirichlet process prior, the cluster-size distribution in DMM decays exponentially as the number of clusters increases. We therefore propose a novel model based on the Pitman-Yor process to capture the power-law behavior of the cluster distribution. Specifically, each text chooses one of the active clusters or a new cluster with probabilities derived from the Pitman-Yor Process Mixture model (PYPM). Discriminative and non-discriminative words are identified automatically to enhance text clustering. Parameters are estimated efficiently by collapsed Gibbs sampling, and experimental results show that PYPM is robust and effective compared with state-of-the-art models.
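To illustrate the cluster-choice step described above, the following is a minimal sketch of the Pitman-Yor prior weights under the usual Chinese-restaurant formulation: an existing cluster k of size n_k is chosen with weight proportional to (n_k − d), and a new cluster with weight proportional to (α + d·K), where d is the discount, α the concentration, and K the number of active clusters. This shows only the prior part; in the full PYPM these weights are further multiplied by the multinomial likelihood of the text's words under each cluster, and the function name and parameter defaults here are illustrative, not from the paper.

```python
def pyp_cluster_probs(cluster_sizes, discount=0.5, concentration=1.0):
    """Prior probabilities for assigning a new text under a Pitman-Yor
    Chinese-restaurant scheme.

    Existing cluster k gets weight (n_k - discount); a new cluster gets
    weight (concentration + discount * K), K = number of active clusters.
    The returned list has len(cluster_sizes) + 1 entries; the last entry
    is the probability of opening a new cluster.
    """
    K = len(cluster_sizes)
    weights = [max(n - discount, 0.0) for n in cluster_sizes]
    weights.append(concentration + discount * K)  # weight of a new cluster
    total = sum(weights)
    return [w / total for w in weights]

# Example: three active clusters of sizes 8, 3, and 1.
probs = pyp_cluster_probs([8, 3, 1], discount=0.5, concentration=1.0)
```

Because the discount d is subtracted from every occupied cluster and recycled into the new-cluster weight, small clusters are penalized relative to a plain Dirichlet process (d = 0), which is what produces the heavier, power-law tail of cluster sizes.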
Acknowledgements
This research is partially supported by the Natural Science Foundation of Jiangsu Province of China under grants BK20170513 and BK20161338, the National Natural Science Foundation of China under grants 61703362 and 61402203, the Natural Science Foundation of the Higher Education Institutions of Jiangsu Province of China under grant 17KJB520045, and the Science and Technology Planning Project of Yangzhou of China under grant YZ2016238.
Cite this article
Qiang, J., Li, Y., Yuan, Y. et al. Short text clustering based on Pitman-Yor process mixture model. Appl Intell 48, 1802–1812 (2018). https://doi.org/10.1007/s10489-017-1055-4