Hierarchical evolving Dirichlet processes for modeling nonlinear evolutionary traces in temporal data

Data Mining and Knowledge Discovery

Abstract

Clustering analysis aims to group similar data objects into the same cluster. Topic models, a family of soft clustering methods, are powerful tools for discovering latent clusters (topics) in large data sets. Because temporal data are dynamic, clusters often exhibit complicated patterns such as birth, branching and death. However, most existing temporal clustering models assume that clusters evolve along a linear chain, so they cannot model or detect the branching of clusters. In this paper, we present evolving Dirichlet processes (EDP for short) to model nonlinear evolutionary traces in temporal data, especially temporal text collections. In EDP, a temporal collection is divided into epochs. To model cluster branching over time, EDP lets each cluster in an epoch form a Dirichlet process (DP) and uses a combination of these cluster-specific DPs as the prior for the cluster distributions in the next epoch. To model hierarchical temporal data, such as online document collections, we propose a new class of evolving hierarchical Dirichlet processes (EHDP for short), which extends hierarchical Dirichlet processes (HDP) to evolving temporal data. We design an online learning framework based on Gibbs sampling to infer the evolutionary traces of clusters over time. In experiments on both synthetic and real-world text collections, we validate that EDP and EHDP capture nonlinear evolutionary traces of clusters and achieve better results than their peers.
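The idea of clusters in one epoch seeding the clusters of the next can be illustrated with a toy generative sketch. The code below is not the EDP model or its Gibbs sampler, neither of which is specified in the abstract. It is a simplified Chinese-restaurant-style draw written only for illustration: each new cluster in the current epoch picks a parent among the previous epoch's clusters in proportion to the parent's size, or is born without a parent. The function name `sample_epoch` and the parameters `alpha` and `gamma` are hypothetical.

```python
import random


def sample_epoch(n_points, parent_sizes, alpha=1.0, gamma=1.0, rng=None):
    """Toy CRP-style draw for one epoch (illustrative assumption, not EDP).

    parent_sizes maps each previous-epoch cluster id to its size.
    Returns (assignments, parents): assignments[i] is the child cluster
    index of point i, and parents[c] is the parent id of child cluster c
    (None means the cluster is newborn).
    """
    rng = rng or random.Random()
    assignments = []   # child cluster index per data point
    child_sizes = []   # current size of each child cluster
    parents = []       # parent id of each child cluster

    for _ in range(n_points):
        # Join an existing child cluster with weight equal to its size,
        # or open a new child cluster with weight alpha.
        weights = child_sizes + [alpha]
        choice = rng.choices(range(len(weights)), weights=weights)[0]

        if choice == len(child_sizes):
            # New child cluster: choose a parent in proportion to its
            # previous-epoch size, or no parent (birth) with weight gamma.
            ids = list(parent_sizes) + [None]
            parent_weights = [parent_sizes[p] for p in parent_sizes] + [gamma]
            parents.append(rng.choices(ids, weights=parent_weights)[0])
            child_sizes.append(1)
        else:
            child_sizes[choice] += 1

        assignments.append(choice)

    return assignments, parents


if __name__ == "__main__":
    previous_epoch = {"A": 30, "B": 20}  # hypothetical cluster sizes
    assignments, parents = sample_epoch(50, previous_epoch,
                                        rng=random.Random(0))
    print("parents of child clusters:", parents)
```

In this sketch, a parent that appears more than once in `parents` has branched, a `None` entry marks a birth, and a previous-epoch cluster that never appears has died. A linear-chain model would instead allow each parent at most one child.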

Acknowledgments

This work was supported by NSFC (61370025, 61502479), Australia ARC Discovery Project (DP140102206) and the Strategic Leading Science and Technology Projects of CAS (No. XDA06030200).

Author information

Corresponding author

Correspondence to Peng Zhang.

Additional information

Responsible editor: Charu Aggarwal.

About this article

Cite this article

Wang, P., Zhang, P., Zhou, C. et al. Hierarchical evolving Dirichlet processes for modeling nonlinear evolutionary traces in temporal data. Data Min Knowl Disc 31, 32–64 (2017). https://doi.org/10.1007/s10618-016-0454-1

  • DOI: https://doi.org/10.1007/s10618-016-0454-1
