Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm

Data Mining and Knowledge Discovery

Abstract

A large amount of user-labeled text is available on the Internet, such as web pages with categories, papers with corresponding keywords, and tweets with hashtags. In recent years, supervised topic models, such as Labeled Latent Dirichlet Allocation, have been widely used to discover the abstract topics in labeled text corpora. However, these models rely on the bag-of-words assumption and therefore ignore word order, which loses much semantic information. In this paper, to model label information and word order jointly, we propose a novel topic model, called Labeled Phrase Latent Dirichlet Allocation (LPLDA), which regards each document as a mixture of phrases and thus partly accounts for word order. To estimate the parameters of the proposed LPLDA model, we develop a batch inference algorithm based on Gibbs sampling. Moreover, to speed up LPLDA on large-scale stream data, we further propose an online inference algorithm for LPLDA. Extensive experiments compare LPLDA with four state-of-the-art baselines. The results show that (1) batch LPLDA significantly outperforms the baselines in terms of the case study, perplexity, scalability, and the third-party task in most cases; and (2) the online algorithm for LPLDA is considerably more efficient than the batch method while still producing good results.

Notes

  1. https://www.domo.com/learn/data-never-sleeps-2 (Accessed date: March 1, 2017).

  2. http://twitter.com/ (Accessed date: March 1, 2017).

  3. https://answers.yahoo.com/ (Accessed date: March 1, 2017).

References

  • Agrawal R, Srikant R et al (1994) Fast algorithms for mining association rules. In: Proceedings of the 20th very large data bases (VLDB) conference, vol 1215, pp 487–499

  • AlSumait L, Barbará D, Domeniconi C (2008) On-line LDA: adaptive topic models for mining text streams with applications to topic detection and tracking. In: 2008 Eighth IEEE international conference on data mining. IEEE, pp 3–12

  • Banerjee A, Basu S (2007) Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM international conference on data mining. SIAM, pp 431–436

  • Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022

  • Broderick T, Boyd N, Wibisono A, Wilson AC, Jordan MI (2013) Streaming variational Bayes. In: Advances in neural information processing systems. Curran Associates Inc., pp 1727–1735

  • Canini KR, Shi L, Griffiths TL (2009) Online inference of topics with latent Dirichlet allocation. In: Proceedings of the twelfth international conference on artificial intelligence and statistics, vol 5. PMLR, pp 65–72

  • Eddy SR (1996) Hidden Markov models. Curr Opin Struct Biol 6(3):361–365

  • El-Kishky A, Song Y, Wang C, Voss CR, Han J (2014) Scalable topical phrase mining from text corpora. In: Proceedings of the VLDB endowment, vol 8, no 3, pp 305–316

  • Foulds J, Boyles L, DuBois C, Smyth P, Welling M (2013) Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 446–454

  • Gao Y, Chen J, Zhu J (2016) Streaming Gibbs sampling for LDA model. ArXiv preprint arXiv:1601.01142

  • Ghahramani Z, Attias H (2000) Online variational Bayesian learning. Slides from talk presented at neural information processing systems workshop on online learning

  • Griffiths TL, Steyvers M (2004) Finding scientific topics. Proc Natl Acad Sci 101(suppl 1):5228–5235

  • Griffiths TL, Steyvers M, Blei DM, Tenenbaum JB (2005) Integrating topics and syntax. In: Advances in neural information processing systems, vol 17. MIT Press, pp 537–544

  • Han J, Pei J, Kamber M (2011) Data mining: concepts and techniques. Elsevier, Amsterdam

  • Hoffman M, Blei D (2015) Stochastic structured variational inference. In: Proceedings of the eighteenth international conference on artificial intelligence and statistics, vol 38. PMLR, pp 361–369

  • Hoffman M, Bach FR, Blei DM (2010) Online learning for latent Dirichlet allocation. In: Advances in neural information processing systems, vol 23. Curran Associates Inc., pp 856–864

  • Hoffman MD, Blei DM, Wang C, Paisley JW (2013) Stochastic variational inference. J Mach Learn Res 14(1):1303–1347

  • Kingma DP, Welling M (2014) Auto-encoding variational Bayes. In: 2nd international conference on learning representations (ICLR 2014), Ithaca, NY

  • Lacoste-Julien S, Sha F, Jordan MI (2009) DiscLDA: discriminative learning for dimensionality reduction and classification. In: Advances in neural information processing systems, vol 21. Curran Associates Inc., pp 897–904

  • Lakkaraju H, Bhattacharyya C, Bhattacharya I, Merugu S (2011) Exploiting coherence for the simultaneous discovery of latent facets and associated sentiments. In: Proceedings of the 2011 SIAM international conference on data mining. SIAM, pp 498–509

  • Li X, Ouyang J, Zhou X (2016) Labelset topic model for multi-label document classification. J Intell Inf Syst 46(1):83–97

  • Liang S, Ren Z, Zhao Y, Ma J, Yilmaz E, de Rijke M (2017) Inferring dynamic user interests in short text streams for user clustering. ACM Trans Inf Syst (TOIS) 36(1):10:1–10:37

  • Lindsey RV, Headden III WP, Stipicevic MJ (2012) A phrase-discovering topic model using hierarchical Pitman–Yor processes. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. Association for Computational Linguistics, pp 214–222

  • Magnusson M, Jonsson L, Villani M (2016) DOLDA—a regularized supervised topic model for high-dimensional multi-class regression. ArXiv preprint arXiv:1602.00260

  • Mao XL, Ming ZY, Chua TS, Li S, Yan H, Li X (2012) SSHLDA: a semi-supervised hierarchical topic model. In: Proceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning. Association for Computational Linguistics, pp 800–809

  • Mcauliffe JD, Blei DM (2008) Supervised topic models. In: Advances in neural information processing systems, vol 20. Curran Associates Inc., pp 121–128

  • McInerney J, Ranganath R, Blei D (2015) The population posterior and Bayesian modeling on streams. In: Advances in neural information processing systems, vol 28. Curran Associates Inc., pp 1153–1161

  • Mukherjee S, Basu G, Joshi S (2014) Joint author sentiment topic model. In: Proceedings of the 2014 SIAM international conference on data mining. SIAM, pp 370–378

  • Perotte AJ, Wood F, Elhadad N, Bartlett N (2011) Hierarchically supervised latent Dirichlet allocation. In: Advances in neural information processing systems, vol 24. Curran Associates Inc., pp 2609–2617

  • Petinot Y, McKeown K, Thadani K (2011) A hierarchical model of web summaries. In: Proceedings of the 49th annual meeting of the association for computational linguistics: human language technologies: short papers, vol 2. Association for Computational Linguistics, pp 670–675

  • Ramage D, Hall D, Nallapati R, Manning CD (2009a) Labeled LDA: a supervised topic model for credit attribution in multi-labeled corpora. In: Proceedings of the 2009 conference on empirical methods in natural language processing, vol 1. Association for Computational Linguistics, pp 248–256

  • Ramage D, Heymann P, Manning CD, Garcia-Molina H (2009b) Clustering the tagged web. In: Proceedings of the second ACM international conference on web search and data mining. ACM, pp 54–63

  • Ramage D, Manning CD, Dumais S (2011) Partially labeled topic models for interpretable text mining. In: Proceedings of the 17th ACM SIGKDD international conference on knowledge discovery and data mining. ACM, pp 457–465

  • Ren Z, Liang S, Meij E, de Rijke M (2013) Personalized time-aware tweets summarization. In: Proceedings of the 36th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 513–522

  • Ren Z, Liang S, Li P, Wang S, de Rijke M (2017) Social collaborative viewpoint regression with explainable recommendations. In: Proceedings of the tenth ACM international conference on web search and data mining. ACM, pp 485–494

  • Rosen-Zvi M, Griffiths T, Steyvers M, Smyth P (2004) The author-topic model for authors and documents. In: Proceedings of the 20th conference on uncertainty in artificial intelligence. AUAI Press, pp 487–494

  • Rubin TN, Chambers A, Smyth P, Steyvers M (2012) Statistical topic models for multi-label document classification. Mach Learn 88:157–208

  • Schapire RE, Singer Y (2000) BoosTexter: a boosting-based system for text categorization. Mach Learn 39:135–168

  • Shi T, Zhu J (2014) Online Bayesian passive-aggressive learning. In: Proceedings of the 31st international conference on machine learning, vol 32. JMLR.org, pp I-378–I-386

  • Slutsky A, Hu X, An Y (2013) Tree labeled LDA: a hierarchical model for web summaries. In: IEEE international conference on big data. IEEE, pp 134–140

  • Song X, Lin CY, Tseng BL, Sun MT (2005) Modeling and predicting personal information dissemination behavior. In: Proceedings of the eleventh ACM SIGKDD international conference on knowledge discovery in data mining. ACM, pp 479–488

  • Spagnola S, Lagoze C (2011) Word order matters: measuring topic coherence with lexical argument structure. In: Proceedings of the 11th annual international ACM/IEEE joint conference on Digital libraries. ACM, pp 21–24

  • Tang J, Zhang M, Mei Q (2014) “Look ma, no hands!” A parameter-free topic model. ArXiv preprint arXiv:1409.2993

  • Tang YK, Mao XL, Huang H (2016) Labeled phrase latent Dirichlet allocation. In: International conference on web information systems engineering. Springer, pp 525–536

  • Tang YK, Mao XL, Huang H, Shi X, Wen G (2018) Conceptualization topic modeling. Multimedia Tools Appl 77(3):3455–3471

  • Wallach HM (2006) Topic modeling: beyond bag-of-words. In: Proceedings of the 23rd international conference on Machine learning. ACM, pp 977–984

  • Wang C, Danilevsky M, Desai N, Zhang Y, Nguyen P, Taula T, Han J (2013) A phrase mining framework for recursive construction of a topical hierarchy. In: Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 437–445

  • Wang X, McCallum A, Wei X (2007) Topical n-grams: phrase and topic discovery, with an application to information retrieval. In: Seventh IEEE international conference on data mining (ICDM 2007). IEEE, pp 697–702

  • Wang Y, Agichtein E, Benzi M (2012) TM-LDA: efficient online modeling of latent topic transitions in social media. In: Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, pp 123–131

  • Xiao H, Wang X, Du C (2009) Injecting structured data to generative topic model in enterprise settings. In: Advances in machine learning: first Asian conference on machine learning, ACML 2009. Springer, Berlin, pp 382–395

  • Xiao X, Xiong D, Zhang M, Liu Q, Lin S (2012) A topic similarity model for hierarchical phrase-based translation. In: Proceedings of the 50th annual meeting of the association for computational linguistics: long papers, vol 1. Association for Computational Linguistics, pp 750–758

  • Zhang A, Zhu J, Zhang B (2013) Sparse online topic models. In: Proceedings of the 22nd international conference on world wide web. ACM, pp 1489–1500

  • Zhao WX, Wang J, He Y, Nie J, Wen J, Li X (2015) Incorporating social role theory into topic models for social media content analysis. IEEE Trans Knowl Data Eng 27(4):1032–1044

  • Zhao Y, Liang S, Ren Z, Ma J, Yilmaz E, de Rijke M (2016) Explainable user clustering in short text streams. In: Proceedings of the 39th international ACM SIGIR conference on research and development in information retrieval. ACM, pp 155–164

  • Zhou Q, Huang H, Mao XL (2015) An online inference algorithm for labeled latent Dirichlet allocation. In: Proceedings on web technologies and applications: 17th Asia-Pacific web conference, APWeb 2015, Guangzhou, China, 18–20 Sept 2015. Springer, pp 17–28

  • Zhu J, Chen N, Perkins H, Zhang B (2013) Gibbs max-margin topic models with fast sampling algorithms. In: Proceedings of the 30th international conference on machine learning, vol 28. PMLR, pp 124–132

Acknowledgements

This work was supported by the National Key Research and Development Program of China (2016YFB1000902), the China National Science Foundation (61402036, 61772076), the Beijing Advanced Innovation Center for Imaging Technology (BAICIT-2016007), the Open Fund Project from the Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (ICDD201701), and the Open Fund Project of the Fujian Provincial Key Laboratory of Information Processing and Intelligent Control (Minjiang University) (MJUKF201738). A preliminary version of this work appears in Tang et al. (2016).

Author information

Corresponding author

Correspondence to Xian-Ling Mao.

Additional information

Responsible editor: Pauli Miettinen.

About this article

Cite this article

Tang, YK., Mao, XL. & Huang, H. Labeled Phrase Latent Dirichlet Allocation and its online learning algorithm. Data Min Knowl Disc 32, 885–912 (2018). https://doi.org/10.1007/s10618-018-0555-0

  • DOI: https://doi.org/10.1007/s10618-018-0555-0
