Peacock: Learning Long-Tail Topic Features for Industrial Applications

Published: 15 July 2015

Abstract

Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia, but less so in industry, especially in large-scale applications such as search engines and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to 10^3 topics, which can hardly cover the long tail of semantic word sets. In this article, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a “big” LDA model with at least 10^5 topics inferred from 10^9 search queries can achieve significant improvements on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include a hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system provides significant benefits via highly scalable LDA topic models in several industrial applications.
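The abstract does not spell out the inference procedure, but one standard way to train LDA models of this kind is collapsed Gibbs sampling, whose per-token update is compact enough to sketch. The single-machine toy below is our illustration of that update, not Peacock's implementation (all names and defaults here are ours); Peacock's contribution is running this kind of update at 10^5 topics over 10^9 documents, which is what motivates the hierarchical distributed architecture described above.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on a tiny corpus.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # doc d, topic k token counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic k, word w token counts
    nk = [0] * num_topics                                # total tokens per topic
    z = []                                               # topic assignment per token
    for d, doc in enumerate(docs):                       # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                              # remove token from counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(z_i = t | rest) ∝ (n_dt + alpha) * (n_tw + beta) / (n_t + V*beta)
                probs = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) /
                         (nk[t] + vocab_size * beta) for t in range(num_topics)]
                r = rng.random() * sum(probs)            # sample from the conditional
                acc, k = 0.0, num_topics - 1
                for t, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k                              # add token back under new topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny "documents" over a 4-word vocabulary with disjoint themes.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]]
ndk, nkw = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

The key scaling difficulty the article targets is visible even here: the sampling step is linear in the number of topics, so moving from 10^3 to 10^5 topics multiplies per-token work by 100 unless the counts are sharded and the sampler exploits sparsity.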




    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 6, Issue 4
    Regular Papers and Special Section on Intelligent Healthcare Informatics
    August 2015, 419 pages
    ISSN: 2157-6904
    EISSN: 2157-6912
    DOI: 10.1145/2801030
    Editor: Yu Zheng

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 July 2015
    Accepted: 01 December 2014
    Revised: 01 October 2014
    Received: 01 May 2014
    Published in TIST Volume 6, Issue 4


    Author Tags

    1. Latent Dirichlet allocation
    2. big data
    3. big topic models
    4. long-tail topic features
    5. online advertising systems
    6. search engine

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Innovative Research Team in Soochow University
    • Natural Science Foundation of the Jiangsu Higher Education Institutions of China
    • National Grant Fundamental Research (973 Program) of China
    • National Natural Science Foundation of China


    Cited By

    • (2024) Effective LSTMs with seasonal-trend decomposition and adaptive learning and niching-based backtracking search algorithm for time series forecasting. Expert Systems with Applications 236:C. DOI: 10.1016/j.eswa.2023.121202
    • (2023) Development of the Concept and Architecture of an Automated System for Updating Physical Knowledge for Information Support of Search Design. 2023 International Russian Smart Industry Conference (SmartIndustryCon), 281-288. DOI: 10.1109/SmartIndustryCon57312.2023.10110764
    • (2021) Effective Implementations of Topic Modeling Algorithms. Programming and Computer Software 47:7, 483-492. DOI: 10.1134/S0361768821070021
    • (2021) A Topic Coverage Approach to Evaluation of Topic Models. IEEE Access 9, 123280-123312. DOI: 10.1109/ACCESS.2021.3109425
    • (2020) SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. IEEE Transactions on Parallel and Distributed Systems 31:9, 2112-2124. DOI: 10.1109/TPDS.2020.2979702
    • (2020) A Transfer Learning Based Super-Resolution Microscopy for Biopsy Slice Images: The Joint Methods Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics. DOI: 10.1109/TCBB.2020.2991173
    • (2020) Thinking on the Application of Big Data in Fault Diagnosis of Military Equipment. 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), 1451-1457. DOI: 10.1109/ITAIC49862.2020.9339172
    • (2019) A bibliometric analysis of topic modelling studies (2000-2017). Journal of Information Science. DOI: 10.1177/0165551519877049
    • (2018) Continuum. Proceedings of the ACM Symposium on Cloud Computing, 26-40. DOI: 10.1145/3267809.3267817
    • (2018) Latent topics resonance in scientific literature and commentaries: evidences from natural language processing approach. Heliyon 4:6. DOI: 10.1016/j.heliyon.2018.e00659
