Peacock: Learning Long-Tail Topic Features for Industrial Applications

Published: 15 July 2015

Abstract

Latent Dirichlet allocation (LDA) is a popular topic modeling technique in academia, but less so in industry, especially in large-scale applications such as search engines and online advertising systems. A main underlying reason is that the topic models used have been too small in scale to be useful; for example, some of the largest LDA models reported in the literature have up to 10^3 topics, which can hardly cover the long tail of semantic word sets. In this article, we show that the number of topics is a key factor that can significantly boost the utility of topic-modeling systems. In particular, we show that a “big” LDA model with at least 10^5 topics inferred from 10^9 search queries can achieve significant improvements on industrial search engine and online advertising systems, both of which serve hundreds of millions of users. We develop a novel distributed system called Peacock to learn big LDA models from big data. The main features of Peacock include a hierarchical distributed architecture, real-time prediction, and topic de-duplication. We empirically demonstrate that the Peacock system provides significant benefits via highly scalable LDA topic models in several industrial applications.
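The abstract does not spell out the inference procedure, but one standard way to train LDA models of this kind is collapsed Gibbs sampling, whose per-token update is compact enough to sketch. The single-machine toy below is our illustration of that update, not Peacock's implementation (all names and defaults here are ours); Peacock's contribution is running this kind of update at 10^5 topics over 10^9 documents, which is what motivates the hierarchical distributed architecture described above.

```python
import random

def lda_gibbs(docs, num_topics, vocab_size, alpha=0.1, beta=0.01, iters=100, seed=0):
    """Collapsed Gibbs sampling for LDA on a tiny corpus.

    docs: list of documents, each a list of integer word ids in [0, vocab_size).
    Returns (doc_topic_counts, topic_word_counts).
    """
    rng = random.Random(seed)
    ndk = [[0] * num_topics for _ in docs]               # doc d, topic k token counts
    nkw = [[0] * vocab_size for _ in range(num_topics)]  # topic k, word w token counts
    nk = [0] * num_topics                                # total tokens per topic
    z = []                                               # topic assignment per token
    for d, doc in enumerate(docs):                       # random initialisation
        zd = []
        for w in doc:
            k = rng.randrange(num_topics)
            zd.append(k)
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
        z.append(zd)
    for _ in range(iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                k = z[d][i]                              # remove token from counts
                ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
                # p(z_i = t | rest) ∝ (n_dt + alpha) * (n_tw + beta) / (n_t + V*beta)
                probs = [(ndk[d][t] + alpha) * (nkw[t][w] + beta) /
                         (nk[t] + vocab_size * beta) for t in range(num_topics)]
                r = rng.random() * sum(probs)            # sample from the conditional
                acc, k = 0.0, num_topics - 1
                for t, p in enumerate(probs):
                    acc += p
                    if r <= acc:
                        k = t
                        break
                z[d][i] = k                              # add token back under new topic
                ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1
    return ndk, nkw

# Two tiny "documents" over a 4-word vocabulary with disjoint themes.
docs = [[0, 0, 1, 1, 0], [2, 3, 3, 2, 2]]
ndk, nkw = lda_gibbs(docs, num_topics=2, vocab_size=4)
```

The key scaling difficulty the article targets is visible even here: the sampling step is linear in the number of topics, so moving from 10^3 to 10^5 topics multiplies per-token work by 100 unless the counts are sharded and the sampler exploits sparsity.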




    Published In

    ACM Transactions on Intelligent Systems and Technology, Volume 6, Issue 4
    Regular Papers and Special Section on Intelligent Healthcare Informatics
    August 2015, 419 pages
    ISSN: 2157-6904
    EISSN: 2157-6912
    DOI: 10.1145/2801030
    Editor: Yu Zheng

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 15 July 2015
    Accepted: 01 December 2014
    Revised: 01 October 2014
    Received: 01 May 2014
    Published in TIST Volume 6, Issue 4


    Author Tags

    1. Latent Dirichlet allocation
    2. big data
    3. big topic models
    4. long-tail topic features
    5. online advertising systems
    6. search engine

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Innovative Research Team in Soochow University
    • Natural Science Foundation of the Jiangsu Higher Education Institutions of China
    • National Grant Fundamental Research (973 Program) of China
    • National Natural Science Foundation of China


    Cited By

    • (2024) Effective LSTMs with seasonal-trend decomposition and adaptive learning and niching-based backtracking search algorithm for time series forecasting. Expert Systems with Applications 236:C. DOI: 10.1016/j.eswa.2023.121202
    • (2023) Development of the Concept and Architecture of an Automated System for Updating Physical Knowledge for Information Support of Search Design. 2023 International Russian Smart Industry Conference (SmartIndustryCon), 281-288. DOI: 10.1109/SmartIndustryCon57312.2023.10110764
    • (2021) Effective Implementations of Topic Modeling Algorithms. Programming and Computer Software 47:7, 483-492. DOI: 10.1134/S0361768821070021
    • (2021) A Topic Coverage Approach to Evaluation of Topic Models. IEEE Access 9, 123280-123312. DOI: 10.1109/ACCESS.2021.3109425
    • (2020) SaberLDA: Sparsity-Aware Learning of Topic Models on GPUs. IEEE Transactions on Parallel and Distributed Systems 31:9, 2112-2124. DOI: 10.1109/TPDS.2020.2979702
    • (2020) A Transfer Learning Based Super-Resolution Microscopy for Biopsy Slice Images: The Joint Methods Perspective. IEEE/ACM Transactions on Computational Biology and Bioinformatics. DOI: 10.1109/TCBB.2020.2991173
    • (2020) Thinking on the Application of Big Data in Fault Diagnosis of Military Equipment. 2020 IEEE 9th Joint International Information Technology and Artificial Intelligence Conference (ITAIC), 1451-1457. DOI: 10.1109/ITAIC49862.2020.9339172
    • (2019) A bibliometric analysis of topic modelling studies (2000-2017). Journal of Information Science. DOI: 10.1177/0165551519877049
    • (2018) Continuum. Proceedings of the ACM Symposium on Cloud Computing, 26-40. DOI: 10.1145/3267809.3267817
    • (2018) Latent topics resonance in scientific literature and commentaries: evidences from natural language processing approach. Heliyon 4:6. DOI: 10.1016/j.heliyon.2018.e00659
