DOI: 10.1145/2983323.2983765

Understanding Sparse Topical Structure of Short Text via Stochastic Variational-Gibbs Inference

Published: 24 October 2016

Abstract

With the soaring popularity of online social media such as Twitter, analyzing short text has emerged as an increasingly important task, one that challenges classical topic models because short text exhibits topic sparsity. Topic sparsity refers to the observation that an individual document usually concentrates on only a few salient topics, which may be rare in the corpus as a whole. Understanding this sparse topical structure of short text is recognized as a key ingredient for mining user-generated Web content and social media, which take the form of extremely short posts and discussions. However, existing sparsity-enhanced topic models all assume an over-complicated generative process, which severely limits their scalability and leaves them unable to infer the number of topics automatically from data.
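To make the notions of topic sparsity and an unbounded topic count concrete, here is a minimal sketch (not taken from the paper) of the Indian Buffet Process generative metaphor: each document selects a sparse binary subset of topics, and the number of topics grows with the corpus rather than being fixed in advance. The function name `sample_ibp`, the hyperparameter values, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def sample_ibp(num_docs, alpha, rng=None):
    """Draw a binary document-topic selection matrix Z from an
    Indian Buffet Process with concentration parameter alpha.

    Z[d, k] = 1 means document d "uses" topic k. The number of
    columns (topics) is not fixed in advance: it grows with the
    data, which is how IBP-based models infer the topic count.
    """
    rng = rng or np.random.default_rng(0)
    Z = np.zeros((num_docs, 0), dtype=int)
    for d in range(num_docs):
        if Z.shape[1] > 0:
            # Pick an existing topic k with probability m_k / (d + 1),
            # where m_k is how many earlier documents already use it.
            m = Z[:d].sum(axis=0)
            Z[d] = rng.random(Z.shape[1]) < m / (d + 1)
        # Open Poisson(alpha / (d + 1)) brand-new topics.
        k_new = rng.poisson(alpha / (d + 1))
        if k_new > 0:
            fresh = np.zeros((num_docs, k_new), dtype=int)
            fresh[d] = 1
            Z = np.hstack([Z, fresh])
    return Z

Z = sample_ibp(num_docs=1000, alpha=3.0)
print(Z.shape[1], "topics total;", Z.sum(axis=1).mean(), "per document")
```

Each row of the sampled matrix is sparse (a few active topics per document) while popular topics are shared across many documents, which is exactly the structure the abstract describes.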
In this paper, we propose a probabilistic Bayesian topic model, the Sparse Dirichlet mixture Topic Model (SparseDTM), built on an Indian Buffet Process (IBP) prior, and fit it to large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach recovers an exactly sparse topical structure for large short-text collections and automatically identifies the number of topics, striking a good balance between completeness and homogeneity of topic coherence. Experiments on large text corpora of different genres demonstrate that our approach outperforms various existing sparse topic models, with significant improvements on large-scale collections of short text.
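The abstract does not spell out the algorithm, but a generic stochastic variational-Gibbs loop combines the two named ingredients: collapsed Gibbs sampling for the local (per-document) topic assignments on a minibatch, and a Robbins-Monro stochastic natural-gradient step on the global variational topic parameters. The sketch below is one plausible reading under those assumptions, not SparseDTM itself; it fixes the number of topics for simplicity (the paper infers it via the IBP prior), and all names and hyperparameter values are hypothetical.

```python
import numpy as np
from scipy.special import digamma

def stochastic_variational_gibbs(docs, vocab_size, num_topics,
                                 alpha=0.1, eta=0.01, tau=1.0, kappa=0.6,
                                 num_iters=500, gibbs_sweeps=5,
                                 batch_size=64, rng=None):
    """Hybrid inference sketch: collapsed Gibbs sampling for the local
    topic assignments of a minibatch, then a Robbins-Monro stochastic
    natural-gradient step on the global variational Dirichlet
    parameters lam[k, w] of each topic's word distribution.
    `docs` is a list of integer word-id arrays, one per document.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.gamma(1.0, 1.0, size=(num_topics, vocab_size))
    num_docs = len(docs)
    for t in range(num_iters):
        rho = (t + tau) ** (-kappa)  # Robbins-Monro step size
        batch = rng.choice(num_docs, size=batch_size, replace=False)
        # E_q[log beta_{kw}] under the current global factors.
        log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        stats = np.zeros_like(lam)   # minibatch sufficient statistics
        for d in batch:
            w = docs[d]
            z = rng.integers(num_topics, size=len(w))
            counts = np.bincount(z, minlength=num_topics).astype(float)
            for _ in range(gibbs_sweeps):
                for n in range(len(w)):   # resample one token at a time
                    counts[z[n]] -= 1
                    logp = np.log(counts + alpha) + log_beta[:, w[n]]
                    p = np.exp(logp - logp.max())
                    z[n] = rng.choice(num_topics, p=p / p.sum())
                    counts[z[n]] += 1
            np.add.at(stats, (z, w), 1.0)  # topic-word counts from samples
        # Noisy natural-gradient update, rescaled to the whole corpus.
        lam = (1 - rho) * lam + rho * (eta + (num_docs / batch_size) * stats)
    return lam
```

The `num_docs / batch_size` rescaling makes each minibatch update an unbiased estimate of the full-corpus natural gradient, which is what allows schemes of this family to scale to large collections.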



    Published In

CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
    October 2016
    2566 pages
    ISBN:9781450340731
    DOI:10.1145/2983323


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

1. Indian buffet process
2. short text
3. sparse topical structure
4. stochastic variational-Gibbs inference
5. topic modeling

    Qualifiers

    • Research-article

    Funding Sources

    • The Chinese University of Hong Kong

    Conference

    CIKM'16
    Sponsor:
    CIKM'16: ACM Conference on Information and Knowledge Management
    October 24 - 28, 2016
Indianapolis, Indiana, USA

    Acceptance Rates

CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
