DOI: 10.1145/2983323.2983765

Understanding Sparse Topical Structure of Short Text via Stochastic Variational-Gibbs Inference

Published: 24 October 2016

Abstract

With the soaring popularity of online social media such as Twitter, analyzing short text has emerged as an increasingly important task, one that challenges classical topic models because short text exhibits topic sparsity. Topic sparsity refers to the observation that an individual document usually concentrates on only a few salient topics, which may be rare in the corpus as a whole. Understanding this sparse topical structure of short text is recognized as a key ingredient for mining user-generated Web content and social media, which take the form of extremely short posts and discussions. However, existing sparsity-enhanced topic models all assume an over-complicated generative process, which severely limits their scalability and leaves them unable to infer the number of topics automatically from data.
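To make the notions of topic sparsity and an unbounded topic count concrete, here is a minimal sketch (not taken from the paper) of the Indian Buffet Process generative metaphor: each document selects a sparse binary subset of topics, and the number of topics grows with the corpus rather than being fixed in advance. The function name `sample_ibp`, the hyperparameter values, and the use of NumPy are illustrative assumptions.

```python
import numpy as np

def sample_ibp(num_docs, alpha, rng=None):
    """Draw a binary document-topic selection matrix Z from an
    Indian Buffet Process with concentration parameter alpha.

    Z[d, k] = 1 means document d "uses" topic k. The number of
    columns (topics) is not fixed in advance: it grows with the
    data, which is how IBP-based models infer the topic count.
    """
    rng = rng or np.random.default_rng(0)
    Z = np.zeros((num_docs, 0), dtype=int)
    for d in range(num_docs):
        if Z.shape[1] > 0:
            # Pick an existing topic k with probability m_k / (d + 1),
            # where m_k is how many earlier documents already use it.
            m = Z[:d].sum(axis=0)
            Z[d] = rng.random(Z.shape[1]) < m / (d + 1)
        # Open Poisson(alpha / (d + 1)) brand-new topics.
        k_new = rng.poisson(alpha / (d + 1))
        if k_new > 0:
            fresh = np.zeros((num_docs, k_new), dtype=int)
            fresh[d] = 1
            Z = np.hstack([Z, fresh])
    return Z

Z = sample_ibp(num_docs=1000, alpha=3.0)
print(Z.shape[1], "topics total;", Z.sum(axis=1).mean(), "per document")
```

Each row of the sampled matrix is sparse (a few active topics per document) while popular topics are shared across many documents, which is exactly the structure the abstract describes.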
In this paper, we propose a probabilistic Bayesian topic model, the Sparse Dirichlet mixture Topic Model (SparseDTM), built on an Indian Buffet Process (IBP) prior, and fit it to large text corpora through a novel inference procedure called stochastic variational-Gibbs inference. Unlike prior work, the proposed approach recovers an exactly sparse topical structure for large short-text collections and automatically identifies the number of topics, striking a good balance between completeness and homogeneity of topic coherence. Experiments on large text corpora of different genres demonstrate that our approach outperforms various existing sparse topic models, with significant improvements on large-scale collections of short text.
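The abstract does not spell out the algorithm, but a generic stochastic variational-Gibbs loop combines the two named ingredients: collapsed Gibbs sampling for the local (per-document) topic assignments on a minibatch, and a Robbins-Monro stochastic natural-gradient step on the global variational topic parameters. The sketch below is one plausible reading under those assumptions, not SparseDTM itself; it fixes the number of topics for simplicity (the paper infers it via the IBP prior), and all names and hyperparameter values are hypothetical.

```python
import numpy as np
from scipy.special import digamma

def stochastic_variational_gibbs(docs, vocab_size, num_topics,
                                 alpha=0.1, eta=0.01, tau=1.0, kappa=0.6,
                                 num_iters=500, gibbs_sweeps=5,
                                 batch_size=64, rng=None):
    """Hybrid inference sketch: collapsed Gibbs sampling for the local
    topic assignments of a minibatch, then a Robbins-Monro stochastic
    natural-gradient step on the global variational Dirichlet
    parameters lam[k, w] of each topic's word distribution.
    `docs` is a list of integer word-id arrays, one per document.
    """
    rng = rng or np.random.default_rng(0)
    lam = rng.gamma(1.0, 1.0, size=(num_topics, vocab_size))
    num_docs = len(docs)
    for t in range(num_iters):
        rho = (t + tau) ** (-kappa)  # Robbins-Monro step size
        batch = rng.choice(num_docs, size=batch_size, replace=False)
        # E_q[log beta_{kw}] under the current global factors.
        log_beta = digamma(lam) - digamma(lam.sum(axis=1, keepdims=True))
        stats = np.zeros_like(lam)   # minibatch sufficient statistics
        for d in batch:
            w = docs[d]
            z = rng.integers(num_topics, size=len(w))
            counts = np.bincount(z, minlength=num_topics).astype(float)
            for _ in range(gibbs_sweeps):
                for n in range(len(w)):   # resample one token at a time
                    counts[z[n]] -= 1
                    logp = np.log(counts + alpha) + log_beta[:, w[n]]
                    p = np.exp(logp - logp.max())
                    z[n] = rng.choice(num_topics, p=p / p.sum())
                    counts[z[n]] += 1
            np.add.at(stats, (z, w), 1.0)  # topic-word counts from samples
        # Noisy natural-gradient update, rescaled to the whole corpus.
        lam = (1 - rho) * lam + rho * (eta + (num_docs / batch_size) * stats)
    return lam
```

The `num_docs / batch_size` rescaling makes each minibatch update an unbiased estimate of the full-corpus natural gradient, which is what allows schemes of this family to scale to large collections.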



    Published In

CIKM '16: Proceedings of the 25th ACM International Conference on Information and Knowledge Management
    October 2016
    2566 pages
    ISBN:9781450340731
    DOI:10.1145/2983323


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

1. Indian buffet process
2. short text
3. sparse topical structure
4. stochastic variational-Gibbs inference
5. topic modeling

    Qualifiers

    • Research-article

    Funding Sources

    • The Chinese University of Hong Kong

    Conference

    CIKM'16
    Sponsor:
    CIKM'16: ACM Conference on Information and Knowledge Management
    October 24 - 28, 2016
Indianapolis, Indiana, USA

    Acceptance Rates

CIKM '16 Paper Acceptance Rate: 160 of 701 submissions, 23%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%
