DOI: 10.1145/2566486.2567980

The dual-sparse topic model: mining focused topics and focused terms in short text

Published: 07 April 2014

Abstract

Topic modeling has proven to be an effective method for exploratory text mining. Most topic models share the common assumption that a document is generated from a mixture of topics. In real-world scenarios, however, individual documents usually concentrate on a few salient topics instead of covering a wide variety of topics, and a real topic likewise adopts a narrow range of terms instead of a wide coverage of the vocabulary. Understanding this sparsity of information is especially important for analyzing user-generated Web content and social media, which are characterized by extremely short posts and condensed discussions. In this paper, we propose a dual-sparse topic model that addresses sparsity in both the topic mixtures and the word usage. By applying a "Spike and Slab" prior to decouple the sparsity and smoothness of the document-topic and topic-word distributions, we allow individual documents to select a few focused topics and individual topics to select a few focused terms. Experiments on large corpora of different genres demonstrate that the dual-sparse topic model outperforms both classical topic models and existing sparsity-enhanced topic models. The improvement is especially notable on collections of short documents.
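
To make the spike-and-slab idea in the abstract concrete, the sketch below shows one standard way such a prior decouples topic selection (the spike) from smoothing (the slab) on the document side, with a mirrored construction on the topic-word side. The notation (selectors b, selection probabilities pi, Dirichlet hyperparameters alpha and beta, and weak smoothing terms alpha-bar and beta-bar) is illustrative and not necessarily the paper's own formulation.

% Illustrative generative sketch (assumed notation; the paper's exact model may differ)
\begin{align*}
b_{d,k} &\sim \mathrm{Bernoulli}(\pi_d)
  && \text{spike: does document } d \text{ use topic } k\text{?}\\
\theta_d &\sim \mathrm{Dirichlet}\!\left(\alpha\, b_{d,1} + \bar{\alpha}, \dots, \alpha\, b_{d,K} + \bar{\alpha}\right)
  && \text{slab: smooth proportions over the selected topics}\\
\tilde{b}_{k,v} &\sim \mathrm{Bernoulli}(\pi_k)
  && \text{spike: does topic } k \text{ use term } v\text{?}\\
\phi_k &\sim \mathrm{Dirichlet}\!\left(\beta\, \tilde{b}_{k,1} + \bar{\beta}, \dots, \beta\, \tilde{b}_{k,V} + \bar{\beta}\right)
  && \text{slab: smooth term distribution over the selected terms}\\
z_{d,n} &\sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\phi_{z_{d,n}})
  && \text{draw a topic, then a word, as in LDA}
\end{align*}

In this sketch the weak smoothing terms \bar{\alpha} and \bar{\beta} keep every entry strictly positive (smoothness), while the Bernoulli selectors concentrate most of the mass on a few topics per document and a few terms per topic (sparsity); setting all selectors to 1 and \bar{\alpha} = \bar{\beta} = 0 recovers standard LDA with symmetric priors.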



    Published In

WWW '14: Proceedings of the 23rd International Conference on World Wide Web
    April 2014
    926 pages
    ISBN:9781450327442
    DOI:10.1145/2566486

    Sponsors

    • IW3C2: International World Wide Web Conference Committee


    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. sparse representation
    2. spike and slab
    3. topic modeling
    4. user-generated content

    Qualifiers

    • Research-article



    Acceptance Rates

WWW '14 Paper Acceptance Rate: 84 of 645 submissions, 13%.
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%.

