Using topic-noise models to generate domain-specific topics across data sources

  • Regular Paper
  • Published in: Knowledge and Information Systems

Abstract

Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to any of the dozens of online social media platforms. Most topic models are equipped to generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem that many models face is the varying noise levels inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and quality. Our topic-noise model, Topic Noise Discriminator (TND), approximates topic and noise distributions side by side with the help of word embedding spaces. While topic-noise models are important for the types of short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. TND itself generates a noise distribution that, when ensembled with other generative topic models, can produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), and demonstrate the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics from multiple sources and finding that they need a way to identify a core set of topics based on text from different sources. We propose cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by finding subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.
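
To make the ensembling step concrete, the sketch below shows one way a noise distribution learned by a topic-noise model could be used to filter the top words of LDA topics. It is a minimal illustration under assumptions of ours, not the authors' implementation: the data structures, the `beta` smoothing term, and the keep/drop rule are stand-ins for whatever the released code actually does.

```python
from collections import Counter
from typing import Dict, List


def filter_topics_with_noise(
    topics: List[List[str]],                  # ranked top words for each LDA topic
    topic_word_counts: List[Dict[str, int]],  # per-topic word assignment counts, aligned with topics
    noise_counts: Counter,                    # word -> count under the learned noise distribution
    beta: float = 25.0,                       # smoothing controlling how strongly noise suppresses a word
    top_n: int = 10,                          # number of filtered words to keep per topic
) -> List[List[str]]:
    """Drop topic words whose noise weight outweighs their topic weight (illustrative rule)."""
    filtered = []
    for topic, counts in zip(topics, topic_word_counts):
        kept = []
        for word in topic:
            # keep a word only if its topic assignments dominate its (smoothed) noise assignments
            if counts.get(word, 0) > noise_counts.get(word, 0) + beta:
                kept.append(word)
            if len(kept) == top_n:
                break
        filtered.append(kept)
    return filtered
```

In practice, the authors' released packages (see notes 2 and 8) implement TND and its LDA ensemble; the sketch only conveys the intuition that words dominated by the noise distribution are removed from otherwise coherent topics.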

Notes

  1. This paper is an extension of Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections, which appeared in the IEEE International Conference on Data Mining (ICDM) 2021 [5]. The first three contributions were introduced in the original paper and the next three are new contributions in this paper.

  2. The code repository can be found here: https://github.com/GU-DataLab/gdtm.

  3. The k value does not have to be the same for the two models.

  4. Specifically, the MALLET implementation of LDA [36].

  5. Parameters for sensitivity analysis across models: \(k \in \{10, 20, 30, 50, 100\}\); \(\alpha, \beta_0 \in \{0.01, 0.1, 1.0\}\); \(\beta_1 \in \{0, 16, 25, 36, 49\}\); \(\phi \in \{5, 10, 15, 20, 25, 30\}\); \(\mu \in \{0, 3, 5, 10\}\). A sketch enumerating this grid appears after these notes.

  6. A natural question here would be, given that there are 20 newsgroups, why not use \(k=20\)? We found that every model produced better results with \(k=30\).

  7. Examples of agreeing labels would be covid cases and covid stats, or masks and mask regulations.

  8. The code repository can be found here: https://github.com/GU-DataLab/topic-modeling
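
For reference, the sensitivity grid listed in note 5 can be enumerated mechanically. The sketch below is purely illustrative; `fit_and_score` is a hypothetical stand-in for training one model configuration and recording its coherence and diversity, and is not part of the released code.

```python
from itertools import product

# Candidate values from note 5; alpha and beta_0 share the same candidates.
k_values     = [10, 20, 30, 50, 100]
alpha_values = [0.01, 0.1, 1.0]
beta0_values = [0.01, 0.1, 1.0]
beta1_values = [0, 16, 25, 36, 49]
phi_values   = [5, 10, 15, 20, 25, 30]
mu_values    = [0, 3, 5, 10]

configs = [
    {"k": k, "alpha": a, "beta_0": b0, "beta_1": b1, "phi": phi, "mu": mu}
    for k, a, b0, b1, phi, mu in product(
        k_values, alpha_values, beta0_values, beta1_values, phi_values, mu_values
    )
]
# Each configuration would then be passed to a (hypothetical) fit_and_score(config)
# that trains the model and records coherence and diversity for comparison.
```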

References

  1. Churchill R, Singh L (2020) Percolation-based topic modeling for tweets. In: WISDOM 2020: KDD workshop on issues of sentiment discovery and opinion mining

  2. Churchill R, Singh L, Kirov C (2018) A temporal topic model for noisy mediums. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)

  3. Chemudugunta C, Smyth P, Steyvers M (2007) Modeling general and specific aspects of documents with a probabilistic topic model. In: Advances in neural information processing systems (NIPS)

  4. Li C, Wang H, Zhang Z, Sun A, Ma Z (2016) Topic modeling for short texts with auxiliary word embeddings. In: ACM SIGIR conference on research and development in information retrieval, pp. 165–174

  5. Churchill R, Singh L (2021) Topic-noise models: modeling topic and noise distributions in social media post collections. In: International conference on data mining (ICDM)

  6. Churchill R, Singh L (2021) The evolution of topic modeling. ACM Comput Surv

  7. Blei DM, Ng AY, Jordan MI (2003) Latent dirichlet allocation. J Mach Learn Res 3:993–1022

  8. Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical dirichlet processes. J Am Stat Assoc 101(476):1566–1581

  9. Blei DM, Lafferty JD (2006) Dynamic topic models. In: International conference on machine learning (ICML)

  10. Lafferty JD, Blei DM (2006) Correlated topic models. In: Advances in neural information processing systems (NIPS)

  11. Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134

  12. Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: International joint conference on artificial intelligence

  13. Yin J, Wang J (2014) A dirichlet multinomial mixture model-based approach for short text clustering. In: ACM international conference on knowledge discovery and data mining (KDD)

  14. Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. Trans Assoc Comput Linguist 3:299–313

  15. Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3(Feb):1137–1155

  16. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems (NIPS), pp. 3111–3119

  17. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781

  18. Moody CE (2016) Mixing dirichlet topic models and word embeddings to make lda2vec. CoRR arXiv:1605.02019

  19. Wang J, Chen L, Qin L, Wu X (2018) Astm: An attentional segmentation based topic model for short texts. In: IEEE international conference on data mining (ICDM)

  20. Dieng AB, Ruiz FJ, Blei DM (2019) Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907

  21. Dieng AB, Ruiz FJR, Blei DM (2019) The dynamic embedded topic model. CoRR arXiv:1907.05545

  22. Zhao WX, Jiang J, Weng J, He J, Lim E-P, Yan H, Li X (2011) Comparing twitter and traditional media using topic models. In: European conference on information retrieval (ECIR)

  23. Qiang J, Chen P, Wang T, Wu X (2016) Topic modeling over short texts by incorporating word embeddings. CoRR arXiv:1609.08496

  24. Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: International conference on knowledge discovery & data mining (KDD), pp. 2105–2114

  25. Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: International conference on world wide web (WWW)

  26. Miao Y, Yu L, Blunsom P (2016) Neural variational inference for text processing. In: International conference on machine learning (ICML), vol. 48, pp. 1727–1736

  27. Gui L, Leng J, Pergola G, Zhou Y, Xu R, He Y (2019) Neural topic model with reinforcement learning. In: Conference on empirical methods in natural language processing and the international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3478–3483

  28. Wang R, Zhou D, He Y (2019) Atm: adversarial-neural topic model. Inf Process Manag 56(6):102098

  29. Wang R, Hu X, Zhou D, He Y, Xiong Y, Ye C, Xu H (2020) Neural topic modeling with bidirectional adversarial training. arXiv preprint arXiv:2004.12331

  30. Li X, Wang Y, Zhang A, Li C, Chi J, Ouyang J (2018) Filtering out the noise in short text topic modeling. Inf Sci 456:83–96

  31. Shahnaz F, Berry MW, Pauca VP, Plemmons RJ (2006) Document clustering using nonnegative matrix factorization. Inf Process Manag 42:373–386

  32. Kasiviswanathan SP, Melville P, Banerjee A, Sindhwani V (2011) Emerging topic detection using dictionary learning. In: ACM international conference on information and knowledge management

  33. Yan X, Guo J, Liu S, Cheng X, Wang Y (2013) Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: SIAM international conference on data mining (SDM)

  34. Cataldi M, Di Caro L, Schifanella C (2010) Emerging topic detection on twitter based on temporal and social terms evaluation. In: ACM KDD workshop on multimedia data mining

  35. de Arruda HF, da Fontoura Costa L, Amancio DR (2016) Topic segmentation via community detection in complex networks. Chaos 26

  36. McCallum AK (2002) Mallet: a machine learning for language toolkit

  37. Lang K (1995) 20 Newsgroups Dataset. http://people.csail.mit.edu/jrennie/20Newsgroups/

  38. Churchill R, Singh L (2021) textprep: a text preprocessing toolkit for topic modeling on social media data. In: International conference on data science, technology, and applications (DATA)

  39. Pennington J, Socher R, Manning CD (2014) Glove: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp. 1532–1543

  40. Lau JH, Newman D, Baldwin T (2014) Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: Conference of the european chapter of the association for computational linguistics, pp. 530–539

Acknowledgements

This work was supported by National Science Foundation grants #1934925 and #1934494, and by the Massive Data Institute (MDI) at Georgetown University. We thank our funders. We also thank the S3MC project and the CNN Breakthrough project for their help identifying noise words for the 2020 US election.

Author information

Corresponding authors

Correspondence to Rob Churchill or Lisa Singh.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Churchill, R., Singh, L. Using topic-noise models to generate domain-specific topics across data sources. Knowl Inf Syst 65, 2159–2186 (2023). https://doi.org/10.1007/s10115-022-01805-2
