Abstract
Domain-specific document collections, such as data sets about the COVID-19 pandemic, politics, and sports, have become more common as platforms grow and develop better ways to connect people whose interests align. These data sets come from many different sources, ranging from traditional sources like open-ended surveys and newspaper articles to any of the dozens of online social media platforms. Most topic models can generate topics from one or more of these data sources, but models rarely work well across all types of documents. The main problem many models face is the varying level of noise inherent in different types of documents. We propose topic-noise models, a new type of topic model that jointly models topic and noise distributions to produce a more accurate, flexible representation of documents regardless of their origin and quality. Our topic-noise model, the Topic Noise Discriminator (TND), approximates topic and noise distributions side by side with the help of word embedding spaces. While topic-noise models are important for the short, noisy documents that often originate on social media platforms, TND can also be used with more traditional data sources like newspapers. The noise distribution that TND generates can also be ensembled with other generative topic models to produce more coherent and diverse topic sets. We show the effectiveness of this approach using Latent Dirichlet Allocation (LDA), demonstrating the ability of TND to improve the quality of LDA topics in noisy document collections. Finally, researchers are beginning to generate topics using multiple sources and need a way to identify a core set of topics that spans those sources. We propose cross-source topic blending (CSTB), an approach that maps topic sets to an s-partite graph and identifies core topics that blend topics from across s sources by finding subgraphs with certain linkage properties. We demonstrate the effectiveness of topic-noise models and CSTB empirically on large real-world data sets from multiple domains and data sources.
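The ensembling idea above can be sketched concretely. The snippet below is a minimal, deterministic simplification, not the authors' released implementation: it keeps a candidate topic word only when the topic explains it better than a separately learned noise distribution. The names ensemble_filter, topic_counts, and noise_counts are illustrative placeholders.

```python
def ensemble_filter(topic_counts, noise_counts):
    """Filter one topic's words against a learned noise distribution.

    topic_counts: dict mapping word -> count under one topic (e.g., from LDA)
    noise_counts: dict mapping word -> count under the noise distribution
    Keeps a word only if it occurs more often under the topic than under
    noise; a deterministic stand-in for a probabilistic topic-vs-noise
    assignment.
    """
    return {w: c for w, c in topic_counts.items()
            if c > noise_counts.get(w, 0)}
```

The s-partite view behind CSTB can be sketched in the same hedged spirit: topics are nodes grouped by source, edges link sufficiently similar topics from different sources, and connected subgraphs that span enough sources become core topics. The similarity function, threshold, and min_sources values below are illustrative assumptions, not values from the paper.

```python
import itertools
from collections import defaultdict

def blend_topics(topic_sets, similarity, threshold=0.2, min_sources=2):
    """topic_sets: one list of topics (each topic a list of words) per source."""
    nodes = [(s, i) for s, topics in enumerate(topic_sets)
             for i in range(len(topics))]
    parent = {n: n for n in nodes}

    def find(n):  # union-find root lookup with path halving
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # s-partite constraint: only topics from different sources may be linked
    for (s1, i1), (s2, i2) in itertools.combinations(nodes, 2):
        if s1 != s2 and similarity(topic_sets[s1][i1],
                                   topic_sets[s2][i2]) >= threshold:
            parent[find((s1, i1))] = find((s2, i2))

    components = defaultdict(list)
    for n in nodes:
        components[find(n)].append(n)

    # a core topic blends topics from at least min_sources distinct sources
    return [grp for grp in components.values()
            if len({s for s, _ in grp}) >= min_sources]

def jaccard(a, b):
    """Example similarity: word overlap between two topics' top words."""
    union = set(a) | set(b)
    return len(set(a) & set(b)) / len(union) if union else 0.0
```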
Notes
This paper is an extension of "Topic-Noise Models: Modeling Topic and Noise Distributions in Social Media Post Collections," which appeared in the IEEE International Conference on Data Mining (ICDM) 2021 [5]. The first three contributions were introduced in the original paper; the next three are new to this paper.
The code repository can be found here: https://github.com/GU-DataLab/gdtm.
The k value does not have to be the same for the two models.
Specifically, the MALLET implementation of LDA [36].
Parameters for sensitivity analysis across models: \(k \in \{10, 20, 30, 50, 100\}\); \(\alpha , \beta _0 \in \{0.01, 0.1, 1.0\}\); \(\beta _1 \in \{0, 16, 25, 36, 49\}\); \(\phi \in \{5, 10, 15, 20, 25, 30\}\); \(\mu \in \{0, 3, 5, 10\}\).
A natural question here would be, given that there are 20 newsgroups, why not use \(k=20\)? We found that every model produced better results with \(k=30\).
Examples of agreeing labels would be "covid cases" and "covid stats", or "masks" and "mask regulations".
The code repository can be found here: https://github.com/GU-DataLab/topic-modeling.
References
Churchill R, Singh L (2020) Percolation-based topic modeling for tweets. In: WISDOM 2020: KDD workshop on issues of sentiment discovery and opinion mining
Churchill R, Singh L, Kirov C (2018) A temporal topic model for noisy mediums. In: Pacific-Asia conference on knowledge discovery and data mining (PAKDD)
Chemudugunta C, Smyth P, Steyvers M (2007) Modeling general and specific aspects of documents with a probabilistic topic model. In: Advances in neural information processing systems (NIPS)
Li C, Wang H, Zhang Z, Sun A, Ma Z (2016) Topic modeling for short texts with auxiliary word embeddings. In: ACM SIGIR conference on research and development in information retrieval, pp. 165–174
Churchill R, Singh L (2021) Topic-noise models: modeling topic and noise distributions in social media post collections. In: International conference on data mining (ICDM)
Churchill R, Singh L (2021) The evolution of topic modeling. ACM Comput Surv
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Teh YW, Jordan MI, Beal MJ, Blei DM (2006) Hierarchical Dirichlet processes. J Am Stat Assoc 101(476):1566–1581
Blei DM, Lafferty JD (2006) Dynamic topic models. In: International conference on machine learning (ICML)
Lafferty JD, Blei DM (2006) Correlated topic models. In: Advances in neural information processing systems (NIPS)
Nigam K, McCallum AK, Thrun S, Mitchell T (2000) Text classification from labeled and unlabeled documents using EM. Mach Learn 39(2–3):103–134
Quan X, Kit C, Ge Y, Pan SJ (2015) Short and sparse text topic modeling via self-aggregation. In: International joint conference on artificial intelligence
Yin J, Wang J (2014) A Dirichlet multinomial mixture model-based approach for short text clustering. In: ACM international conference on knowledge discovery and data mining (KDD)
Nguyen DQ, Billingsley R, Du L, Johnson M (2015) Improving topic models with latent feature word representations. Trans Assoc Comput Linguist 3:299–313
Bengio Y, Ducharme R, Vincent P, Jauvin C (2003) A neural probabilistic language model. J Mach Learn Res 3(Feb):1137–1155
Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J (2013) Distributed representations of words and phrases and their compositionality. In: Advances in neural information processing systems (NIPS), pp. 3111–3119
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781
Moody CE (2016) Mixing Dirichlet topic models and word embeddings to make lda2vec. arXiv preprint arXiv:1605.02019
Wang J, Chen L, Qin L, Wu X (2018) ASTM: an attentional segmentation-based topic model for short texts. In: IEEE international conference on data mining (ICDM)
Dieng AB, Ruiz FJ, Blei DM (2019) Topic modeling in embedding spaces. arXiv preprint arXiv:1907.04907
Dieng AB, Ruiz FJR, Blei DM (2019) The dynamic embedded topic model. arXiv preprint arXiv:1907.05545
Zhao WX, Jiang J, Weng J, He J, Lim E-P, Yan H, Li X (2011) Comparing Twitter and traditional media using topic models. In: European conference on information retrieval (ECIR)
Qiang J, Chen P, Wang T, Wu X (2016) Topic modeling over short texts by incorporating word embeddings. arXiv preprint arXiv:1609.08496
Zuo Y, Wu J, Zhang H, Lin H, Wang F, Xu K, Xiong H (2016) Topic modeling of short texts: a pseudo-document view. In: International conference on knowledge discovery & data mining (KDD), pp. 2105–2114
Yan X, Guo J, Lan Y, Cheng X (2013) A biterm topic model for short texts. In: International conference on world wide web (WWW)
Miao Y, Yu L, Blunsom P (2016) Neural variational inference for text processing. In: international conference on machine learning (ICML), vol. 48, pp. 1727–1736
Gui L, Leng J, Pergola G, Zhou Y, Xu R, He Y (2019) Neural topic model with reinforcement learning. In: Conference on empirical methods in natural language processing and the international joint conference on natural language processing (EMNLP-IJCNLP), pp. 3478–3483
Wang R, Zhou D, He Y (2019) ATM: adversarial-neural topic model. Inf Process Manag 56(6):102098
Wang R, Hu X, Zhou D, He Y, Xiong Y, Ye C, Xu H (2020) Neural topic modeling with bidirectional adversarial training. arXiv preprint arXiv:2004.12331
Li X, Wang Y, Zhang A, Li C, Chi J, Ouyang J (2018) Filtering out the noise in short text topic modeling. Inf Sci 456:83–96
Shahnaz F, Berry MW, Pauca VP, Plemmons RJ (2006) Document clustering using nonnegative matrix factorization. Inf Process Manag 42:373–386
Kasiviswanathan SP, Melville P, Banerjee A, Sindhwani V (2011) Emerging topic detection using dictionary learning. In: ACM international conference on information and knowledge management
Yan X, Guo J, Liu S, Cheng X, Wang Y (2013) Learning topics in short texts by non-negative matrix factorization on term correlation matrix. In: SIAM international conference on data mining (SDM)
Cataldi M, Di Caro L, Schifanella C (2010) Emerging topic detection on twitter based on temporal and social terms evaluation. In: ACM KDD workshop on multimedia data mining
de Arruda HF, da Fontoura Costa L, Amancio DR (2016) Topic segmentation via community detection in complex networks. Chaos 26
McCallum AK (2002) MALLET: a machine learning for language toolkit
Lang K (1995) 20 Newsgroups Dataset. http://people.csail.mit.edu/jrennie/20Newsgroups/
Churchill R, Singh L (2021) textprep: a text preprocessing toolkit for topic modeling on social media data. In: International conference on data science, technology, and applications (DATA)
Pennington J, Socher R, Manning CD (2014) GloVe: global vectors for word representation. In: Empirical methods in natural language processing (EMNLP), pp. 1532–1543
Lau JH, Newman D, Baldwin T (2014) Machine reading tea leaves: automatically evaluating topic coherence and topic model quality. In: Conference of the European chapter of the Association for Computational Linguistics, pp. 530–539
Acknowledgements
This work was supported by National Science Foundation grants #1934925 and #1934494, and by the Massive Data Institute (MDI) at Georgetown University. We would like to thank our funders, as well as the S3MC project and the CNN Breakthrough project for their help identifying noise words for the 2020 US election.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Churchill, R., Singh, L. Using topic-noise models to generate domain-specific topics across data sources. Knowl Inf Syst 65, 2159–2186 (2023). https://doi.org/10.1007/s10115-022-01805-2