DOI: 10.1145/3485447.3512007
research-article
Public Access

A Guided Topic-Noise Model for Short Texts

Published: 25 April 2022

Abstract

Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to help them see the entire landscape of the conversation, but unsupervised topic models often produce topic sets that miss topics experts expect or want to see. To solve this problem, we propose the Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind. The input to GTM is a set of topics that are of interest to the user and a small number of words or phrases that belong to those topics. These seed topics guide the topic generation process and can be augmented interactively: the user expands the seed word list as the model surfaces new relevant words for each topic. GTM uses a novel initialization and a new sampling algorithm called Generalized Polya Urn (GPU) seed word sampling to produce a topic set that includes expanded seed topics as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets.
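
To make the input format and the urn idea concrete, the following is a minimal sketch, not the GTM algorithm itself, which the abstract does not specify. It assumes seed topics are given as a mapping from a topic name to a few seed words, and it illustrates the general Polya urn intuition: in a simple urn only the observed word gains a count, while in a generalized urn related words (here, the other seed words of the same topic) are promoted as well. The topic names, seed words, function name `gpu_increment`, and the `promotion` weight are all hypothetical.

```python
from collections import defaultdict

# Hypothetical seed topics: topic name -> a few seed words (illustrative only).
seed_topics = {
    "vaccines": ["vaccine", "dose", "booster"],
    "schools": ["school", "teacher", "homeschool"],
}

def gpu_increment(counts, topic, word, seed_topics, promotion=0.3):
    """Add one observation of `word` to `topic`, urn-style.

    In a simple Polya urn only `word` gains a count. In this generalized
    sketch, if `word` is a seed word of `topic`, every other seed word of
    that topic also gains `promotion` (an assumed weight, not from the paper).
    """
    counts[topic][word] += 1.0
    seeds = seed_topics.get(topic, [])
    if word in seeds:
        for other in seeds:
            if other != word:
                counts[topic][other] += promotion

# Topic-word counts, as a sampler might maintain them.
counts = defaultdict(lambda: defaultdict(float))
gpu_increment(counts, "vaccines", "dose", seed_topics)    # seed word: promotes its siblings
gpu_increment(counts, "vaccines", "clinic", seed_topics)  # non-seed word: plain increment
```

In this sketch, observing one seed word raises the weight of the whole seed set within that topic, which is one plausible way seed words could keep a guided topic anchored while non-seed words accumulate counts normally.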

Cited By

  • (2025) The Advent of Topic-Noise Models. Text Mining in Educational Research, 25-42. DOI: 10.1007/978-981-97-7858-4_3. Online publication date: 13-Jan-2025
  • (2024) Understanding the rationales and information environments for early, late, and nonadopters of the COVID-19 vaccine. npj Vaccines 9:1. DOI: 10.1038/s41541-024-00962-5. Online publication date: 14-Sep-2024

Published In

WWW '22: Proceedings of the ACM Web Conference 2022
April 2022
3764 pages
ISBN: 9781450390965
DOI: 10.1145/3485447

Publisher

Association for Computing Machinery
New York, NY, United States

Author Tags

1. guided topic model
2. seed topics
3. semi-supervised topic model
4. social media
5. topic modeling
6. topic-noise model

Qualifiers

• Research-article
• Research
• Refereed limited

Conference

WWW '22: The ACM Web Conference 2022
April 25 - 29, 2022
Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%
