A Guided Topic-Noise Model for Short Texts

Authors:

Robert Churchill,

Pamela Davis-Kean

WWW '22: Proceedings of the ACM Web Conference 2022

Pages 2870 - 2878

https://doi.org/10.1145/3485447.3512007

Published: 25 April 2022

Abstract

Researchers using social media data want to understand the discussions occurring in and about their respective fields. These domain experts often turn to topic models to help them see the entire landscape of the conversation, but unsupervised topic models frequently produce topic sets that miss topics the experts expect or want to see. To solve this problem, we propose the Guided Topic-Noise Model (GTM), a semi-supervised topic model designed with large domain-specific social media data sets in mind. The input to GTM is a set of topics that are of interest to the user and a small number of words or phrases that belong to those topics. These seed topics are used to guide the topic generation process and can be augmented interactively, expanding the seed word list as the model provides new relevant words for different topics. GTM uses a novel initialization and a new sampling algorithm called Generalized Polya Urn (GPU) seed word sampling to produce a topic set that includes expanded seed topics, as well as new unsupervised topics. We demonstrate the robustness of GTM on open-ended responses from a public opinion survey and four domain-specific Twitter data sets.
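
To make the input described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. It shows one plausible shape for seed topics (a topic name mapped to a few seed words or phrases), a generic generalized Polya urn-style count update in which assigning a seed word to its topic also gives a fractional boost to that topic's other seed words, and a helper for interactive seed augmentation. All names here (`seed_topics`, `build_seed_index`, `gpu_update`, `augment_seeds`, the `promotion` weight, and the example words) are hypothetical and chosen only for illustration; the exact update rule used by GTM is not stated in the abstract.

```python
from collections import defaultdict

# Hypothetical seed topics: topic name -> a few seed words or phrases.
seed_topics = {
    "remote_learning": ["zoom", "online class", "homeschool"],
    "vaccines": ["vaccine", "booster"],
}


def build_seed_index(seed_topics):
    """Map each seed word to the set of seed topics that list it."""
    index = defaultdict(set)
    for topic, words in seed_topics.items():
        for word in words:
            index[word].add(topic)
    return dict(index)


def gpu_update(topic_word_counts, topic, word, seed_topics, promotion=0.3):
    """Generalized Polya urn-style update (an illustrative assumption).

    Assigning `word` to `topic` raises its count by 1. If `word` is a seed
    word of `topic`, every other seed word of that topic also receives a
    fractional `promotion` weight, so seed words reinforce one another.
    """
    counts = topic_word_counts.setdefault(topic, {})
    counts[word] = counts.get(word, 0.0) + 1.0
    seeds = seed_topics.get(topic, [])
    if word in seeds:
        for other in seeds:
            if other != word:
                counts[other] = counts.get(other, 0.0) + promotion
    return topic_word_counts


def augment_seeds(seed_topics, topic, new_words):
    """Interactive augmentation: append user-approved words to a seed list."""
    existing = seed_topics.setdefault(topic, [])
    for w in new_words:
        if w not in existing:
            existing.append(w)
    return seed_topics
```

In this sketch a non-seed word only increments its own count, while a seed word pulls the rest of its seed list along with it; `promotion` controls how strongly the seed list is tied together and would need tuning in any real use.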

Cited By

Singh L, Churchill R (2025). The Advent of Topic-Noise Models. Text Mining in Educational Research, 25-42. Online publication date: 13-Jan-2025.
https://doi.org/10.1007/978-981-97-7858-4_3

Singh L, Bao L, Bode L, Budak C, Pasek J, Raghunathan T, Traugott M, Wang Y, Wycoff N (2024). Understanding the rationales and information environments for early, late, and nonadopters of the COVID-19 vaccine. npj Vaccines 9:1. Online publication date: 14-Sep-2024.
https://doi.org/10.1038/s41541-024-00962-5

Index Terms

A Guided Topic-Noise Model for Short Texts
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Language resources
  2. Machine learning
2. Information systems
  1. Information systems applications

Index terms have been assigned to the content through auto-classification.

Published In

WWW '22: Proceedings of the ACM Web Conference 2022

April 2022

3764 pages

ISBN:9781450390965

DOI:10.1145/3485447

Editors:
Frédérique Laforest (INSA Lyon, France)
Raphaël Troncy (EURECOM, France)
Elena Simperl (King’s College London, UK)
Deepak Agarwal (Pinterest, USA)
Aristides Gionis (KTH Royal Institute of Technology, Sweden)
Ivan Herman (W3C / retired)
Lionel Médini (Université Lyon 1, France)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGWEB: ACM Special Interest Group on Hypertext, Hypermedia, and Web

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Funding Sources

National Science Foundation

Conference

WWW '22

Sponsor:

SIGWEB

WWW '22: The ACM Web Conference 2022

April 25 - 29, 2022

Virtual Event, Lyon, France

Acceptance Rates

Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Bibliometrics

Total Citations: 2
Total Downloads: 691
Downloads (Last 12 months): 235
Downloads (Last 6 weeks): 28

Reflects downloads up to 08 Mar 2025
