DOI: 10.1145/2623330.2623336
research-article

Large-scale high-precision topic modeling on twitter

Published: 24 August 2014

Abstract

We are interested in organizing a continuous stream of sparse and noisy texts, known as "tweets", in real time into an ontology of hundreds of topics with measurable and stringently high precision. This inference is performed over a full-scale stream of Twitter data, whose statistical distribution evolves rapidly over time. Implementing it in an industrial setting, where the results could affect and be visible to real users, made it necessary to overcome a host of practical challenges. We present a spectrum of topic modeling techniques that contribute to a deployed system. These include non-topical tweet detection, automatic labeled data acquisition, evaluation with human computation, diagnostic and corrective learning and, most importantly, high-precision topic inference. The last of these comprises a novel two-stage training algorithm for tweet text classification and a closed-loop inference mechanism for combining texts with additional sources of information. The resulting system achieves 93% precision at substantial overall coverage.
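
The abstract's headline result pairs precision with coverage. As a minimal sketch of how these two quantities can trade off when a classifier withholds low-confidence predictions, consider the following. The function name, the toy scores and labels, and the thresholds are all illustrative assumptions; this is not the paper's algorithm or data.

```python
from typing import Optional, Sequence, Tuple


def precision_and_coverage(
    scores: Sequence[float],
    predicted: Sequence[str],
    actual: Sequence[str],
    threshold: float,
) -> Tuple[Optional[float], float]:
    """Return (precision, coverage) when predictions scoring below
    `threshold` are withheld. Precision is None if nothing is kept."""
    if not (len(scores) == len(predicted) == len(actual)):
        raise ValueError("inputs must have equal length")
    if not scores:
        raise ValueError("inputs must be non-empty")

    # Keep only the predictions the classifier is confident about.
    kept = [(p, a) for s, p, a in zip(scores, predicted, actual) if s >= threshold]

    # Coverage: fraction of all items that received a topic label.
    coverage = len(kept) / len(scores)
    if not kept:
        return None, coverage

    # Precision: fraction of the emitted labels that are correct.
    correct = sum(1 for p, a in kept if p == a)
    return correct / len(kept), coverage


# Hypothetical toy data: five tweets with a confidence score,
# a predicted topic, and a human-judged topic.
scores = [0.95, 0.90, 0.80, 0.60, 0.40]
predicted = ["sports", "music", "sports", "politics", "music"]
actual = ["sports", "music", "politics", "politics", "sports"]

strict = precision_and_coverage(scores, predicted, actual, threshold=0.85)
loose = precision_and_coverage(scores, predicted, actual, threshold=0.5)
```

In this toy setting, raising the threshold improves precision at the cost of labeling fewer tweets, which is the kind of trade-off a high-precision system has to manage.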

Supplementary Material

MP4 File (p1907-sidebyside.mp4)

Published In

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining
August 2014
2028 pages
ISBN: 9781450329569
DOI: 10.1145/2623330
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. large-scale machine learning
  2. social media
  3. text classification
  4. topic modeling

Qualifiers

  • Research-article

Conference

KDD '14
Acceptance Rates

KDD '14 Paper Acceptance Rate: 151 of 1,036 submissions, 15%
Overall Acceptance Rate: 1,133 of 8,635 submissions, 13%
