research-article

Large-scale high-precision topic modeling on Twitter

Authors: Shuang-Hong Yang, Andy Schlaikjer, Pankaj Gupta

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Pages 1907 - 1916

https://doi.org/10.1145/2623330.2623336

Published: 24 August 2014

Abstract

We are interested in organizing a continuous stream of sparse and noisy texts, known as "tweets", in real time into an ontology of hundreds of topics with measurable and stringently high precision. This inference is performed over the full-scale stream of Twitter data, whose statistical distribution evolves rapidly over time. Because the system is implemented in an industrial setting, where its output can affect and be visible to real users, we had to overcome a host of practical challenges. We present a spectrum of topic modeling techniques that contribute to the deployed system. These include non-topical tweet detection, automatic labeled data acquisition, evaluation with human computation, diagnostic and corrective learning and, most importantly, high-precision topic inference. The last of these comprises a novel two-stage training algorithm for tweet text classification and a closed-loop inference mechanism for combining texts with additional sources of information. The resulting system achieves 93% precision at substantial overall coverage.

Supplementary Material

MP4 File (p1907-sidebyside.mp4), 245.95 MB



Index Terms

Large-scale high-precision topic modeling on Twitter
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information systems applications
    1. Data mining



Published In

KDD '14: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

August 2014

2028 pages

ISBN: 9781450329569

DOI: 10.1145/2623330

General Chairs: Sofus Macskassy (Facebook), Claudia Perlich (Dstillery)

Program Chairs: Jure Leskovec (Stanford University), Wei Wang (UCLA), Rayid Ghani (University of Chicago)
Copyright © 2014 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Conference

KDD '14: The 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

August 24 - 27, 2014

New York, New York, USA

Acceptance Rates

KDD '14 Paper Acceptance Rate 151 of 1,036 submissions, 15%

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

