Training a text classifier with a single word using Twitter Lists and domain adaptation

Culotta, Aron

doi:10.1007/s13278-016-0317-1

Original Article
Published: 06 February 2016

Volume 6, article number 8, (2016)
Social Network Analysis and Mining

Aron Culotta¹

Abstract

Annotating data is a common bottleneck in building text classifiers. This is particularly problematic in social media domains, where data drift requires frequent retraining to maintain high accuracy. In this paper, we propose and evaluate a text classification method for Twitter data whose only required human input is a single keyword per class. The algorithm proceeds by identifying exemplar Twitter accounts that are representative of each class by analyzing Twitter Lists (human-curated collections of related Twitter accounts). A classifier is then fit to the exemplar accounts and used to predict labels of new tweets and users. We develop domain adaptation methods to address the noise and selection bias inherent to this approach, which we find to be critical to classification accuracy. Across a diverse set of tasks (topic, gender, and political affiliation classification), we find that the resulting classifier is competitive with a fully supervised baseline, achieving superior accuracy on four of six datasets despite using no manually labeled data.
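
The exemplar-identification step described in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the matching rule (keyword appears in the list name), the ranking rule (number of matching lists that contain an account), and the names `select_exemplars`, `toy_lists`, and `top_k` are all assumptions made for illustration, and the data is invented.

```python
from collections import Counter


def select_exemplars(lists, keyword, top_k=2):
    """Rank accounts by how many keyword-matching lists contain them.

    `lists` maps a list name to its member accounts. Both the matching
    rule and the ranking rule are illustrative assumptions, not the
    procedure from the paper.
    """
    keyword = keyword.lower()
    counts = Counter()
    for name, members in lists.items():
        if keyword in name.lower():
            # Count each account at most once per matching list.
            counts.update(set(members))
    # Sort by count (descending), then by account name for a stable order.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [account for account, _ in ranked[:top_k]]


# Invented toy data standing in for crawled Twitter Lists.
toy_lists = {
    "Health news": ["@a", "@b"],
    "health & fitness": ["@b", "@c"],
    "Sports": ["@d", "@b"],
}
exemplars = select_exemplars(toy_lists, "health", top_k=2)
print(exemplars)
```

In a full pipeline, tweets from the selected accounts would serve as noisy training examples for the class named by the keyword, which is why the abstract stresses correcting for noise and selection bias.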

Notes

The query used is: site:twitter.com inurl:lists.
https://api.twitter.com/1.1/lists/members.json.
https://api.twitter.com/1.1/statuses/user_timeline.json.
We created a Twitter-specific stop list containing the 500 most frequently used words from a sample of a year’s worth of English tweets.
http://dmoz.org.
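
A frequency-based stop list like the one mentioned in the notes can be sketched as follows. The tokenizer, the function name `build_stop_list`, and the sample tweets are assumptions for illustration; the notes state only that the list holds the 500 most frequently used words from a year's sample of English tweets.

```python
import re
from collections import Counter


def build_stop_list(tweets, size=500):
    """Return the `size` most frequent lowercase tokens in `tweets`.

    The regular-expression tokenizer is an assumption; the source does
    not say how tweets were tokenized.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(re.findall(r"[a-z0-9']+", tweet.lower()))
    # Sort by frequency (descending), then alphabetically for a stable cut.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:size]}


# Invented sample; a real stop list would be built from a large tweet sample.
sample = ["the game is on", "the vote is in", "the flu is here"]
stop = build_stop_list(sample, size=2)
print(sorted(stop))
```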

Acknowledgments

This research was funded in part by support from the IIT Educational and Research Initiative Fund and in part by the National Science Foundation under Grant #IIS-1526674. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.

Author information

Authors and Affiliations

Department of Computer Science, Illinois Institute of Technology, Chicago, IL, 60616, USA
Aron Culotta

Corresponding author

Correspondence to Aron Culotta.

About this article

Cite this article

Culotta, A. Training a text classifier with a single word using Twitter Lists and domain adaptation. Soc. Netw. Anal. Min. 6, 8 (2016). https://doi.org/10.1007/s13278-016-0317-1

Received: 25 November 2015
Revised: 18 January 2016
Accepted: 19 January 2016
Published: 06 February 2016
DOI: https://doi.org/10.1007/s13278-016-0317-1
