Abstract
Annotating data is a common bottleneck in building text classifiers. This is particularly problematic in social media domains, where data drift requires frequent retraining to maintain high accuracy. In this paper, we propose and evaluate a text classification method for Twitter data whose only required human input is a single keyword per class. The algorithm proceeds by identifying exemplar Twitter accounts that are representative of each class by analyzing Twitter Lists (human-curated collections of related Twitter accounts). A classifier is then fit to the exemplar accounts and used to predict labels of new tweets and users. We develop domain adaptation methods to address the noise and selection bias inherent to this approach, which we find to be critical to classification accuracy. Across a diverse set of tasks (topic, gender, and political affiliation classification), we find that the resulting classifier is competitive with a fully supervised baseline, achieving superior accuracy on four of six datasets despite using no manually labeled data.
Notes
The query used is: site:twitter.com inurl:lists.
We created a Twitter-specific stop list containing the 500 most frequently used words from a sample of a year’s worth of English tweets.
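As an illustration of the stop list note above, a frequency-based stop list could be built roughly as follows. This is a minimal sketch, not the procedure used in the paper: the function name `build_stop_list`, the tokenization pattern, and the toy sample are all assumptions made for the example.

```python
import re
from collections import Counter


def build_stop_list(tweets, size=500):
    """Return the `size` most frequent lowercase tokens found in `tweets`.

    The tokenizer is an assumption: it keeps letters, digits, apostrophes,
    and the '#' and '@' prefixes used for hashtags and mentions.
    """
    counts = Counter()
    for text in tweets:
        counts.update(re.findall(r"[a-z0-9#@']+", text.lower()))
    return [word for word, _ in counts.most_common(size)]


# Toy sample standing in for a year's worth of English tweets.
sample = ["the game is on", "the vote is in", "on the way to the game"]
stop_words = build_stop_list(sample, size=2)
```

With a real corpus, `size=500` would match the list length described in the note; the resulting words would then be excluded from the classifier's vocabulary.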
Acknowledgments
This research was funded in part by support from the IIT Educational and Research Initiative Fund and in part by the National Science Foundation under Grant #IIS-1526674. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.
Cite this article
Culotta, A. Training a text classifier with a single word using Twitter Lists and domain adaptation. Soc. Netw. Anal. Min. 6, 8 (2016). https://doi.org/10.1007/s13278-016-0317-1
DOI: https://doi.org/10.1007/s13278-016-0317-1