Training a text classifier with a single word using Twitter Lists and domain adaptation

  • Original Article
Social Network Analysis and Mining

Abstract

Annotating data is a common bottleneck in building text classifiers. This is particularly problematic in social media domains, where data drift requires frequent retraining to maintain high accuracy. In this paper, we propose and evaluate a text classification method for Twitter data whose only required human input is a single keyword per class. The algorithm proceeds by identifying exemplar Twitter accounts that are representative of each class by analyzing Twitter Lists (human-curated collections of related Twitter accounts). A classifier is then fit to the exemplar accounts and used to predict labels of new tweets and users. We develop domain adaptation methods to address the noise and selection bias inherent to this approach, which we find to be critical to classification accuracy. Across a diverse set of tasks (topic, gender, and political affiliation classification), we find that the resulting classifier is competitive with a fully supervised baseline, achieving superior accuracy on four of six datasets despite using no manually labeled data.
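The exemplar-identification step can be illustrated with a minimal sketch. The abstract does not specify how lists are matched or how accounts are ranked, so the approach here is an assumption: match the class keyword against list names and rank accounts by the number of matching lists that contain them. The function name `select_exemplars`, the `top_n` parameter, and the toy data are hypothetical.

```python
from collections import Counter


def select_exemplars(lists, keyword, top_n=2):
    """Rank accounts by how many keyword-matching lists contain them.

    `lists` is an iterable of (list_name, member_accounts) pairs.
    Matching on the list name is an assumption made for illustration.
    """
    counts = Counter()
    for name, members in lists:
        if keyword.lower() in name.lower():
            # Count each account at most once per list.
            counts.update(set(members))
    return [account for account, _ in counts.most_common(top_n)]


# Hypothetical toy data: four Twitter Lists and their members.
toy_lists = [
    ("Health news", ["@a", "@b", "@c"]),
    ("health experts", ["@a", "@b"]),
    ("Sports", ["@d", "@a"]),
    ("My health list", ["@a"]),
]
exemplars = select_exemplars(toy_lists, "health")
```

In this toy run the accounts appearing in the most "health" lists are selected; a classifier would then be fit to the tweets of those accounts.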

Notes

  1. The query used is: site:twitter.com inurl:lists.

  2. https://api.twitter.com/1.1/lists/members.json.

  3. https://api.twitter.com/1.1/statuses/user_timeline.json.

  4. We created a Twitter-specific stop list containing the 500 most frequently used words from a sample of a year’s worth of English tweets.

  5. http://dmoz.org.
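A stop list like the one in note 4 could be built roughly as follows. This is a sketch under stated assumptions: the notes give only the list size (500) and its source (a year of English tweets), so the regular-expression tokenizer and the function name `build_stop_list` are hypothetical choices.

```python
import re
from collections import Counter


def build_stop_list(tweets, size=500):
    """Return the `size` most frequent lowercase tokens in `tweets`."""
    counts = Counter()
    for text in tweets:
        # Assumed tokenizer: runs of letters and apostrophes.
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return {word for word, _ in counts.most_common(size)}


# Tiny hypothetical sample; a real run would use a large tweet corpus.
sample = ["the game is on", "the vote is today", "the the is"]
stop_words = build_stop_list(sample, size=2)
```

Tokens in the resulting set would be dropped from each tweet before feature extraction.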

Acknowledgments

This research was funded in part by support from the IIT Educational and Research Initiative Fund and in part by the National Science Foundation under Grant #IIS-1526674. Any opinions, findings and conclusions or recommendations expressed in this material are the authors’ and do not necessarily reflect those of the sponsor.

Author information

Corresponding author

Correspondence to Aron Culotta.

Cite this article

Culotta, A. Training a text classifier with a single word using Twitter Lists and domain adaptation. Soc. Netw. Anal. Min. 6, 8 (2016). https://doi.org/10.1007/s13278-016-0317-1
