Abstract
To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities a given user belongs to, e.g. business or politics. Obtaining high-quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where each list represents a community, e.g. a list of journalists. However, ground truth datasets obtained from such lists can be noisy, since not all users who belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these public Twitter lists could negatively affect classification performance and should therefore not be used for training. Through experiments with three classifiers and five communities, we show that removing ambiguous users based on their tweets and profiles can result in a 10% increase in F1 performance.
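The list-based labelling and filtering idea can be illustrated with a minimal Python sketch. It is not the procedure used in the paper: the abstract does not specify how ambiguity is detected, so the two rules below (membership in lists of more than one community, or a profile that mentions another community's keywords) are assumptions, as are the function names, the toy lists, the profiles, and the keyword sets.

```python
from collections import defaultdict


def build_ground_truth(lists):
    """Map each user to the set of communities whose lists contain them."""
    labels = defaultdict(set)
    for community, members in lists.items():
        for user in members:
            labels[user].add(community)
    return dict(labels)


def drop_ambiguous(labels, profiles, keywords):
    """Keep only users with a single list label whose profile text
    mentions no keyword of any other community (hypothetical rule)."""
    clean = {}
    for user, communities in labels.items():
        if len(communities) != 1:
            continue  # listed under several communities: treated as ambiguous
        (community,) = communities
        text = profiles.get(user, "").lower()
        mentions_other = any(
            word in text
            for other, words in keywords.items()
            if other != community
            for word in words
        )
        if not mentions_other:
            clean[user] = community
    return clean


# Toy data, invented for illustration only.
lists = {
    "politics": ["alice", "bob"],
    "business": ["bob", "carol", "dave"],
}
profiles = {
    "alice": "Member of parliament",
    "carol": "CEO and election commentator",
    "dave": "Startup founder",
}
keywords = {
    "politics": ["parliament", "election"],
    "business": ["ceo", "founder"],
}

labels = build_ground_truth(lists)
clean = drop_ambiguous(labels, profiles, keywords)
print(clean)
```

In this toy run, `bob` is dropped because he appears in lists for two communities, and `carol` is dropped because her profile mentions a politics keyword although she comes from a business list.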
Notes
1. ~20% of accounts have been removed from Twitter, and are excluded from our test dataset.
2. One-vs-all is the recommended setup for multi-class classification using SVM [12].
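The one-vs-all setup mentioned in the note can be sketched as follows. This is only an illustration of the reduction from multi-class to binary classification: one binary scorer is trained per community, and the community with the highest score is predicted. To keep the example free of dependencies, a simple centroid-difference linear scorer stands in for the SVM; the scorer, the function names, and the three-dimensional toy features are all assumptions, not the paper's setup.

```python
def train_binary(X, y):
    """Toy linear scorer (a stand-in for an SVM): the weight vector points
    from the negative-class centroid to the positive-class centroid."""
    pos = [x for x, t in zip(X, y) if t == 1]
    neg = [x for x, t in zip(X, y) if t == -1]
    dim = len(X[0])
    mu_pos = [sum(x[i] for x in pos) / len(pos) for i in range(dim)]
    mu_neg = [sum(x[i] for x in neg) / len(neg) for i in range(dim)]
    w = [p - n for p, n in zip(mu_pos, mu_neg)]
    # Bias places the decision boundary at the midpoint of the two centroids.
    b = -sum(wi * (p + n) / 2 for wi, p, n in zip(w, mu_pos, mu_neg))
    return lambda x: sum(wi * xi for wi, xi in zip(w, x)) + b


def train_one_vs_all(X, labels):
    """Train one binary scorer per community: that community vs. the rest."""
    scorers = {}
    for community in sorted(set(labels)):
        y = [1 if label == community else -1 for label in labels]
        scorers[community] = train_binary(X, y)
    return scorers


def predict(scorers, x):
    """Return the community whose scorer gives the highest score."""
    return max(scorers, key=lambda community: scorers[community](x))


# Toy feature vectors, invented for illustration only.
X = [
    [1.0, 0.0, 0.0], [0.9, 0.1, 0.0],   # politics
    [0.0, 1.0, 0.0], [0.1, 0.9, 0.0],   # business
    [0.0, 0.0, 1.0], [0.0, 0.1, 0.9],   # journalism
]
labels = ["politics", "politics", "business", "business",
          "journalism", "journalism"]

scorers = train_one_vs_all(X, labels)
print(predict(scorers, [0.8, 0.2, 0.0]))
print(predict(scorers, [0.1, 0.1, 0.8]))
```

With five communities, the same scheme would train five binary scorers, each separating one community from the other four.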
References
1. Purcell, K., Rainie, L., Mitchell, A., Rosenstiel, T.: Understanding the participatory news consumer. Pew Internet Am. Life Proj. 1, 19–21 (2010)
2. Erikson, R., MacKuen, M., Stimson, J.: The Macro Polity. Cambridge University Press, Cambridge (2002)
3. Culotta, A., Kumar, N., Cutler, J.: Predicting the demographics of Twitter users from website traffic data. In: Proceedings of AAAI (2015)
4. Pennacchiotti, M., Popescu, A.: A machine learning approach to Twitter user classification. In: Proceedings of ICWSM (2011)
5. De Choudhury, M., Diakopoulos, N., Naaman, M.: Unfolding the event landscape on Twitter: classification and exploration of user categories. In: Proceedings of CSCW (2012)
6. Sachan, M., Dubey, A., Srivastava, S., Xing, E.P., Hovy, E.: Spatial compactness meets topical consistency: jointly modeling links and content for community detection. In: Proceedings of the ICWSDM (2014)
7. Chen, X., Wang, Y., Agichtein, E., Wang, F.: A comparative study of demographic attribute inference in Twitter. In: Proceedings of ICWSM (2015)
8. Fang, A., Ounis, I., Habel, P., Macdonald, C., Limsopatham, N.: Topic-centric classification of Twitter user’s political orientation. In: Proceedings of SIGIR (2015)
9. Feng, V., Hirst, G.: Detecting deceptive opinions with profile compatibility. In: Proceedings of IJCNLP (2013)
10. Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., Yarowsky, D.: Broadly improving user classification via communication-based name and location clustering on Twitter. In: Proceedings of HLT-NAACL (2013)
11. Bagdouri, M., Oard, D.: Profession-based person search in microblogs: using seed sets to find journalists. In: Proceedings of CIKM (2015)
12. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Su, T., Fang, A., McCreadie, R., Macdonald, C., Ounis, I. (2018). On Refining Twitter Lists as Ground Truth Data for Multi-community User Classification. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_74
DOI: https://doi.org/10.1007/978-3-319-76941-7_74
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7
eBook Packages: Computer Science, Computer Science (R0)