On Refining Twitter Lists as Ground Truth Data for Multi-community User Classification

Su, Ting; Fang, Anjie; McCreadie, Richard; Macdonald, Craig; Ounis, Iadh

doi:10.1007/978-3-319-76941-7_74

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10772)

Included in the following conference series:

European Conference on Information Retrieval

Abstract

To help scholars and businesses understand and analyse Twitter users, it is useful to have classifiers that can identify the communities that a given user belongs to, e.g. business or politics. Obtaining high-quality training data is an important step towards producing an effective multi-community classifier. An efficient approach for creating such ground truth data is to extract users from existing public Twitter lists, where those lists represent different communities, e.g. a list of journalists. However, ground truth datasets obtained using such lists can be noisy, since not all users that belong to a community are good training examples for that community. In this paper, we conduct a thorough failure analysis of a ground truth dataset generated using Twitter lists. We discuss how some categories of users collected from these public Twitter lists could negatively affect classification performance and therefore should not be used for training. Through experiments with 3 classifiers and 5 communities, we show that removing ambiguous users based on their tweets and profile can indeed result in a 10% increase in F1 performance.
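
The abstract does not spell out how ambiguous users are detected, so the following is only a minimal sketch of what a refinement step over list-derived training data might look like. The keyword rule, the `min_hits` threshold, the record fields and the function names are all assumptions made for illustration; they are not the criteria used in the paper.

```python
def count_keyword_hits(text, keywords):
    """Count tokens in `text` that match a community keyword (case-insensitive)."""
    tokens = (token.strip(".,!?#@") for token in text.lower().split())
    return sum(1 for token in tokens if token in keywords)


def refine_ground_truth(users, community_keywords, min_hits=1):
    """Keep only users whose profile and tweets mention their community.

    `users` is a list of dicts with the hypothetical fields
    'id', 'community', 'profile' and 'tweets'. The keyword rule is an
    illustrative stand-in for an ambiguity criterion, not the paper's.
    """
    kept = []
    for user in users:
        text = user["profile"] + " " + " ".join(user["tweets"])
        keywords = community_keywords[user["community"]]
        if count_keyword_hits(text, keywords) >= min_hits:
            kept.append(user)
    return kept


# Toy records, invented for illustration only.
users = [
    {"id": "a", "community": "politics", "profile": "MP for Glasgow",
     "tweets": ["Voting on the budget bill today"]},
    {"id": "b", "community": "politics", "profile": "Coffee lover",
     "tweets": ["Great weather"]},
]
community_keywords = {"politics": {"mp", "bill", "vote", "election"}}
refined = refine_ground_truth(users, community_keywords)
```

In this toy run, user `b` comes from a politics list but shows no politics-related signal in the profile or tweets, so the rule drops that user from the training set.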

Notes

1. Approximately 20% of accounts have been removed from Twitter and are excluded from our test dataset.
2. One-vs-all is the recommended setup for multi-class classification using SVM [12].
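
As an illustration of the one-vs-all setup mentioned in note 2, the sketch below trains one binary model per community and predicts the community whose model gives the highest score. To stay self-contained it uses a simple perceptron as a stand-in for the binary SVM, and the feature vectors and community labels are toy values; none of this is taken from the paper's implementation.

```python
def dot(weights, features):
    return sum(w * x for w, x in zip(weights, features))


def train_binary(X, y, epochs=20):
    """Perceptron used here as a stand-in for a binary SVM; y holds +1 / -1."""
    weights = [0.0] * len(X[0])
    bias = 0.0
    for _ in range(epochs):
        for features, target in zip(X, y):
            if target * (dot(weights, features) + bias) <= 0:
                # Misclassified (or on the boundary): move towards the target.
                weights = [w + target * x for w, x in zip(weights, features)]
                bias += target
    return weights, bias


def train_one_vs_all(X, labels):
    """Train one binary model per class: that class (+1) against all others (-1)."""
    models = {}
    for community in sorted(set(labels)):
        y = [1 if label == community else -1 for label in labels]
        models[community] = train_binary(X, y)
    return models


def predict(models, features):
    """Return the class whose binary model assigns the highest score."""
    def score(community):
        weights, bias = models[community]
        return dot(weights, features) + bias
    return max(models, key=score)


# Toy feature vectors and labels, invented for illustration only.
X = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
labels = ["business", "politics", "journalism"]
models = train_one_vs_all(X, labels)
```

With a real SVM the per-class score would be the signed distance to that class's decision boundary; the arg-max step stays the same.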

References

1. Purcell, K., Rainie, L., Mitchell, A., Rosenstiel, T.: Understanding the participatory news consumer. Pew Internet Am. Life Proj. 1, 19–21 (2010)
2. Erikson, R., MacKuen, M., Stimson, J.: The Macro Polity. Cambridge University Press, Cambridge (2002)
3. Culotta, A., Kumar, N., Cutler, J.: Predicting the demographics of Twitter users from website traffic data. In: Proceedings of AAAI (2015)
4. Pennacchiotti, M., Popescu, A.: A machine learning approach to Twitter user classification. In: Proceedings of ICWSM (2011)
5. De Choudhury, M., Diakopoulos, N., Naaman, M.: Unfolding the event landscape on Twitter: classification and exploration of user categories. In: Proceedings of CSCW (2012)
6. Sachan, M., Dubey, A., Srivastava, S., Xing, E.P., Hovy, E.: Spatial compactness meets topical consistency: jointly modeling links and content for community detection. In: Proceedings of the ICWSDM (2014)
7. Chen, X., Wang, Y., Agichtein, E., Wang, F.: A comparative study of demographic attribute inference in Twitter. In: Proceedings of ICWSM (2015)
8. Fang, A., Ounis, I., Habel, P., Macdonald, C., Limsopatham, N.: Topic-centric classification of Twitter user’s political orientation. In: Proceedings of SIGIR (2015)
9. Feng, V., Hirst, G.: Detecting deceptive opinions with profile compatibility. In: Proceedings of IJCNLP (2013)
10. Bergsma, S., Dredze, M., Van Durme, B., Wilson, T., Yarowsky, D.: Broadly improving user classification via communication-based name and location clustering on Twitter. In: Proceedings of HLT-NAACL (2013)
11. Bagdouri, M., Oard, D.: Profession-based person search in microblogs: using seed sets to find journalists. In: Proceedings of CIKM (2015)
12. Rifkin, R., Klautau, A.: In defense of one-vs-all classification. J. Mach. Learn. Res. 5, 101–141 (2004)

Author information

Authors and Affiliations

University of Glasgow, Glasgow, UK
Ting Su, Anjie Fang, Richard McCreadie, Craig Macdonald & Iadh Ounis

Corresponding author

Correspondence to Ting Su.

Editor information

Editors and Affiliations

Department of Informatics, Systems, and Communication, University of Milano-Bicocca, Milan, Italy
Gabriella Pasi
LIP6 – UPMC/CNRS, University Pierre et Marie Curie, Paris, France
Benjamin Piwowarski
University of Glasgow, Glasgow, United Kingdom
Leif Azzopardi
Technical University of Vienna, Vienna, Austria
Allan Hanbury

About this paper

Cite this paper

Su, T., Fang, A., McCreadie, R., Macdonald, C., Ounis, I. (2018). On Refining Twitter Lists as Ground Truth Data for Multi-community User Classification. In: Pasi, G., Piwowarski, B., Azzopardi, L., Hanbury, A. (eds) Advances in Information Retrieval. ECIR 2018. Lecture Notes in Computer Science, vol 10772. Springer, Cham. https://doi.org/10.1007/978-3-319-76941-7_74

DOI: https://doi.org/10.1007/978-3-319-76941-7_74
Published: 01 March 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-76940-0
Online ISBN: 978-3-319-76941-7
eBook Packages: Computer Science, Computer Science (R0)