Abstract
Online influence operations (OIOs) present a serious threat to the integrity of online social spaces and to real-world democratic elections. While many OIO detection approaches have focused on classification algorithms for individual social media posts (often with artificially balanced datasets), we present a novel system centered on a human analyst. This system incorporates a user representation and visualization procedure for unbalanced social media data. Our content-based social media user representation, the Mean User-Text Agglomeration (MUTA), summarizes a user’s social media activity with respect to Transformer embeddings of texts authored by the user. We apply MUTA to a real social media dataset in advance of an election event and flag a number of suspicious Reddit users that were later removed by the social media platform. When projected to a 2-dimensional space suitable for visualization, MUTA user representations are shown, via extrinsic cluster quality measures, to outperform BERT representations for analyst identification of OIO accounts.
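As a rough illustration of the idea of summarizing a user by the embeddings of the texts they authored, the sketch below averages per-post vectors for each user. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mean_user_embeddings`, the toy 2-dimensional vectors, and the use of a plain element-wise mean are all hypothetical choices made here, and in practice the vectors would come from a Transformer encoder rather than being written by hand.

```python
from typing import Dict, List, Sequence, Tuple

Vector = List[float]


def mean_user_embeddings(
    posts: Sequence[Tuple[str, Sequence[float]]],
) -> Dict[str, Vector]:
    """Average the embedding vectors of all posts written by each user.

    `posts` is a sequence of (user_id, embedding) pairs. The embeddings are
    assumed to be precomputed and to share one dimensionality.
    """
    sums: Dict[str, Vector] = {}
    counts: Dict[str, int] = {}
    for user, embedding in posts:
        if user not in sums:
            # First post seen for this user: start a zero accumulator.
            sums[user] = [0.0] * len(embedding)
            counts[user] = 0
        if len(embedding) != len(sums[user]):
            raise ValueError("all embeddings must have the same dimension")
        for i, value in enumerate(embedding):
            sums[user][i] += value
        counts[user] += 1
    # Divide each accumulated sum by the number of posts for that user.
    return {
        user: [total / counts[user] for total in vector]
        for user, vector in sums.items()
    }


# Hypothetical toy data: two posts by "alice", one by "bob".
toy_posts = [
    ("alice", [1.0, 0.0]),
    ("alice", [3.0, 2.0]),
    ("bob", [0.0, 4.0]),
]
user_vectors = mean_user_embeddings(toy_posts)
```

The resulting per-user vectors could then be passed to a dimensionality reduction method to obtain the 2-dimensional view an analyst would inspect; that step is left out here.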
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Crothers, E., Viktor, H., Japkowicz, N. (2021). Mean User-Text Agglomeration (MUTA): Practical User Representation and Visualization for Detection of Online Influence Operations. In: Mohaisen, D., Jin, R. (eds) Computational Data and Social Networks. CSoNet 2021. Lecture Notes in Computer Science, vol 13116. Springer, Cham. https://doi.org/10.1007/978-3-030-91434-9_27
DOI: https://doi.org/10.1007/978-3-030-91434-9_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-91433-2
Online ISBN: 978-3-030-91434-9
eBook Packages: Computer Science (R0)