Abstract
Aliases are used as a means of anonymity on the Internet in environments such as Internet Relay Chat (IRC), forums and micro-blogging websites such as Twitter. While there are genuine reasons for the use of aliases, such as journalists operating in politically oppressive countries, they are increasingly being used by cybercriminals and extremist organisations. Recent years have seen increased research on authorship attribution of Twitter messages, including authorship analysis of aliases. Previous studies have shown that anti-aliasing of randomly generated sub-aliases yields high accuracy when linking the sub-aliases, but becomes much less accurate when topic-based sub-aliases are used. N-gram methods have previously been demonstrated to perform better than other methods in this situation. This paper investigates the effect of topic-based sampling on authorship attribution accuracy for the popular micro-blogging website Twitter. Features are extracted using character n-grams, which capture differences in authorship style. These features are classified using support vector machines in a one-versus-all configuration. The predictive performance of the algorithm is then evaluated under two sampling methodologies: authors sampled through a context-sensitive topic-based search, and authors sampled randomly. Topic-based sampling of authors is found to produce more accurate authorship predictions. This paper presents several theories as to why this might be the case.
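The character n-gram feature extraction mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of n = 3 and the lower-casing step are assumptions made purely for illustration, since the abstract does not specify them.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Return every overlapping character n-gram in text, in order."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def ngram_profile(tweets, n=3):
    """Relative frequency of each character n-gram over one author's tweets.

    Lower-casing is an illustrative assumption, not taken from the paper.
    """
    counts = Counter()
    for tweet in tweets:
        counts.update(char_ngrams(tweet.lower(), n))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def to_vector(profile, vocabulary):
    """Project a profile onto a fixed n-gram vocabulary (absent grams -> 0.0)."""
    return [profile.get(gram, 0.0) for gram in vocabulary]
```

Vectors produced this way could then be passed to a one-versus-all classifier, in which one binary support vector machine is trained per author and the author whose classifier gives the highest score is predicted.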
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Pan, L., Gondal, I., Layton, R. (2017). Improving Authorship Attribution in Twitter Through Topic-Based Sampling. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science, vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_20
DOI: https://doi.org/10.1007/978-3-319-63004-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5
eBook Packages: Computer Science, Computer Science (R0)