Improving Authorship Attribution in Twitter Through Topic-Based Sampling

Pan, Luoxi; Gondal, Iqbal; Layton, Robert

doi:10.1007/978-3-319-63004-5_20


Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10400)

Included in the following conference series:

Australasian Joint Conference on Artificial Intelligence


Abstract

Aliases are used as a means of anonymity on the Internet in environments such as Internet Relay Chat (IRC), forums and micro-blogging websites such as Twitter. While there are genuine reasons for using aliases, such as journalists operating in politically oppressive countries, aliases are increasingly being used by cybercriminals and extremist organisations. Recent years have seen increased research on authorship attribution of Twitter messages, including authorship analysis of aliases. Previous studies have shown that anti-aliasing of randomly generated sub-aliases yields high accuracy when linking the sub-aliases, but becomes much less accurate when topic-based sub-aliases are used. N-gram methods have previously been demonstrated to perform better than other methods in this situation. This paper investigates the effect of topic-based sampling on authorship attribution accuracy for the popular micro-blogging website Twitter. Features are extracted using character n-grams, which accurately capture differences in authorship style. These features are analysed with support vector machines arranged as a one-versus-all classifier. The predictive performance of the algorithm is then evaluated under two sampling methodologies: authors sampled through a context-sensitive topic-based search, and authors sampled randomly. Topic-based sampling of authors is found to produce more accurate authorship predictions. This paper presents several theories as to why this might be the case.
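
The pipeline named in the abstract (character n-gram features analysed by one-versus-all support vector machines) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the n-gram length of 3, the relative-frequency weighting, the training hyperparameters and the toy messages are all hypothetical, and a small hand-written linear SVM (stochastic subgradient descent on the regularised hinge loss) stands in for whatever SVM library the paper actually used.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a text (n=3 is an assumption)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def build_vocab(texts, n=3):
    """Sorted list of every n-gram seen in the training texts."""
    return sorted(set().union(*(char_ngrams(t, n) for t in texts)))


def to_vector(text, vocab, n=3):
    """Relative n-gram frequencies over a fixed vocabulary."""
    counts = char_ngrams(text, n)
    total = sum(counts.values()) or 1  # avoid dividing by zero on short texts
    return [counts[g] / total for g in vocab]


def score(model, x):
    """Signed decision value w.x + b of a linear model."""
    w, b = model
    return sum(wj * xj for wj, xj in zip(w, x)) + b


def train_binary_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Toy linear SVM: subgradient descent on the L2-regularised hinge loss.

    Labels y must be +1 or -1. Hyperparameters are illustrative guesses.
    """
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * score((w, b), xi)
            w = [wj * (1 - lr * lam) for wj in w]  # shrink step from the L2 term
            if margin < 1:  # hinge loss is active, so move towards this sample
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b


def train_one_vs_all(X, labels):
    """One binary SVM per author: that author (+1) against all others (-1)."""
    return {
        author: train_binary_svm(X, [1 if l == author else -1 for l in labels])
        for author in sorted(set(labels))
    }


def predict(models, x):
    """Attribute x to the author whose classifier gives the highest score."""
    return max(models, key=lambda author: score(models[author], x))


# Invented toy messages; real experiments would use many tweets per author.
train_texts = [
    "omg cant wait 4 the game 2nite!!!",
    "lol that was sooo funny!!!",
    "Indeed, the committee has reached its decision.",
    "However, the evidence remains inconclusive.",
]
train_authors = ["alice", "alice", "bob", "bob"]

vocab = build_vocab(train_texts)
X_train = [to_vector(t, vocab) for t in train_texts]
models = train_one_vs_all(X_train, train_authors)
guess = predict(models, to_vector("lol cant believe it!!!", vocab))
print(guess)
```

Comparing the two sampling methodologies would then amount to running the same pipeline on a topic-sampled author set and on a randomly sampled one, and comparing held-out accuracy between them.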



Author information

Authors and Affiliations

Monash University, Melbourne, Australia
Luoxi Pan
Internet Commerce Security Laboratory, Federation University Australia, Ballarat, Australia
Iqbal Gondal & Robert Layton


Corresponding author

Correspondence to Luoxi Pan.

Editor information

Editors and Affiliations

La Trobe University, Melbourne, Australia
Wei Peng
La Trobe Business School, La Trobe University, Bundoora, Victoria, Australia
Damminda Alahakoon
RMIT University, Melbourne, Australia
Xiaodong Li


Cite this paper

Pan, L., Gondal, I., Layton, R. (2017). Improving Authorship Attribution in Twitter Through Topic-Based Sampling. In: Peng, W., Alahakoon, D., Li, X. (eds) AI 2017: Advances in Artificial Intelligence. AI 2017. Lecture Notes in Computer Science, vol 10400. Springer, Cham. https://doi.org/10.1007/978-3-319-63004-5_20


DOI: https://doi.org/10.1007/978-3-319-63004-5_20
Published: 09 July 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-63003-8
Online ISBN: 978-3-319-63004-5
eBook Packages: Computer Science, Computer Science (R0)
