Advancing the Use of Information Compression Distances in Authorship Attribution

  • Conference paper
Disinformation in Open Online Media (MISDOOM 2022)

Abstract

Detecting unreliable information in social media is an open challenge, in part because it is difficult to associate a piece of information with known and trustworthy actors. Identifying the origin of a source can help society deal with unverified, incomplete, or even false information. In this work we tackle the problem of associating a piece of information with a particular politician. Inaccurate information is especially relevant in the case of politicians, since it affects social perception and voting behavior. Moreover, misquotation can be weaponized to damage an adversary's reputation. We apply a compression-based metric to authorship attribution in social media, specifically on Twitter: we use the Normalized Compression Distance (NCD) to compare an author's text with other authors' texts. We show that this methodology performs well, obtaining 80.3% accuracy in a scenario with six different politicians.
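As a rough illustration of the metric named in the abstract, the NCD in its standard formulation is NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(·) is the length of the compressed input and xy is the concatenation of x and y. The sketch below is a minimal example and not the paper's pipeline: the abstract does not state which compressor or classifier is used, so the choice of `zlib`, the nearest-profile rule in `attribute`, and the sample texts are all assumptions made for illustration.

```python
import zlib


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (compressor is an assumption)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Values near 0 suggest the texts share a lot of structure;
    values near 1 suggest they share little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(text: str, profiles: dict[str, str]) -> str:
    """Hypothetical nearest-profile rule: return the author whose
    concatenated reference text has the smallest NCD to `text`."""
    return min(profiles, key=lambda author: ncd(text, profiles[author]))


if __name__ == "__main__":
    # Invented reference texts, one profile per (hypothetical) author.
    profiles = {
        "author_a": "We must cut taxes and reduce regulation for small businesses. " * 10,
        "author_b": "Healthcare is a right and we must act on the climate crisis. " * 10,
    }
    query = "Small businesses need lower taxes and less regulation."
    for author, reference in profiles.items():
        print(author, round(ncd(query, reference), 3))
    print("closest profile:", attribute(query, profiles))
```

With short inputs such as tweets, the fixed overhead of a general-purpose compressor can dominate the compressed length, so in practice one would probably compare against larger per-author collections rather than single messages.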

Notes

  1. The original dataset can be downloaded from https://www.reddit.com/r/datasets/comments/6fniik/over_one_million_tweets_collected_from_us/.

  2. For more information visit https://datatracker.ietf.org/doc/html/rfc1951.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 872855 (TRESCA project), from Grant PLEC2021-007681 (project XAI-DisInfodemics) funded by MCIN/AEI/ 10.13039/501100011033 and by European Union NextGeneration EU/PRTR, from Comunidad de Madrid (Spain) under the project CYNAMON (no. P2018/TCS-4566), cofunded with FSE and FEDER EU funds, and from Spanish projects MINECO/FEDER TIN2017-84452-R and PID2020-114867RB-I00 (http://www.mineco.gob.es/).

Author information

Corresponding author

Correspondence to Christian Oliva.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Muñoz, S.P., Oliva, C., Lago-Fernández, L.F., Arroyo, D. (2022). Advancing the Use of Information Compression Distances in Authorship Attribution. In: Spezzano, F., Amaral, A., Ceolin, D., Fazio, L., Serra, E. (eds) Disinformation in Open Online Media. MISDOOM 2022. Lecture Notes in Computer Science, vol 13545. Springer, Cham. https://doi.org/10.1007/978-3-031-18253-2_8

  • DOI: https://doi.org/10.1007/978-3-031-18253-2_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18252-5

  • Online ISBN: 978-3-031-18253-2

  • eBook Packages: Computer Science, Computer Science (R0)
