Advancing the Use of Information Compression Distances in Authorship Attribution

Muñoz, Santiago Palmero; Oliva, Christian; Lago-Fernández, Luis F.; Arroyo, David

doi:10.1007/978-3-031-18253-2_8

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13545)

Included in the following conference series:

Multidisciplinary International Symposium on Disinformation in Open Online Media

Abstract

Detecting unreliable information in social media is an open challenge, in part because it is difficult to associate a piece of information with known and trustworthy actors. Identifying the origin of sources can help society deal with unverified, incomplete, or even false information. In this work we tackle the problem of associating a piece of information with a particular politician. The use of inaccurate information is especially relevant in the case of politicians, since it affects social perception and voting behavior. Moreover, misquotation can be weaponized to damage an adversary’s reputation. We apply a compression-based metric to authorship attribution in social media, namely on Twitter. Specifically, we leverage the Normalized Compression Distance (NCD) to compare an author’s text with other authors’ texts. We show that this methodology performs well, obtaining 80.3% accuracy in a scenario with 6 different politicians.
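
As a rough illustration of the idea in the abstract, the sketch below computes the NCD in its standard form, NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)), where C(.) is the length of the compressed input. It is not the authors' implementation: the abstract does not state which compressor, preprocessing, or decision rule the chapter uses. Python's zlib (a DEFLATE-based compressor) and a simple nearest-author rule are assumptions made here, and the names compressed_size, ncd, and attribute are hypothetical.

```python
import zlib
from typing import Mapping


def compressed_size(data: bytes) -> int:
    """Length in bytes of the zlib-compressed input (assumed compressor)."""
    return len(zlib.compress(data, 9))


def ncd(x: str, y: str) -> float:
    """Normalized Compression Distance between two texts.

    Values near 0 suggest the texts share a lot of structure;
    values near 1 suggest they share little.
    """
    bx, by = x.encode("utf-8"), y.encode("utf-8")
    cx, cy = compressed_size(bx), compressed_size(by)
    cxy = compressed_size(bx + by)
    return (cxy - min(cx, cy)) / max(cx, cy)


def attribute(text: str, corpora: Mapping[str, str]) -> str:
    """Return the author whose reference corpus is closest to `text` by NCD.

    Nearest-author selection is an illustrative decision rule only,
    not necessarily the classifier used in the chapter.
    """
    return min(corpora, key=lambda author: ncd(text, corpora[author]))


if __name__ == "__main__":
    # Toy reference corpora; real use would concatenate many tweets per author.
    corpora = {
        "author_a": "the quick brown fox jumps over the lazy dog " * 20,
        "author_b": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 20,
    }
    query = "the quick brown fox jumps over the lazy dog"
    for name, reference in corpora.items():
        print(name, round(ncd(query, reference), 3))
    print("attributed to:", attribute(query, corpora))
```

Because real compressors add fixed overhead, the distance between a text and itself is small but usually not exactly zero, and very short inputs such as single tweets compress poorly. Concatenating several messages per author before comparison is one common way to work around this.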

Notes

1. The original dataset can be downloaded from https://www.reddit.com/r/datasets/comments/6fniik/over_one_million_tweets_collected_from_us/.
2. For more information visit https://datatracker.ietf.org/doc/html/rfc1951.

Acknowledgements

This project has received funding from the European Union’s Horizon 2020 Research and Innovation Programme under grant agreement No. 872855 (TRESCA project), from Grant PLEC2021-007681 (project XAI-DisInfodemics) funded by MCIN/AEI/ 10.13039/501100011033 and by European Union NextGeneration EU/PRTR, from Comunidad de Madrid (Spain) under the project CYNAMON (no. P2018/TCS-4566), cofunded with FSE and FEDER EU funds, and from Spanish projects MINECO/FEDER TIN2017-84452-R and PID2020-114867RB-I00 (http://www.mineco.gob.es/).

Author information

Authors and Affiliations

Institute for Physical and Information Technologies “Leonardo Torres Quevedo” (ITEFI), Consejo Superior de Investigaciones Científicas (CSIC), Madrid, Spain
Santiago Palmero Muñoz & David Arroyo
Universidad Autónoma de Madrid, 28049, Madrid, Spain
Christian Oliva & Luis F. Lago-Fernández

Corresponding author

Correspondence to Christian Oliva.

Editor information

Editors and Affiliations

Boise State University, Boise, ID, USA
Francesca Spezzano
Universidade do Vale do Rio dos Sinos, São Leopoldo, Brazil
Adriana Amaral
Centrum Wiskunde and Informatica, Amsterdam, The Netherlands
Davide Ceolin
Vanderbilt University, Nashville, TN, USA
Lisa Fazio
Boise State University, Boise, ID, USA
Edoardo Serra

About this paper

Cite this paper

Muñoz, S.P., Oliva, C., Lago-Fernández, L.F., Arroyo, D. (2022). Advancing the Use of Information Compression Distances in Authorship Attribution. In: Spezzano, F., Amaral, A., Ceolin, D., Fazio, L., Serra, E. (eds) Disinformation in Open Online Media. MISDOOM 2022. Lecture Notes in Computer Science, vol 13545. Springer, Cham. https://doi.org/10.1007/978-3-031-18253-2_8

DOI: https://doi.org/10.1007/978-3-031-18253-2_8
Published: 04 October 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18252-5
Online ISBN: 978-3-031-18253-2
eBook Packages: Computer Science, Computer Science (R0)
