Content-Based Authorship Identification for Short Texts in Social Media Networks

de la Puerta, José Gaviria; Pastor-López, Iker; Hernández, Javier Salcedo; Tellaeche, Alberto; Sanz, Borja; Sanjurjo-González, Hugo; Cuzzocrea, Alfredo; Bringas, Pablo G.

doi:10.1007/978-3-030-86271-8_3

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12886)

Included in the following conference series:

International Conference on Hybrid Artificial Intelligence Systems

Abstract

Today, social networks contain a large number of false profiles that can carry out malicious actions against other users, such as radicalization or defamation. It is therefore necessary to identify the same false profile and its behavior across different social networks in order to take action against it. To this end, this article presents a new approach based on behavior analysis for identifying text authorship in social networks.

The work presented in this paper was supported by the European Commission under contract H2020-700367 DANTE.

Notes

1. The DANTE (Detecting and analysing terrorist-related online contents and financing activities) project aims to deliver more effective, efficient, automated data mining and analytics solutions and an integrated system to detect, retrieve, collect and analyse huge amounts of heterogeneous and complex multimedia and multi-language terrorist-related content, from both the Surface and the Deep Web, including Dark nets. More information at http://www.h2020-dante.eu/.
2. Stop words are the most common words in a language, which are normally filtered out during the pre-processing step of a natural language processing experiment.
3. More information at https://spark.apache.org/mllib/.
4. Cosine similarity measures the cosine of the angle between two vectors, that is, their orientation rather than their magnitude. It is commonly used for measuring the similarity between two documents represented in a normalized vector space model.
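
Notes 2 and 4 can be illustrated together with a minimal sketch. It assumes a tiny, hypothetical stop-word list and plain term-frequency vectors, and it uses only the Python standard library. It is not the pipeline used in the paper, whose feature extraction is not described in this excerpt; the names `STOP_WORDS`, `tokenize`, `cosine_similarity` and `text_similarity` are illustrative only.

```python
import math
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (assumption for illustration;
# real experiments use a much larger, language-specific list).
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}


def tokenize(text):
    """Lowercase, split on whitespace, strip edge punctuation, drop stop words."""
    words = (word.strip(".,;:!?\"'()") for word in text.lower().split())
    return [w for w in words if w and w not in STOP_WORDS]


def cosine_similarity(u, v):
    """Cosine of the angle between two sparse term-frequency vectors.

    u and v map each term to its count (e.g. collections.Counter objects).
    """
    dot = sum(count * v.get(term, 0) for term, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    if norm_u == 0 or norm_v == 0:
        # Convention chosen for this sketch: an empty text is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)


def text_similarity(a, b):
    """Cosine similarity of two texts after stop-word filtering."""
    return cosine_similarity(Counter(tokenize(a)), Counter(tokenize(b)))
```

Because term counts are non-negative, the result lies between 0 (no shared terms) and 1 (identical term distributions); text length alone does not change the score, since only the direction of the vectors matters.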

Author information

Authors and Affiliations

Faculty of Engineering, University of Deusto, Avda Universidades 24, 48007, Bilbao, Spain
José Gaviria de la Puerta, Iker Pastor-López, Javier Salcedo Hernández, Alberto Tellaeche, Borja Sanz, Hugo Sanjurjo-González & Pablo G. Bringas
University of Calabria, Rende, Italy
Alfredo Cuzzocrea

Corresponding author

Correspondence to José Gaviria de la Puerta.

Editor information

Editors and Affiliations

University of Deusto, Bilbao, Spain
Hugo Sanjurjo González
University of Deusto, Bilbao, Spain
Iker Pastor López
University of Deusto, Bilbao, Spain
Pablo García Bringas
University of A Coruña, A Coruña, Spain
Héctor Quintián
University of Salamanca, Salamanca, Spain
Emilio Corchado

About this paper

Cite this paper

de la Puerta, J.G. et al. (2021). Content-Based Authorship Identification for Short Texts in Social Media Networks. In: Sanjurjo González, H., Pastor López, I., García Bringas, P., Quintián, H., Corchado, E. (eds) Hybrid Artificial Intelligent Systems. HAIS 2021. Lecture Notes in Computer Science, vol 12886. Springer, Cham. https://doi.org/10.1007/978-3-030-86271-8_3

DOI: https://doi.org/10.1007/978-3-030-86271-8_3
Published: 15 September 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-86270-1
Online ISBN: 978-3-030-86271-8
eBook Packages: Computer Science, Computer Science (R0)