Document embeddings learned on various types of n-grams for cross-topic authorship attribution

Gómez-Adorno, Helena; Posadas-Durán, Juan-Pablo; Sidorov, Grigori; Pinto, David

doi:10.1007/s00607-018-0587-8

Published: 25 January 2018

Computing, volume 100, pages 741–756 (2018)

Abstract

Recently, document embedding methods have been proposed that aim to capture hidden properties of texts. These methods make it possible to represent documents as fixed-length, continuous, and dense feature vectors. In this paper, we propose to learn document vectors based on n-grams, not only on words, using the recently proposed Paragraph Vector method. The n-grams include character n-grams, word n-grams, and n-grams of POS tags (in all cases with n varying from 1 to 5). We considered the task of cross-topic authorship attribution and ran experiments on The Guardian corpus. Experimental results show that our method outperforms word-based embeddings and character n-gram based linear models, which are among the most effective approaches for identifying the writing style of an author.
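
As an illustration of the feature types named in the abstract, the following minimal sketch extracts character, word, and POS-tag n-grams for n from 1 to 5. It is not the authors' implementation: the function names, the underscore separator, whitespace tokenization, and the assumption that POS tags are supplied already computed are all hypothetical choices made for this example. Each resulting token list could then serve as the "words" of a document when training a Paragraph Vector model.

```python
from typing import Dict, List, Sequence, Tuple


def ngrams(items: Sequence[str], n: int, sep: str) -> List[str]:
    """Return all contiguous n-grams of `items`, each joined by `sep`."""
    return [sep.join(items[i:i + n]) for i in range(len(items) - n + 1)]


def char_ngrams(text: str, n: int) -> List[str]:
    # Characters are joined without a separator, so "abc" yields "ab", "bc".
    return ngrams(list(text), n, "")


def word_ngrams(text: str, n: int) -> List[str]:
    # Assumption: plain whitespace tokenization; a real pipeline may differ.
    return ngrams(text.split(), n, "_")


def ngram_views(
    text: str, pos_tags: Sequence[str], max_n: int = 5
) -> Dict[Tuple[str, int], List[str]]:
    """Build one token list per (feature type, n) pair, for n = 1..max_n.

    `pos_tags` is assumed to be produced beforehand by some POS tagger.
    """
    views: Dict[Tuple[str, int], List[str]] = {}
    for n in range(1, max_n + 1):
        views[("char", n)] = char_ngrams(text, n)
        views[("word", n)] = word_ngrams(text, n)
        views[("pos", n)] = ngrams(list(pos_tags), n, "_")
    return views


if __name__ == "__main__":
    views = ngram_views("the cat sat", ["DT", "NN", "VBD"])
    print(views[("word", 2)])
    print(views[("pos", 3)])
```

Keeping one token list per feature type and per n leaves open how the views are combined afterwards (for example, one embedding per view versus a single embedding over merged tokens); the abstract does not specify this.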

Acknowledgements

This work was done under partial support of the Mexican Government (CONACYT Project 240844, SNI, COFAA - IPN, SIP - IPN 20162204, 20151406, 20144274).

Author information

Authors and Affiliations

Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
Helena Gómez-Adorno & Grigori Sidorov
Escuela Superior de Ingeniería Mecánica y Eléctrica Unidad Zacatenco (ESIME-Zacatenco), Instituto Politécnico Nacional (IPN), Mexico City, Mexico
Juan-Pablo Posadas-Durán
Facultad de Ciencias de la Computación, Benemérita Universidad Autónoma de Puebla (BUAP), Puebla, Mexico
David Pinto

Corresponding author

Correspondence to Helena Gómez-Adorno.

About this article

Cite this article

Gómez-Adorno, H., Posadas-Durán, JP., Sidorov, G. et al. Document embeddings learned on various types of n-grams for cross-topic authorship attribution. Computing 100, 741–756 (2018). https://doi.org/10.1007/s00607-018-0587-8

Received: 19 June 2017
Accepted: 06 January 2018
Published: 25 January 2018
Issue Date: July 2018
DOI: https://doi.org/10.1007/s00607-018-0587-8
