Abstract
Authorship identification is a natural language processing task that has attracted growing interest in recent years. Earlier work relied on a wide variety of feature extraction methods to identify the author of a text; more recently, deep learning approaches have been applied to authorship attribution. This paper introduces ViBert4Author (V4A), a fine-tuned version of the pre-trained PhoBERT language model extended with a dense layer and a softmax layer. The model serves as the feature extraction method for author classification in Vietnamese literature. We also introduce VN-Literature, a dataset of over 800 works by 8 authors, collected with self-developed tools. To evaluate the model further, we ran experiments on English datasets (blogs and emails published on Kaggle) and tested pre-trained multilingual models. We give a comprehensive analysis of the advantages and disadvantages of the proposed method. In addition, we evaluate the extraction of additional features (stylometric and hybrid features), comparing approaches by F1-score. The results show that the proposed model outperforms previous methods, and that the variant combining stylistic features with modern methods performs particularly well.
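The hybrid idea in the abstract (an encoder representation combined with stylometric features, followed by a dense layer and softmax) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three style features, the toy embedding size, and the random weights are hypothetical, and the random vector merely stands in for a sentence representation that a fine-tuned encoder such as PhoBERT would produce. Only the number of authors (8) comes from the abstract.

```python
import math
import random
import string


def stylometric_features(text):
    """Hypothetical style features; the abstract does not list the real feature set."""
    words = text.split()
    n_chars = max(len(text), 1)
    n_words = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n_words,                    # mean token length
        sum(ch in string.punctuation for ch in text) / n_chars,  # ASCII punctuation ratio
        len({w.lower() for w in words}) / n_words,               # type-token ratio
    ]


def dense(x, weights, bias):
    """Fully connected layer: one weight row and one bias per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, text, weights, bias):
    """Concatenate the encoder vector with style features, then dense + softmax."""
    features = list(embedding) + stylometric_features(text)
    return softmax(dense(features, weights, bias))


NUM_AUTHORS = 8  # VN-Literature covers 8 authors (from the abstract)
EMBED_DIM = 4    # toy size (assumption); a real encoder vector is far larger
NUM_STYLE = 3    # length of the hypothetical stylometric feature vector

rng = random.Random(0)
# Stand-in for an encoder sentence vector (assumption: not real PhoBERT output).
embedding = [rng.uniform(-1, 1) for _ in range(EMBED_DIM)]
# Untrained random weights, used only to show the shapes involved.
weights = [
    [rng.uniform(-0.5, 0.5) for _ in range(EMBED_DIM + NUM_STYLE)]
    for _ in range(NUM_AUTHORS)
]
bias = [0.0] * NUM_AUTHORS

probs = classify(embedding, "Một câu văn ví dụ, ngắn gọn.", weights, bias)
predicted_author = max(range(NUM_AUTHORS), key=probs.__getitem__)
```

In this sketch `probs` is a distribution over the 8 author labels and `predicted_author` is the index of the most probable one. In a trained system the weights would be learned jointly with the encoder during fine-tuning rather than drawn at random.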
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Dong, K.D., Nguyen, D.T. (2022). Vietnamese Text’s Writing Styles Based Authorship Identification Model. In: Dang, T.K., Küng, J., Chung, T.M. (eds) Future Data and Security Engineering. Big Data, Security and Privacy, Smart City and Industry 4.0 Applications. FDSE 2022. Communications in Computer and Information Science, vol 1688. Springer, Singapore. https://doi.org/10.1007/978-981-19-8069-5_23
DOI: https://doi.org/10.1007/978-981-19-8069-5_23
Publisher Name: Springer, Singapore
Print ISBN: 978-981-19-8068-8
Online ISBN: 978-981-19-8069-5
eBook Packages: Computer Science, Computer Science (R0)