Abstract
Authorship attribution is a long-standing problem in natural language processing, and many statistical and computational methods have been applied to it. In this article, we propose methods for authorship attribution in Bengali. Specifically, we propose a supervised framework built on lexical and shallow features, and we investigate topic-modeling-inspired features, to classify documents by author. We created a corpus of 3,000 disjoint samples drawn from nearly all the literary works of three eminent Bengali authors. Our models outperformed the state of the art, with more than 98% test accuracy for the shallow features and 100% test accuracy for the topic-based features. Further experiments with GloVe vectors [Pennington et al. 2014] gave comparable results, but flexible patterns based on content words and high-frequency words [Schwartz et al. 2013] did not perform as well as expected.
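As a rough illustration of what a supervised pipeline over shallow features can look like, the sketch below extracts a few surface statistics per sample and assigns an author by nearest centroid. It is not the framework evaluated in the article: the four features, the nearest-centroid classifier, and the names `shallow_features`, `train_centroids`, and `predict` are all assumptions chosen to keep the example small and dependency-free.

```python
import math
import re
from collections import defaultdict


def shallow_features(text):
    """Return a small vector of shallow stylometric features (illustrative choice)."""
    # Split on the Bengali danda as well as Latin sentence-ending punctuation.
    sentences = [s for s in re.split(r"[।.!?]+", text) if s.strip()]
    words = text.split()  # whitespace tokens; punctuation stays attached
    n_words = max(len(words), 1)
    return [
        sum(len(w) for w in words) / n_words,  # mean token length
        n_words / max(len(sentences), 1),      # mean sentence length in tokens
        len(set(words)) / n_words,             # type-token ratio
        text.count(",") / n_words,             # commas per token
    ]


def train_centroids(samples):
    """samples: iterable of (text, author). Returns author -> mean feature vector."""
    grouped = defaultdict(list)
    for text, author in samples:
        grouped[author].append(shallow_features(text))
    return {
        author: [sum(col) / len(vecs) for col in zip(*vecs)]
        for author, vecs in grouped.items()
    }


def predict(centroids, text):
    """Assign the author whose centroid is closest in Euclidean distance."""
    vec = shallow_features(text)
    return min(centroids, key=lambda author: math.dist(vec, centroids[author]))
```

A real experiment would use held-out test samples, feature scaling, and a stronger classifier; the point here is only the shape of the pipeline: text to feature vector to author label.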
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.
- David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Adv. Neural Info. Process. Syst. 16 (2004), 17.
- Victoria Bobicev, Marina Sokolova, Khaled El Emam, and Stan Matwin. 2013. Authorship attribution in health forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP’13). INCOMA Ltd. Shoumen, Bulgaria, 74--82.
- Dasha Bogdanova and Angeliki Lazaridou. 2014. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).
- Tanmoy Chakraborty. 2012. Authorship identification using stylometry analysis in Bengali literature. CoRR abs/1208.6268 (2012). http://arxiv.org/abs/1208.6268
- Suprabhat Das and Pabitra Mitra. 2011. Author identification in Bengali literary works. In Pattern Recognition and Machine Intelligence, Sergei O. Kuznetsov, Deba P. Mandal, Malay K. Kundu, and Sankar K. Pal (Eds.). Lecture Notes in Computer Science, Vol. 6744. Springer, Berlin, 220--226.
- Farkhund Iqbal, Rachid Hadjidj, Benjamin C. M. Fung, and Mourad Debbabi. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Dig. Invest. 5, Supplement (2008), S42--S51.
- Siladitya Jana. 2015. Sister Nivedita’s influence on J. C. Bose’s writings. J. Assoc. Info. Sci. Technol. 66, 3 (2015), 645--650.
- Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr. 1, 3 (Dec. 2006), 233--334.
- Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60, 1 (Jan. 2009), 9--26.
- Shibamouli Lahiri and Rada Mihalcea. 2013. Authorship attribution using word network features. CoRR abs/1311.2978 (2013). http://arxiv.org/abs/1311.2978
- R. Layton, P. Watters, and R. Dazeley. 2010a. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
- Robert Layton, Paul Watters, and Richard Dazeley. 2010b. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
- Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577--584.
- Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111--3119.
- Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. J. Amer. Statist. Assoc. 58, 302 (1963), 275--309.
- Sibansu Mukhopadhyay, Tirthankar Dasgupta, and Anupam Basu. 2012. Development of an online repository of Bangla literary texts and its ontological representation for advance search options. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation Workshop Programme. Citeseer, 93.
- S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, and B. VishnuVardhan. 2015. Empirical evaluations using character and word n-grams on authorship attribution for Telugu text. In Intelligent Computing and Applications, Durbadal Mandal, Rajib Kar, Swagatam Das, and Bijaya Ketan Panigrahi (Eds.). Advances in Intelligent Systems and Computing, Vol. 343. Springer, India, 613--623.
- A. Jamal Nasir, Nico Görnitz, and Ulf Brefeld. 2014. An off-the-shelf approach to authorship attribution. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING’14). Dublin City University and Association for Computational Linguistics, 895--904.
- Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825--2830.
- Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, Doha, Qatar, 1532--1543. Retrieved from http://www.aclweb.org/anthology/D14-1162
- Shanta Phani, Shibamouli Lahiri, and Arindam Biswas. 2015. Authorship attribution in Bengali language. In Proceedings of the 12th International Conference on Natural Language Processing (ICON’15).
- Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487--494.
- Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06). Association for Computational Linguistics, Stroudsburg, PA, 482--491.
- Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 78--86.
- Jacques Savoy. 2013. Authorship attribution based on a probabilistic topic model. Info. Process. Manage. 49, 1 (2013), 341--354.
- Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1880--1891. http://aclweb.org/anthology/D13-1193
- Santiago Segarra, Mark Eisen, and Alejandro Ribeiro. 2014. Authorship attribution through function word adjacency networks. CoRR abs/1406.4469 (2014). http://arxiv.org/abs/1406.4469
- Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, 264--269.
- Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2014. Authorship attribution with topic models. Comput. Linguist. 40, 2 (June 2014), 269--310.
- Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538--556.
- Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 306--315.
- Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Association for Computational Linguistics, 59--63.
- Ying Zhao, Justin Zobel, and Phil Vines. 2006. Using Relative Entropy for Authorship Attribution. Springer, Berlin, 92--105.
Index Terms
- A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts