short-paper

A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts

Authors:
Shanta Phani, Information Technology, IIEST, India
Shibamouli Lahiri, University of Michigan, Ann Arbor, MI, USA
Arindam Biswas, Information Technology, IIEST, India
ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 4, Article No. 28, pp 1–15, https://doi.org/10.1145/3099473

Published: 16 August 2017

Abstract

Authorship attribution is a long-standing problem in natural language processing, and several statistical and computational methods have been applied to it. In this article, we propose methods for authorship attribution in Bengali. More specifically, we propose a supervised framework consisting of lexical and shallow features, and we investigate the possibility of using topic-modeling-inspired features, to classify documents according to their authors. We created a corpus of 3,000 disjoint samples drawn from nearly all the literary works of three eminent Bengali authors. Our models outperformed the state of the art, with more than 98% test accuracy for the shallow features and 100% test accuracy for the topic-based features. Further experiments with GloVe vectors [Pennington et al. 2014] showed comparable results, but flexible patterns based on content words and high-frequency words [Schwartz et al. 2013] failed to perform as well as expected.
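
To make the idea of a supervised, lexical-feature attribution framework concrete, the following is a minimal sketch and not the authors' implementation. It assumes word-unigram counts as the only lexical feature and a multinomial Naive Bayes classifier with add-one smoothing; the class name `NaiveBayesAttributor`, the whitespace tokenizer, and the toy English samples are all hypothetical stand-ins for the paper's Bengali corpus and feature set, which the abstract does not detail.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace tokenization; a stand-in for real Bengali preprocessing."""
    return text.lower().split()


class NaiveBayesAttributor:
    """Multinomial Naive Bayes over word-unigram counts with add-one smoothing."""

    def fit(self, documents, authors):
        self.word_counts = defaultdict(Counter)  # author -> word -> count
        self.doc_counts = Counter(authors)       # author -> number of samples
        self.vocabulary = set()
        for text, author in zip(documents, authors):
            tokens = tokenize(text)
            self.word_counts[author].update(tokens)
            self.vocabulary.update(tokens)
        self.total_docs = len(documents)
        return self

    def log_scores(self, text):
        # Words never seen in training are ignored.
        tokens = [t for t in tokenize(text) if t in self.vocabulary]
        vocab_size = len(self.vocabulary)
        scores = {}
        for author, n_docs in self.doc_counts.items():
            counts = self.word_counts[author]
            total = sum(counts.values())
            score = math.log(n_docs / self.total_docs)  # log prior
            for token in tokens:
                # Smoothed log likelihood of the token under this author.
                score += math.log((counts[token] + 1) / (total + vocab_size))
            scores[author] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Toy, made-up samples (English placeholders, not the paper's corpus).
train_docs = [
    "the river flows past the quiet village",
    "the river carries the boat to the village",
    "iron machines roar inside the city factory",
    "the city factory burns coal all night",
]
train_authors = ["author_a", "author_a", "author_b", "author_b"]

model = NaiveBayesAttributor().fit(train_docs, train_authors)
prediction = model.predict("a boat on the river near the village")
```

In this sketch each author is scored by a log prior plus the smoothed log likelihood of the sample's words, and the highest-scoring author is returned. A realistic setup would hold out test samples and report accuracy on them, as the abstract does.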

References

David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.
David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Adv. Neural Info. Process. Syst. 16 (2004), 17.
Victoria Bobicev, Marina Sokolova, Khaled El Emam, and Stan Matwin. 2013. Authorship attribution in health forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP’13). INCOMA Ltd. Shoumen, Bulgaria, 74--82.
Dasha Bogdanova and Angeliki Lazaridou. 2014. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).
Tanmoy Chakraborty. 2012. Authorship identification using stylometry analysis in Bengali literature. CoRR abs/1208.6268 (2012). http://arxiv.org/abs/1208.6268
Suprabhat Das and Pabitra Mitra. 2011. Author identification in Bengali literary works. In Pattern Recognition and Machine Intelligence, Sergei O. Kuznetsov, Deba P. Mandal, Malay K. Kundu, and Sankar K. Pal (Eds.). Lecture Notes in Computer Science, Vol. 6744. Springer, Berlin, 220--226.
Farkhund Iqbal, Rachid Hadjidj, Benjamin C. M. Fung, and Mourad Debbabi. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Dig. Invest. 5, Supplement (2008), S42--S51.
Siladitya Jana. 2015. Sister Nivedita’s influence on J. C. Bose’s writings. J. Assoc. Info. Sci. Technol. 66, 3 (2015), 645--650.
Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr. 1, 3 (Dec. 2006), 233--334.
Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60, 1 (Jan. 2009), 9--26.
Shibamouli Lahiri and Rada Mihalcea. 2013. Authorship attribution using word network features. CoRR abs/1311.2978 (2013). http://arxiv.org/abs/1311.2978
R. Layton, P. Watters, and R. Dazeley. 2010a. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
Robert Layton, Paul Watters, and Richard Dazeley. 2010b. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577--584.
Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111--3119.
Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. J. Amer. Statist. Assoc. 58, 302 (1963), 275--309.
Sibansu Mukhopadhyay, Tirthankar Dasgupta, and Anupam Basu. 2012. Development of an online repository of Bangla literary texts and its ontological representation for advance search options. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation Workshop Programme. Citeseer, 93.
S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, and B. VishnuVardhan. 2015. Empirical evaluations using character and word n-grams on authorship attribution for Telugu text. In Intelligent Computing and Applications, Durbadal Mandal, Rajib Kar, Swagatam Das, and Bijaya Ketan Panigrahi (Eds.). Advances in Intelligent Systems and Computing, Vol. 343. Springer, India, 613--623.
A. Jamal Nasir, Nico Görnitz, and Ulf Brefeld. 2014. An off-the-shelf approach to authorship attribution. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING’14). Dublin City University and Association for Computational Linguistics, 895--904.
Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825--2830.
Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, Doha, Qatar, 1532--1543. Retrieved from http://www.aclweb.org/anthology/D14-1162
Shanta Phani, Shibamouli Lahiri, and Arindam Biswas. 2015. Authorship attribution in Bengali language. In Proceedings of the 12th International Conference on Natural Language Processing (ICON’15).
Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487--494.
Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06). Association for Computational Linguistics, Stroudsburg, PA, 482--491.
Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 78--86.
Jacques Savoy. 2013. Authorship attribution based on a probabilistic topic model. Info. Process. Manage. 49, 1 (2013), 341--354.
Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1880--1891. http://aclweb.org/anthology/D13-1193
Santiago Segarra, Mark Eisen, and Alejandro Ribeiro. 2014. Authorship attribution through function word adjacency networks. CoRR abs/1406.4469 (2014). http://arxiv.org/abs/1406.4469
Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, 264--269.
Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2014. Authorship attribution with topic models. Comput. Linguist. 40, 2 (June 2014), 269--310.
Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538--556.
Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 306--315.
Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Association for Computational Linguistics, 59--63.
Ying Zhao, Justin Zobel, and Phil Vines. 2006. Using Relative Entropy for Authorship Attribution. Springer, Berlin, 92--105.

Index Terms

1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
      1. Information extraction
      2. Language resources

Published in
ACM Transactions on Asian and Low-Resource Language Information Processing Volume 16, Issue 4
December 2017
146 pages
ISSN:2375-4699
EISSN:2375-4702
DOI:10.1145/3097269
Editor:
Nianwen Xue
Brandeis University, Waltham, USA
Copyright © 2017 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 16 August 2017
- Revised: 1 May 2017
- Accepted: 1 May 2017
- Received: 1 September 2016
Author Tags
Authorship attribution
Naive Bayes
lexical features
machine learning
topic model
Qualifiers
- short-paper
- Research
- Refereed
Article Metrics
- Total Citations: 11
- Total Downloads: 252
- Downloads (Last 12 months): 22
- Downloads (Last 6 weeks): 2