short-paper

A Supervised Learning Approach for Authorship Attribution of Bengali Literary Texts

Published: 16 August 2017

Abstract

Authorship Attribution is a long-standing problem in Natural Language Processing. Several statistical and computational methods have been used to find a solution to this problem. In this article, we have proposed methods to deal with the authorship attribution problem in Bengali. More specifically, we proposed a supervised framework consisting of lexical and shallow features and investigated the possibility of using topic-modeling-inspired features, to classify documents according to their authors. We have created a corpus from nearly all the literary works of three eminent Bengali authors, consisting of 3,000 disjoint samples. Our models showed better performance than the state-of-the-art, with more than 98% test accuracy for the shallow features and 100% test accuracy for the topic-based features. Further experiments with GloVe vectors [Pennington et al. 2014] showed comparable results, but flexible patterns based on content words and high-frequency words [Schwartz et al. 2013] failed to perform as well as expected.
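The abstract names lexical and shallow features without listing them here, so the following is only a minimal sketch of the general idea of supervised attribution from shallow features. It assumes character trigram counts as the features and a nearest-profile rule with cosine similarity in place of the classifiers the article evaluates. All function names (`char_ngrams`, `train`, `predict`) and the toy samples are hypothetical and are not taken from the article.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams (a common shallow stylometric feature)."""
    text = " ".join(text.split())  # collapse runs of whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors stored as Counters."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(samples):
    """Build one summed n-gram profile per author from (author, text) pairs."""
    profiles = {}
    for author, text in samples:
        profiles.setdefault(author, Counter()).update(char_ngrams(text))
    return profiles


def predict(profiles, text):
    """Return the author whose profile is most similar to the text's n-gram counts."""
    feats = char_ngrams(text)
    return max(profiles, key=lambda author: cosine(profiles[author], feats))


# Toy labeled samples (hypothetical; a real corpus would hold many long samples per author).
samples = [
    ("A", "the river flows slowly past the old village"),
    ("A", "the river bends near the quiet village"),
    ("B", "markets crashed when traders panicked yesterday"),
    ("B", "traders sold stocks as markets tumbled"),
]
profiles = train(samples)
guess = predict(profiles, "the village by the river")
```

Character n-grams need no tokenizer or language-specific resources, which is why they are a convenient stand-in here; word-level or topic-based features would replace `char_ngrams` while the rest of the pipeline stays the same.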

References

  1. David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. J. Mach. Learn. Res. 3 (2003), 993--1022.
  2. David M. Blei, Thomas L. Griffiths, Michael I. Jordan, and Joshua B. Tenenbaum. 2004. Hierarchical topic models and the nested Chinese restaurant process. Adv. Neural Info. Process. Syst. 16 (2004), 17.
  3. Victoria Bobicev, Marina Sokolova, Khaled El Emam, and Stan Matwin. 2013. Authorship attribution in health forums. In Proceedings of the International Conference Recent Advances in Natural Language Processing (RANLP’13). INCOMA Ltd. Shoumen, Bulgaria, 74--82.
  4. Dasha Bogdanova and Angeliki Lazaridou. 2014. Cross-language authorship attribution. In Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC-2014). European Language Resources Association (ELRA).
  5. Tanmoy Chakraborty. 2012. Authorship identification using stylometry analysis in Bengali literature. CoRR abs/1208.6268 (2012). http://arxiv.org/abs/1208.6268
  6. Suprabhat Das and Pabitra Mitra. 2011. Author identification in Bengali literary works. In Pattern Recognition and Machine Intelligence, Sergei O. Kuznetsov, Deba P. Mandal, Malay K. Kundu, and Sankar K. Pal (Eds.). Lecture Notes in Computer Science, Vol. 6744. Springer, Berlin, 220--226.
  7. Farkhund Iqbal, Rachid Hadjidj, Benjamin C. M. Fung, and Mourad Debbabi. 2008. A novel approach of mining write-prints for authorship attribution in e-mail forensics. Dig. Invest. 5, Supplement (2008), S42--S51.
  8. Siladitya Jana. 2015. Sister Nivedita’s influence on J. C. Bose’s writings. J. Assoc. Info. Sci. Technol. 66, 3 (2015), 645--650.
  9. Patrick Juola. 2006. Authorship attribution. Found. Trends Inf. Retr. 1, 3 (Dec. 2006), 233--334.
  10. Moshe Koppel, Jonathan Schler, and Shlomo Argamon. 2009. Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60, 1 (Jan. 2009), 9--26.
  11. Shibamouli Lahiri and Rada Mihalcea. 2013. Authorship attribution using word network features. CoRR abs/1311.2978 (2013). http://arxiv.org/abs/1311.2978
  12. R. Layton, P. Watters, and R. Dazeley. 2010a. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
  13. Robert Layton, Paul Watters, and Richard Dazeley. 2010b. Authorship attribution for twitter in 140 characters or less. In Proceedings of the 2nd Cybercrime and Trustworthy Computing Workshop (CTC’10). 1--8.
  14. Wei Li and Andrew McCallum. 2006. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of the 23rd International Conference on Machine Learning. ACM, 577--584.
  15. Tomas Mikolov, Ilya Sutskever, Kai Chen, Gregory S. Corrado, and Jeffrey Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in Neural Information Processing Systems 26: Proceedings of the 27th Annual Conference on Neural Information Processing Systems. 3111--3119.
  16. Frederick Mosteller and David L. Wallace. 1963. Inference in an authorship problem: A comparative study of discrimination methods applied to the authorship of the disputed federalist papers. J. Amer. Statist. Assoc. 58, 302 (1963), 275--309.
  17. Sibansu Mukhopadhyay, Tirthankar Dasgupta, and Anupam Basu. 2012. Development of an online repository of Bangla literary texts and its ontological representation for advance search options. In Proceedings of the Workshop on Indian Language and Data: Resources and Evaluation Workshop Programme. Citeseer, 93.
  18. S. Nagaprasad, T. Raghunadha Reddy, P. Vijayapal Reddy, A. Vinaya Babu, and B. VishnuVardhan. 2015. Empirical evaluations using character and word n-grams on authorship attribution for Telugu text. In Intelligent Computing and Applications, Durbadal Mandal, Rajib Kar, Swagatam Das, and Bijaya Ketan Panigrahi (Eds.). Advances in Intelligent Systems and Computing, Vol. 343. Springer, India, 613--623.
  19. A. Jamal Nasir, Nico Görnitz, and Ulf Brefeld. 2014. An off-the-shelf approach to authorship attribution. In Proceedings of the 25th International Conference on Computational Linguistics: Technical Papers (COLING’14). Dublin City University and Association for Computational Linguistics, 895--904.
  20. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. 2011. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 12 (2011), 2825--2830.
  21. Jeffrey Pennington, Richard Socher, and Christopher Manning. 2014. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP’14). Association for Computational Linguistics, Doha, Qatar, 1532--1543. Retrieved from http://www.aclweb.org/anthology/D14-1162
  22. Shanta Phani, Shibamouli Lahiri, and Arindam Biswas. Authorship attribution in Bengali language. In Proceedings of the 12th International Conference on Natural Language Processing (ICON’15).
  23. Michal Rosen-Zvi, Thomas Griffiths, Mark Steyvers, and Padhraic Smyth. 2004. The author-topic model for authors and documents. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence. AUAI Press, 487--494.
  24. Conrad Sanderson and Simon Guenter. 2006. Short text authorship attribution via sequence kernels, Markov chains and author unmasking: An investigation. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing (EMNLP’06). Association for Computational Linguistics, Stroudsburg, PA, 482--491.
  25. Ruchita Sarawgi, Kailash Gajulapalli, and Yejin Choi. 2011. Gender attribution: Tracing stylometric evidence beyond topic and genre. In Proceedings of the 15th Conference on Computational Natural Language Learning. Association for Computational Linguistics, 78--86.
  26. Jacques Savoy. 2013. Authorship attribution based on a probabilistic topic model. Info. Process. Manage. 49, 1 (2013), 341--354.
  27. Roy Schwartz, Oren Tsur, Ari Rappoport, and Moshe Koppel. 2013. Authorship attribution of micro-messages. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1880--1891. http://aclweb.org/anthology/D13-1193
  28. Santiago Segarra, Mark Eisen, and Alejandro Ribeiro. 2014. Authorship attribution through function word adjacency networks. CoRR abs/1406.4469 (2014). http://arxiv.org/abs/1406.4469
  29. Yanir Seroussi, Fabian Bohnert, and Ingrid Zukerman. 2012. Authorship attribution with author-aware topic models. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Association for Computational Linguistics, Jeju Island, Korea, 264--269.
  30. Yanir Seroussi, Ingrid Zukerman, and Fabian Bohnert. 2014. Authorship attribution with topic models. Comput. Linguist. 40, 2 (June 2014), 269--310.
  31. Efstathios Stamatatos. 2009. A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60, 3 (March 2009), 538--556.
  32. Mark Steyvers, Padhraic Smyth, Michal Rosen-Zvi, and Thomas Griffiths. 2004. Probabilistic author-topic models for information discovery. In Proceedings of the 10th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 306--315.
  33. Andreas van Cranenburgh. 2012. Literary authorship attribution with phrase-structure fragments. In Proceedings of the NAACL-HLT 2012 Workshop on Computational Linguistics for Literature. Association for Computational Linguistics, 59--63.
  34. Ying Zhao, Justin Zobel, and Phil Vines. 2006. Using Relative Entropy for Authorship Attribution. Springer, Berlin, 92--105.

      • Published in

        ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 16, Issue 4
        December 2017
        146 pages
        ISSN: 2375-4699
        EISSN: 2375-4702
        DOI: 10.1145/3097269

        Copyright © 2017 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 16 August 2017
        • Revised: 1 May 2017
        • Accepted: 1 May 2017
        • Received: 1 September 2016


        Qualifiers

        • short-paper
        • Research
        • Refereed
