Abstract
This paper investigates the application of the multi-layer perceptron (MLP) to the task of categorizing texts based on their authors’ style. This task is of particular importance for information retrieval applications involving very large document databases. The emphasis of this article is on determining the extent to which the MLP model can be fine-tuned to analyse such data successfully and uncover the stylistic differences among authors. The MLP-based method is compared with statistical techniques, such as discriminant analysis, that are widely used in stylistic studies. The comparison is based on classification performance, to provide an objective evaluation of the advantages of each method. A second aim of the study is to compare, for each method, the effectiveness of distinct features in uncovering author identity. To evaluate the effectiveness of the entire approach in greater depth, the results of the proposed MLP-based method are compared with those of established approaches: support vector machines (SVM), using both the original parameters employed by the MLP and term frequency–inverse document frequency (TF–IDF) parameters, and the cascade correlation approach. The proposed MLP-based approach is found to possess a number of advantages, such as high classification accuracy, broadly comparable to that of the SVM, coupled with the ability to algorithmically reduce the set of parameters used without adversely affecting the classification accuracy.
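As a concrete illustration of the TF–IDF parameters mentioned above, the following is a minimal sketch of one common weighting variant. The abstract does not state which variant the study used, so the formula (relative term frequency multiplied by log(N/df)), the whitespace tokenizer, the function name `tf_idf`, and the toy texts are all assumptions made for illustration only.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    Assumed weighting (one of several variants in the literature):
    tf = raw count / document length, idf = log(N / df).
    """
    # Naive lowercase whitespace tokenization, assumed for simplicity.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        vectors.append({
            term: (count / total) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return vectors


# Toy texts, invented purely for illustration.
docs = [
    "the minister said the budget is balanced",
    "the opposition said the budget is not balanced",
    "the weather is mild",
]
vectors = tf_idf(docs)
```

Under this weighting, a term that occurs in every document (such as "the" in the toy texts) receives a weight of zero, while terms concentrated in few documents are weighted more heavily. Vectors of this kind could then be passed to a classifier such as an SVM or an MLP.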



Acknowledgments
The present study was partly funded by the PENED 03ED97 research project of the General Secretariat for Research and Technology of the Hellenic Ministry of Development.
Cite this article
Tsimboukakis, N., Tambouratzis, G. A comparative study on authorship attribution classification tasks using both neural network and statistical methods. Neural Comput & Applic 19, 573–582 (2010). https://doi.org/10.1007/s00521-009-0314-7
DOI: https://doi.org/10.1007/s00521-009-0314-7