Abstract
Like any other text categorization task, authorship attribution requires a large number of training examples. Such examples are easily obtained for most tasks but are particularly difficult to obtain for this one. Motivated by this, we investigate the possibility of using Web-based text mining methods to identify the author of a given poem. In particular, we propose a semi-supervised method specially suited to working with just a few training examples, in order to tackle the lack of data with the same writing style. The method automatically extracts unlabeled examples from the Web and iteratively integrates them into the training data set. To the authors' knowledge, a semi-supervised method that uses the Web as a supporting lexical resource has not previously been employed in this task. The results obtained on poem categorization show that this method may improve classification accuracy and that it is appropriate for the attribution of short documents.
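The iterative integration of unlabeled examples described above can be illustrated with a minimal self-training sketch. It is not the authors' implementation: the abstract does not specify the classifier, the features, or the selection criterion, so the multinomial Naive Bayes model, the whitespace tokenizer, and the `min_margin` confidence threshold below are all assumptions made for illustration. Web retrieval is also omitted; the `unlabeled` list simply stands in for text snippets already downloaded from the Web.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: plain lowercase whitespace tokenization.
    return text.lower().split()


def train_nb(labeled):
    """Fit a multinomial Naive Bayes model on (text, author) pairs."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    vocab = set()
    for text, author in labeled:
        tokens = tokenize(text)
        word_counts[author].update(tokens)
        doc_counts[author] += 1
        vocab.update(tokens)
    return word_counts, doc_counts, vocab


def predict(model, text):
    """Return (best_author, margin); margin is the log-score gap to the runner-up."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for author in doc_counts:
        total = sum(word_counts[author].values())
        score = math.log(doc_counts[author] / total_docs)
        for tok in tokenize(text):
            if tok in vocab:  # words unseen in training are ignored
                # Laplace (add-one) smoothing
                score += math.log((word_counts[author][tok] + 1) / (total + len(vocab)))
        scores[author] = score
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    margin = ranked[0][1] - ranked[1][1] if len(ranked) > 1 else float("inf")
    return ranked[0][0], margin


def self_train(labeled, unlabeled, min_margin=1.0, max_iters=5):
    """Iteratively move confidently classified snippets into the training set.

    min_margin and max_iters are hypothetical parameters, not taken from the paper.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iters):
        model = train_nb(labeled)
        confident, rest = [], []
        for snippet in pool:
            author, margin = predict(model, snippet)
            (confident if margin >= min_margin else rest).append((snippet, author))
        if not confident:
            break  # nothing new was added, so a further pass would change nothing
        labeled.extend(confident)
        pool = [snippet for snippet, _ in rest]
    return train_nb(labeled), labeled
```

Calling `self_train(seed_poems, web_snippets)` returns the final model together with the enlarged training set. Snippets the classifier cannot label with a sufficient margin are left out, which limits the noise that self-training can introduce when the seed set is small.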
This work was done under partial support of CONACYT-Mexico (43990, C01-39957), MCyT-Spain (TIN2006-15265-C06-04) and PROMEP (UGTO-121).
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., Villaseñor-Pineda, L. (2008). A Web-Based Self-training Approach for Authorship Attribution. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_16
DOI: https://doi.org/10.1007/978-3-540-85287-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)