A Web-Based Self-training Approach for Authorship Attribution

Guzmán-Cabrera, Rafael; Montes-y-Gómez, Manuel; Rosso, Paolo; Villaseñor-Pineda, Luis

doi:10.1007/978-3-540-85287-2_16

Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Abstract

Like any other text categorization task, authorship attribution requires a large number of training examples. Such examples, which are easily obtained for most tasks, are particularly difficult to obtain in this case. For this reason, in this paper we investigate the possibility of using Web-based text mining methods to identify the author of a given poem. In particular, we propose a semi-supervised method that is especially suited to working with just a few training examples, in order to tackle the lack of data written in the same style. The method automatically extracts unlabeled examples from the Web and iteratively integrates them into the training data set. To the authors' knowledge, no semi-supervised method that uses the Web as a supporting lexical resource has previously been employed in this task. The results obtained on poem categorization show that this method may improve classification accuracy and that it is appropriate for the attribution of short documents.
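The iterative integration of unlabeled examples described in the abstract can be illustrated with a generic self-training loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the abstract does not specify the classifier, the features, or the selection criterion, so the word-level Naive Bayes model, the `threshold` parameter, and all function names here are hypothetical. The Web retrieval step is replaced by an in-memory list of unlabeled snippets, and the toy texts are invented for illustration.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs):
    """Fit a multinomial Naive Bayes model on (text, author) pairs."""
    word_counts = defaultdict(Counter)  # author -> word frequencies
    doc_counts = Counter()              # author -> number of documents
    for text, author in docs:
        doc_counts[author] += 1
        word_counts[author].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab


def posteriors(model, text):
    """Return {author: P(author | text)} using Laplace smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    log_scores = {}
    for author in doc_counts:
        total_words = sum(word_counts[author].values())
        score = math.log(doc_counts[author] / total_docs)
        for word in text.lower().split():
            count = word_counts[author][word]
            score += math.log((count + 1) / (total_words + len(vocab)))
        log_scores[author] = score
    # Normalise in log space to avoid numerical underflow.
    top = max(log_scores.values())
    exp_scores = {a: math.exp(s - top) for a, s in log_scores.items()}
    norm = sum(exp_scores.values())
    return {a: v / norm for a, v in exp_scores.items()}


def self_train(labeled, unlabeled, threshold=0.6, max_iters=5):
    """Grow the training set with confidently classified unlabeled texts.

    Assumption: an example is accepted when the top posterior reaches
    `threshold`; the paper's actual selection rule may differ.
    """
    training = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_iters):
        model = train_nb(training)
        confident, remaining = [], []
        for text in pool:
            probs = posteriors(model, text)
            author = max(probs, key=probs.get)
            if probs[author] >= threshold:
                confident.append((text, author))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to add, so stop iterating
        training.extend(confident)
        pool = remaining
    return train_nb(training), training


# Toy data (hypothetical): one labeled line per author, plus unlabeled
# snippets standing in for text that would be retrieved from the Web.
labeled = [("the sea and the salt wind", "A"),
           ("iron streets and grey smoke", "B")]
unlabeled = ["salt sea foam and wind", "grey iron smoke over streets"]

model, training = self_train(labeled, unlabeled)
probs = posteriors(model, "wind over the sea")
predicted = max(probs, key=probs.get)
```

In this sketch the threshold trades off how much unlabeled text is absorbed against the risk of reinforcing early misclassifications, which is the usual concern with self-training on very small seed sets.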

This work was done under partial support of CONACYT-Mexico (43990, C01-39957), MCyT-Spain (TIN2006-15265-C06-04) and PROMEP (UGTO-121).

Author information

Authors and Affiliations

FIMEE, Universidad de Guanajuato, México
Rafael Guzmán-Cabrera
NLE Lab, DSIC, Universidad Politécnica de Valencia, Spain
Rafael Guzmán-Cabrera & Paolo Rosso
LabTL, Instituto Nacional de Astrofísica, Óptica y Electrónica, México
Manuel Montes-y-Gómez & Luis Villaseñor-Pineda

Editor information

Editors and Affiliations

Department of Computer Science and Engineering, Chalmers University of Technology, 41296, Göteborg, Sweden
Bengt Nordström & Aarne Ranta

About this paper

Cite this paper

Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., Villaseñor-Pineda, L. (2008). A Web-Based Self-training Approach for Authorship Attribution. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_16

DOI: https://doi.org/10.1007/978-3-540-85287-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-85286-5
Online ISBN: 978-3-540-85287-2
eBook Packages: Computer Science, Computer Science (R0)