A Web-Based Self-training Approach for Authorship Attribution

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 5221)

Abstract

Like any other text categorization task, authorship attribution requires a large number of training examples. Such examples, which are easily obtained for most tasks, are particularly difficult to obtain in this case. For this reason, in this paper we investigate the possibility of using Web-based text mining methods to identify the author of a given poem. In particular, we propose a semi-supervised method that is specially suited to working with just a few training examples, in order to tackle the lack of data with the same writing style. The method automatically extracts unlabeled examples from the Web and iteratively integrates them into the training data set. To the knowledge of the authors, a semi-supervised method that uses the Web as a supporting lexical resource has not previously been employed in this task. The results obtained on poem categorization show that this method may improve classification accuracy and that it is appropriate for the attribution of short documents.
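The abstract does not give the details of the method, so the following is only a generic sketch of a self-training loop, not the authors' implementation. It assumes a toy word-count classifier with add-one smoothing, a confidence margin threshold (`min_margin`), and an in-memory list of unlabeled snippets standing in for text downloaded from the Web; all function names and parameters are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption; any tokenizer would do)."""
    return text.lower().split()


def train(labeled):
    """Build per-author word counts from (text, author) pairs."""
    counts = {}
    for text, author in labeled:
        counts.setdefault(author, Counter()).update(tokenize(text))
    return counts


def predict(counts, text):
    """Return (best_author, margin) from add-one smoothed log-likelihoods."""
    vocab = set().union(*counts.values())
    scores = {}
    for author, c in counts.items():
        total = sum(c.values()) + len(vocab) + 1
        scores[author] = sum(math.log((c[w] + 1) / total) for w in tokenize(text))
    ranked = sorted(scores, key=scores.get, reverse=True)
    if len(ranked) > 1:
        margin = scores[ranked[0]] - scores[ranked[1]]
    else:
        margin = float("inf")
    return ranked[0], margin


def self_train(labeled, unlabeled, min_margin=1.0, max_iters=5):
    """Iteratively move confidently classified snippets into the training set."""
    labeled = list(labeled)
    pool = list(unlabeled)  # stand-in for snippets retrieved from the Web
    for _ in range(max_iters):
        counts = train(labeled)
        confident, remaining = [], []
        for text in pool:
            author, margin = predict(counts, text)
            if margin >= min_margin:
                confident.append((text, author))
            else:
                remaining.append(text)
        if not confident:
            break  # nothing new to add, so stop early
        labeled.extend(confident)
        pool = remaining
    return train(labeled), labeled


if __name__ == "__main__":
    seed = [("the sea the waves the salt", "A"),
            ("stone mountain cold stone", "B")]
    snippets = ["waves and salt on the sea",
                "cold stone under the mountain",
                "and on under"]
    model, augmented = self_train(seed, snippets)
    for text, author in augmented:
        print(author, "|", text)
```

The margin threshold is one simple way to decide which unlabeled examples are trusted enough to join the training set; snippets that never reach it remain unlabeled and are not used.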

This work was done under partial support of CONACYT-Mexico (43990, C01-39957), MCyT-Spain (TIN2006-15265-C06-04) and PROMEP (UGTO-121).

Copyright information

© 2008 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Guzmán-Cabrera, R., Montes-y-Gómez, M., Rosso, P., Villaseñor-Pineda, L. (2008). A Web-Based Self-training Approach for Authorship Attribution. In: Nordström, B., Ranta, A. (eds) Advances in Natural Language Processing. GoTAL 2008. Lecture Notes in Computer Science, vol 5221. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-85287-2_16

  • DOI: https://doi.org/10.1007/978-3-540-85287-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-85286-5

  • Online ISBN: 978-3-540-85287-2

  • eBook Packages: Computer Science, Computer Science (R0)
