
Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features

  • Conference paper
Knowledge-Based and Intelligent Information and Engineering Systems (KES 2010)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 6277)

Abstract

In this work we investigate the use of various character-, lexical-, and syntactic-level features and their combinations in automatic authorship attribution. Since the majority of text representation features are language specific, we examine their application to texts written in the Croatian language. Our work differs from similar work in at least three aspects. Firstly, we use a slightly different set of features than previously proposed. Secondly, we use four different data sets and compare the same features across those data sets to draw stronger conclusions. The data sets consist of articles, blogs, books, and forum posts written in Croatian. Finally, we employ a classification method based on a strong classifier. We use Support Vector Machines to learn classifiers that achieve excellent results for longer texts: 91% accuracy and F1 measure for blogs, 93% accuracy and F1 for articles, and 99% accuracy and F1 for books. Experiments conducted on forum posts show that more complex features need to be employed for shorter texts.
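To make the idea of combining feature levels concrete, the following is a minimal sketch and not the authors' implementation. It assumes three illustrative feature groups (character trigram frequencies, function-word frequencies, and punctuation frequencies) and concatenates them into one vector per text. The function-word list, the punctuation set, and all function names are hypothetical choices made for this example; the paper's actual feature set is not given in the abstract.

```python
import re
from collections import Counter

# Hypothetical, very small list of Croatian function words (illustration only).
FUNCTION_WORDS = ("i", "u", "je", "se", "na", "da", "za", "od")
# Hypothetical punctuation set used as a crude stand-in for syntactic-level cues.
PUNCTUATION = ".,;:!?"


def tokenize(text):
    """Lowercase the text and return runs of Unicode letters as word tokens."""
    return re.findall(r"[^\W\d_]+", text.lower())


def char_ngram_features(text, vocabulary, n=3):
    """Relative frequency of each character n-gram listed in `vocabulary`."""
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values()) or 1  # avoid division by zero on short texts
    return [grams[g] / total for g in vocabulary]


def function_word_features(text):
    """Relative frequency of each function word among all word tokens."""
    tokens = tokenize(text)
    counts = Counter(tokens)
    total = len(tokens) or 1
    return [counts[w] / total for w in FUNCTION_WORDS]


def punctuation_features(text):
    """Relative frequency of each punctuation mark among all characters."""
    total = len(text) or 1
    return [text.count(p) / total for p in PUNCTUATION]


def combined_features(text, ngram_vocabulary):
    """Concatenate the three feature groups into a single feature vector."""
    return (char_ngram_features(text, ngram_vocabulary)
            + function_word_features(text)
            + punctuation_features(text))


sample = "Ja sam u gradu i on je u gradu."
vector = combined_features(sample, ngram_vocabulary=["gra", " u "])
```

Vectors built this way, one per text, could then be passed to an SVM implementation for training. That step is omitted here because it depends on the library and kernel settings chosen.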




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Reicher, T., Krišto, I., Belša, I., Šilić, A. (2010). Automatic Authorship Attribution for Texts in Croatian Language Using Combinations of Features. In: Setchi, R., Jordanov, I., Howlett, R.J., Jain, L.C. (eds) Knowledge-Based and Intelligent Information and Engineering Systems. KES 2010. Lecture Notes in Computer Science, vol 6277. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-15390-7_3


  • DOI: https://doi.org/10.1007/978-3-642-15390-7_3

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-15389-1

  • Online ISBN: 978-3-642-15390-7

  • eBook Packages: Computer Science, Computer Science (R0)
