Automatically Determining an Anonymous Author’s Native Language

  • Conference paper
Intelligence and Security Informatics (ISI 2005)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 3495)

Abstract

Text authored by an unidentified assailant can offer valuable clues to the assailant’s identity. In this paper, we show that stylistic text features can be exploited to determine an anonymous author’s native language with high accuracy.

Copyright information

© 2005 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Koppel, M., Schler, J., Zigdon, K. (2005). Automatically Determining an Anonymous Author’s Native Language. In: Kantor, P., et al. Intelligence and Security Informatics. ISI 2005. Lecture Notes in Computer Science, vol 3495. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11427995_17

  • DOI: https://doi.org/10.1007/11427995_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-25999-2

  • Online ISBN: 978-3-540-32063-0

  • eBook Packages: Computer Science, Computer Science (R0)
