Authorship attribution in the wild

Language Resources and Evaluation

Abstract

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. Moreover, we show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
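
The general idea of combining a similarity measure with multiple randomized feature sets can be illustrated with a minimal sketch. The abstract does not specify the details, so everything concrete here is an assumption made for illustration: the character 4-gram features, the cosine similarity, the function names `char_ngrams`, `cosine` and `attribute`, and the default values of `iterations`, `fraction` and `threshold`. The score returned is the share of random feature subsets on which the same candidate is the closest match; when it falls below the threshold, the sketch declines to attribute, which is one way to handle an author who may be missing from the candidate set.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (an assumed feature type)."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous, known, iterations=100, fraction=0.4,
              threshold=0.9, seed=0):
    """Return (author or None, score) for an anonymous text.

    `known` maps each candidate author to their known text. All default
    parameter values are illustrative assumptions, not taken from the paper.
    """
    rng = random.Random(seed)
    anon_vec = char_ngrams(anonymous)
    known_vecs = {author: char_ngrams(text) for author, text in known.items()}
    vocabulary = sorted(set(anon_vec).union(*known_vecs.values()))
    if not known_vecs or not vocabulary:
        return None, 0.0

    sample_size = max(1, int(fraction * len(vocabulary)))
    wins = Counter()
    for _ in range(iterations):
        # Each iteration compares the texts on a different random feature subset.
        subset = rng.sample(vocabulary, sample_size)
        best = max(known_vecs,
                   key=lambda a: cosine(anon_vec, known_vecs[a], subset))
        wins[best] += 1

    author, count = wins.most_common(1)[0]
    score = count / iterations
    # Decline to attribute when no candidate wins consistently enough.
    return (author if score >= threshold else None), score
```

Raising `threshold` in this sketch trades coverage for precision: fewer anonymous texts receive an attribution, but those that do are supported by agreement across more feature subsets.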

Notes

  1. For purposes of clarity, we note the following: Recall, as discussed in Sect. 6, is simply H*P, the product of coverage and precision. Furthermore, results shown in Sect. 6 refer to those at score σ or above, while results shown in this section refer to those at a given score or in a given score interval.
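
The identity in this note can be checked with simple counts. The sketch below assumes the usual reading of the terms, which the note itself does not spell out: coverage H is the fraction of documents for which an attribution is made, precision P is the fraction of those attributions that are correct, and recall is the fraction of all documents that are correctly attributed. The function name `coverage_precision_recall` and the numbers in the usage note are hypothetical.

```python
def coverage_precision_recall(total, attempted, correct):
    """Compute coverage H, precision P, and recall from raw counts.

    total:     number of anonymous documents considered
    attempted: number for which some attribution was made
    correct:   number of those attributions that were right
    """
    coverage = attempted / total
    precision = correct / attempted if attempted else 0.0
    recall = correct / total
    return coverage, precision, recall
```

For example, with 1,000 documents, 500 attributions and 450 correct ones, coverage is 0.5, precision is 0.9, and recall is 0.45, which equals H*P.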

Correspondence to Moshe Koppel.

Cite this article

Koppel, M., Schler, J. & Argamon, S. Authorship attribution in the wild. Lang Resources & Evaluation 45, 83–94 (2011). https://doi.org/10.1007/s10579-009-9111-2

  • DOI: https://doi.org/10.1007/s10579-009-9111-2
