Authorship attribution in the wild

Koppel, Moshe; Schler, Jonathan; Argamon, Shlomo

doi:10.1007/s10579-009-9111-2

Published: 13 January 2010

Language Resources and Evaluation, Volume 45, pages 83–94 (2011)

Moshe Koppel¹,
Jonathan Schler¹ &
Shlomo Argamon²

Abstract

Most previous work on authorship attribution has focused on the case in which we need to attribute an anonymous document to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, we can use similarity-based methods along with multiple randomized feature sets to achieve high precision. We also show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
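
The general idea of combining a similarity measure with multiple randomized feature sets can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the abstract does not specify the features, the similarity measure, or any parameter values, so the character 4-grams, the cosine similarity, and the names `char_ngrams`, `cosine`, `attribute`, `iterations`, `fraction`, and `threshold` are all assumptions. The score returned here (the share of random feature subsets on which the same candidate is the nearest match) stands in for the robustness score mentioned above, and an attribution is withheld when it falls below the threshold.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def attribute(anonymous, known_texts, iterations=100, fraction=0.4,
              threshold=0.9, seed=0):
    """Return (author or None, score) for an anonymous text.

    known_texts maps each candidate author to that author's known text.
    All default parameter values are hypothetical.
    """
    rng = random.Random(seed)
    anon_vec = char_ngrams(anonymous)
    known_vecs = {a: char_ngrams(t) for a, t in known_texts.items()}
    vocabulary = sorted(set(anon_vec).union(*known_vecs.values()))
    if not known_vecs or not vocabulary:
        return None, 0.0

    subset_size = max(1, int(fraction * len(vocabulary)))
    wins = Counter()
    for _ in range(iterations):
        # Draw a random feature subset and find the nearest candidate on it.
        subset = rng.sample(vocabulary, subset_size)
        best = max(known_vecs,
                   key=lambda a: cosine(anon_vec, known_vecs[a], subset))
        wins[best] += 1

    author, count = wins.most_common(1)[0]
    score = count / iterations
    # Answer "don't know" (None) when the top candidate is not robust enough.
    return (author if score >= threshold else None), score


known = {
    "alice": "the quick brown fox jumps over the lazy dog " * 5,
    "bob": "lorem ipsum dolor sit amet consectetur adipiscing elit " * 5,
}
author, score = attribute("the lazy dog jumps over the quick brown fox", known)
print(author, score)
```

Returning `None` below the threshold reflects the open-set setting described in the abstract, where the actual author might not be among the candidates at all.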

Notes

For purposes of clarity, we note the following: Recall, as discussed in Sect. 6, is simply H*P, the product of coverage and precision. Furthermore, results shown in Sect. 6 refer to those at score σ or above, while results shown in this section refer to those at a given score or in a given score interval.
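
As a small worked illustration of the relation recall = H*P, the sketch below computes coverage, precision, and recall for attributions at score σ or above. It rests on assumptions not stated in this note: coverage H is taken to be the fraction of anonymous texts for which an attribution is made at all, precision P the fraction of those attributions that are correct, and the function name `coverage_precision_recall` and the sample data are hypothetical.

```python
def coverage_precision_recall(results, sigma):
    """Compute (H, P, recall) for attributions with score >= sigma.

    results is a list of (score, predicted_author, true_author) tuples,
    one per anonymous text.
    """
    total = len(results)
    if total == 0:
        return 0.0, 0.0, 0.0
    # Only attributions at score sigma or above are actually made.
    attributed = [(p, t) for s, p, t in results if s >= sigma]
    coverage = len(attributed) / total
    correct = sum(1 for p, t in attributed if p == t)
    precision = correct / len(attributed) if attributed else 0.0
    # Recall is the product of coverage and precision.
    return coverage, precision, coverage * precision


# Made-up example data, for illustration only.
sample = [
    (0.95, "a", "a"),
    (0.92, "b", "a"),
    (0.50, "c", "c"),
    (0.30, "a", "b"),
]
h, p, r = coverage_precision_recall(sample, sigma=0.9)
print(h, p, r)
```

Under these definitions the product H*P equals the number of correct attributions divided by the total number of anonymous texts, which is why it can be read as recall.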

Author information

Authors and Affiliations

Bar-Ilan University, Ramat-Gan, Israel
Moshe Koppel & Jonathan Schler
Illinois Institute of Technology, Chicago, IL, USA
Shlomo Argamon

Corresponding author

Correspondence to Moshe Koppel.

About this article

Koppel, M., Schler, J. & Argamon, S. Authorship attribution in the wild. Lang Resources & Evaluation 45, 83–94 (2011). https://doi.org/10.1007/s10579-009-9111-2

Published: 13 January 2010
Issue Date: March 2011
DOI: https://doi.org/10.1007/s10579-009-9111-2
