Abstract
Most previous work on authorship attribution has focused on the case in which an anonymous document must be attributed to one of a small set of candidate authors. In this paper, we consider authorship attribution as found in the wild: the set of known candidates is extremely large (possibly many thousands) and might not even include the actual author. Moreover, both the known texts and the anonymous texts might be of limited length. We show that even in these difficult cases, similarity-based methods used together with multiple randomized feature sets achieve high precision. We also show the precise relationship between attribution precision and four parameters: the size of the candidate set, the quantity of known text by the candidates, the length of the anonymous text, and a certain robustness score associated with an attribution.
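To make the idea of similarity-based attribution over multiple randomized feature sets concrete, the following is a minimal Python sketch and not the authors' implementation. It makes several assumptions that the abstract does not state: features are character 4-gram counts, similarity is cosine similarity, the robustness score is the fraction of random feature subsets on which the winning candidate is the closest match, and an attribution is made only when that score reaches a threshold. The function names and the parameters `iterations`, `fraction`, and `threshold` are hypothetical.

```python
import math
import random
from collections import Counter


def char_ngrams(text, n=4):
    """Count overlapping character n-grams (an assumed feature type)."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b, features):
    """Cosine similarity of two count vectors restricted to `features`."""
    dot = sum(a[f] * b[f] for f in features)
    norm_a = math.sqrt(sum(a[f] ** 2 for f in features))
    norm_b = math.sqrt(sum(b[f] ** 2 for f in features))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def attribute(anonymous, known_texts, iterations=100, fraction=0.4,
              threshold=0.9, seed=0):
    """Return (author or None, score).

    `known_texts` maps each candidate author to that author's known text.
    The score is the share of random feature subsets on which the winning
    author was the most similar candidate (an assumed robustness score).
    None means "no attribution": the score fell below `threshold`.
    """
    rng = random.Random(seed)
    anon_vec = char_ngrams(anonymous)
    author_vecs = {a: char_ngrams(t) for a, t in known_texts.items()}
    vocabulary = sorted(set(anon_vec).union(*author_vecs.values()))
    if not vocabulary or not author_vecs:
        return None, 0.0

    sample_size = max(1, int(fraction * len(vocabulary)))
    wins = Counter()
    for _ in range(iterations):
        # Each iteration sees only a random subset of the feature set.
        subset = rng.sample(vocabulary, sample_size)
        best = max(author_vecs,
                   key=lambda a: cosine(anon_vec, author_vecs[a], subset))
        wins[best] += 1

    author, count = wins.most_common(1)[0]
    score = count / iterations
    return (author if score >= threshold else None), score
```

Under these assumptions, raising `threshold` trades coverage for precision: fewer anonymous texts receive an attribution, but the attributions that are made have been stable across more of the random feature subsets.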
Notes
For clarity, we note the following. Recall, as discussed in Sect. 6, is simply H × P, the product of coverage (H) and precision (P). Furthermore, the results shown in Sect. 6 refer to attributions with score σ or above, while the results shown in this section refer to attributions at a given score or within a given score interval.
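The relationship between recall, coverage, and precision can be checked with a short Python sketch. It assumes that coverage is the fraction of documents for which any attribution is made, that precision is the fraction of those attributions that are correct, and that every document's true author is among the candidates. The function name and the sample data are hypothetical and are not results from the paper.

```python
def coverage_precision_recall(predictions, truths):
    """Compute (coverage, precision, recall) for a set of attributions.

    `predictions` holds an author name per document, or None when no
    attribution is made; `truths` holds the true author per document.
    """
    total = len(truths)
    attempted = [(p, t) for p, t in zip(predictions, truths) if p is not None]
    correct = sum(1 for p, t in attempted if p == t)
    coverage = len(attempted) / total
    precision = correct / len(attempted) if attempted else 0.0
    recall = correct / total
    return coverage, precision, recall


# Hypothetical example: 5 documents, 3 attributions made, 2 of them correct.
predictions = ["a", None, "b", "c", None]
truths = ["a", "b", "b", "a", "c"]
coverage, precision, recall = coverage_precision_recall(predictions, truths)
```

In this hypothetical example coverage is 3/5 and precision is 2/3, so their product, 2/5, equals the recall computed directly as correct attributions over all documents.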
Cite this article
Koppel, M., Schler, J. & Argamon, S. Authorship attribution in the wild. Lang Resources & Evaluation 45, 83–94 (2011). https://doi.org/10.1007/s10579-009-9111-2
DOI: https://doi.org/10.1007/s10579-009-9111-2