Abstract
In this paper, we revisit author identification research by conducting a new kind of large-scale reproducibility study: we select 15 of the most influential papers for author identification and recruit a group of students to reimplement them from scratch. Since no open-source implementations have been released for the selected papers to date, our public release will have a significant impact on researchers entering the field. This way, we lay the groundwork for integrating author identification with information retrieval to eventually scale the former to the web. Furthermore, we assess the reproducibility of all reimplemented papers in detail, and conduct the first comparative evaluation of all approaches on three well-known corpora.
Notes
1. Interestingly, Collberg et al.’s study itself has been challenged for lack of rigor and has been reproduced more thoroughly: http://cs.brown.edu/~sk/Memos/Examining-Reproducibility/.
2. Materials and code of this study are available at www.uni-weimar.de/medien/webis/publications; the latest versions of the code are in the study's GitHub repositories at www.github.com/pan-webis-de (for a convenient overview, see www.github.com/search?q=ECIR+2016+user:pan-webis-de).
3. See the repository of the reimplementation of Seroussi et al.’s approach to follow up on this.
References
Argamon, S., Juola, P.: Overview of the international authorship identification competition at PAN. In: CLEF 2011 Notebooks (2011)
Arguello, J., Diaz, F., Lin, J., Trotman, A.: RIGOR @ SIGIR (2015)
Armstrong, T.G., Moffat, A., Webber, W., Zobel, J.: Improvements that don’t add up: ad-hoc retrieval results since 1998. In: CIKM 2009, pp. 601–610 (2009)
Arun, R., Suresh, V., Veni Madhavan, C.E.: Stopword graphs and authorship attribution in text corpora. In: ICSC, pp. 192–196 (2009)
Benedetto, D., Caglioti, E., Loreto, V.: Language trees and zipping. Phys. Rev. Lett. 88, 048702 (2002)
Blei, D.M., Ng, A.Y., Jordan, M.I.: Latent Dirichlet allocation. J. Mach. Learn. Res. 3, 993–1022 (2003)
Burrows, J.: Delta: a measure of stylistic difference and a guide to likely authorship. Lit. Ling. Comp. 17(3), 267–287 (2002)
Chang, C.-C., Lin, C.-J.: LIBSVM: a library for support vector machines. ACM TIST 2, 27:1–27:27 (2011)
Collberg, C., Proebsting, T., Warren, A.M.: Repeatability and benefaction in computer systems research: a study and a modest proposal. TR 14–04, University of Arizona (2015)
de Vel, O., Anderson, A., Corney, M., Mohay, G.: Mining e-mail content for author identification forensics. SIGMOD Rec. 30(4), 55–64 (2001)
Di Buccio, E., Di Nunzio, G.M., Ferro, N., Harman, D., Maistro, M., Silvello, G.: Unfolding off-the-shelf IR systems for reproducibility. In: RIGOR @ SIGIR (2015)
Escalante, H.J., Solorio, T., Montes-y-Gómez, M.: Local histograms of character n-grams for authorship attribution. In: HLT 2011, pp. 288–298 (2011)
Ferro, N., Silvello, G.: Rank-biased precision reloaded: reproducibility and generalization. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 768–780. Springer, Heidelberg (2015)
Gamon, M.: Linguistic correlates of style: authorship classification with deep linguistic analysis features. In: COLING (2004)
Hagen, M., Potthast, M., Büchner, M., Stein, B.: Twitter sentiment detection via ensemble classification using averaged confidence scores. In: Hanbury, A., Kazai, G., Rauber, A., Fuhr, N. (eds.) ECIR 2015. LNCS, vol. 9022, pp. 741–754. Springer, Heidelberg (2015)
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., Witten, I.H.: The WEKA data mining software: an update. SIGKDD Explor. 11(1), 10–18 (2009)
Hanbury, A., Kazai, G., Rauber, A., Fuhr, N.: Proceedings of ECIR (2015)
Holmes, D.I.: The evolution of stylometry in humanities scholarship. Lit. Ling. Comp. 13(3), 111–117 (1998)
Hopfgartner, F., Hanbury, A., Müller, H., Kando, N., Mercer, S., Kalpathy-Cramer, J., Potthast, M., Gollub, T., Krithara, A., Lin, J., Balog, K., Eggel, I.: Report on the Evaluation-as-a-Service (EaaS) expert workshop. SIGIR Forum 49(1), 57–65 (2015)
Juola, P.: Authorship attribution. FnTIR 1, 234–334 (2008)
Juola, P.: An overview of the traditional authorship attribution subtask. In: CLEF Notebooks (2012)
Keselj, V., Peng, F., Cercone, N., Thomas, C.: N-gram-based author profiles for authorship attribution. In: PACLING 2003, pp. 255–264 (2003)
Khmelev, D.V., Teahan, W.J.: A repetition based measure for verification of text collections and for text categorization. In: SIGIR 2003, pp. 104–110 (2003)
Koppel, M., Schler, J., Bonchek-Dokow, E.: Measuring differentiability: unmasking pseudonymous authors. J. Mach. Learn. Res. 8, 1261–1276 (2007)
Koppel, M., Schler, J., Argamon, S.: Authorship attribution in the wild. LRE 45(1), 83–94 (2011)
Lin, J.: The open-source information retrieval reproducibility challenge. In: RIGOR @ SIGIR (2015)
Mendenhall, T.C.: The characteristic curves of composition. Science ns–9(214S), 237–246 (1887)
Ounis, I., Amati, G., Plachouras, V., He, B., Macdonald, C., Lioma, C.: Terrier: a high performance and scalable information retrieval platform. In: OSIR @ SIGIR (2006)
Peng, F., Schuurmans, D., Wang, S.: Augmenting naive Bayes classifiers with statistical language models. Inf. Retr. 7(3–4), 317–345 (2004)
Rangel, F., Rosso, P., Celli, F., Potthast, M., Stein, B., Daelemans, W.: Overview of the 3rd author profiling task at PAN. In: CLEF 2015 Notebooks (2015)
Rudman, J.: The state of authorship attribution studies: some problems and solutions. Comput. Humanit. 31(4), 351–365 (1997)
Seroussi, Y., Bohnert, F., Zukerman, I.: Authorship attribution with author-aware topic models. In: ACL 2012, pp. 264–269 (2012)
Sidorov, G., Velasquez, F., Stamatatos, E., Gelbukh, A., Chanona-Hernández, L.: Syntactic n-grams as machine learning features for natural language processing. Expert Syst. Appl. 41(3), 853–860 (2014)
Stamatatos, E.: Authorship attribution based on feature set subspacing ensembles. Int. J. Artif. Intell. Tools 15(5), 823–838 (2006)
Stamatatos, E.: Author identification using imbalanced and limited training texts. In: DEXA 2007, pp. 237–241 (2007)
Stamatatos, E.: A survey of modern authorship attribution methods. JASIST 60, 538–556 (2009)
Stamatatos, E., Fakotakis, N., Kokkinakis, G.: Automatic text categorization in terms of genre and author. Comput. Linguist. 26(4), 471–495 (2000)
Stamatatos, E., Daelemans, W., Verhoeven, B., Stein, B., Potthast, M., Juola, P., Sánchez-Pérez, M.A., Barrón-Cedeño, A.: Overview of the author identification task at PAN. In: CLEF 2014 Notebooks (2014)
Stodden, V.: The scientific method in practice: reproducibility in the computational sciences. MIT Sloan Research Paper No. 4773–10 (2010)
Tax, N., Bockting, S., Hiemstra, D.: A cross-benchmark comparison of 87 learning to rank methods. IPM 51(6), 757–772 (2015)
Teahan, W.J., Harper, D.J.: Using compression-based language models for text categorization. In: Language Modeling for Information Retrieval, pp. 141–165 (2003)
van Halteren, H.: Linguistic profiling for author recognition and verification. In: ACL 2004, pp. 199–206 (2004)
Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing-style features and classification techniques. JASIST 57(3), 378–393 (2006)
Acknowledgements
This study was supported by the German National Academic Foundation (German: Studienstiftung des deutschen Volkes). The foundation helped to recruit students among its scholars and organized our auditing workshop as part of its 2015 summer academy in La Colle-sur-Loup, France. We thank the foundation for their generous support. Our special thanks go to Dorothea Trebesius, Matthias Frenz, and Martina Rothmann-Stang who provided for our every need at the workshop.
Copyright information
© 2016 Springer International Publishing Switzerland
About this paper
Cite this paper
Potthast, M. et al. (2016). Who Wrote the Web? Revisiting Influential Author Identification Research Applicable to Information Retrieval. In: Ferro, N., et al. Advances in Information Retrieval. ECIR 2016. Lecture Notes in Computer Science, vol 9626. Springer, Cham. https://doi.org/10.1007/978-3-319-30671-1_29
DOI: https://doi.org/10.1007/978-3-319-30671-1_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-30670-4
Online ISBN: 978-3-319-30671-1