Abstract
This paper describes and evaluates an unsupervised author clustering model called Spatium. The proposed strategy adapts readily to different natural languages (such as Dutch, English, and Greek) and can be applied to different text genres (newspaper articles, reviews, excerpts of novels, etc.). As features, we suggest using the m most frequent terms of each text (isolated words and punctuation symbols, with m set to at most 200). Applying a distance measure, we determine whether there is enough evidence that two texts were written by the same author. The evaluations are based on six test collections (PAN Author Clustering task at CLEF 2016). A more detailed analysis shows the strengths of our approach, indicates its problems, and provides reasons for some of the potential failures of the Spatium model.
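The feature selection and distance comparison outlined in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not name the distance measure or the decision rule, so the L1 (Manhattan) distance over relative frequencies, the fixed `threshold` parameter, and all function names (`tokenize`, `top_terms`, `l1_distance`, `same_author`) are assumptions made for illustration only.

```python
import re
from collections import Counter

# Assumption: a term is either a run of word characters or a single
# punctuation symbol; the abstract does not specify the tokenizer.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Split lowercased text into isolated words and punctuation symbols."""
    return TOKEN_RE.findall(text.lower())


def top_terms(tokens, m=200):
    """Return the m most frequent terms of one text (m at most 200)."""
    return [term for term, _ in Counter(tokens).most_common(m)]


def relative_freqs(tokens, terms):
    """Relative frequency of each selected term within one text."""
    counts = Counter(tokens)
    total = len(tokens)
    if total == 0:
        return {t: 0.0 for t in terms}
    return {t: counts[t] / total for t in terms}


def l1_distance(text_a, text_b, m=200):
    """Assumed L1 distance, computed over the top-m terms of text_a.

    Because the terms come from text_a only, this sketch is not symmetric.
    """
    tok_a, tok_b = tokenize(text_a), tokenize(text_b)
    terms = top_terms(tok_a, m)
    freq_a = relative_freqs(tok_a, terms)
    freq_b = relative_freqs(tok_b, terms)
    return sum(abs(freq_a[t] - freq_b[t]) for t in terms)


def same_author(text_a, text_b, threshold, m=200):
    """Hypothetical decision rule: a small distance counts as enough evidence."""
    return l1_distance(text_a, text_b, m) < threshold
```

In this sketch the threshold is passed in as a fixed number; how a suitable value would be chosen for a given collection is left open here.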
Acknowledgments
The authors want to thank the task coordinators for their valuable effort to promote test collections in authorship attribution. This research was supported, in part, by the NSF under Grant #200021_149665/1.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Kocher, M., Savoy, J. (2017). Author Clustering with an Adaptive Threshold. In: Jones, G., et al. Experimental IR Meets Multilinguality, Multimodality, and Interaction. CLEF 2017. Lecture Notes in Computer Science, vol 10456. Springer, Cham. https://doi.org/10.1007/978-3-319-65813-1_19
DOI: https://doi.org/10.1007/978-3-319-65813-1_19
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-65812-4
Online ISBN: 978-3-319-65813-1
eBook Packages: Computer Science, Computer Science (R0)