research-article

Authorship Attribution Based on Specific Vocabulary

Author:

Jacques Savoy

ACM Transactions on Information Systems (TOIS), Volume 30, Issue 2

Article No.: 12, Pages 1 - 30

https://doi.org/10.1145/2180868.2180874

Published: 01 May 2012

Abstract

In this article, we propose a technique for computing a standardized Z score that identifies the specific vocabulary of a text (or part thereof) relative to that of an entire corpus. Assuming that term occurrences follow a binomial distribution, we apply this method to weight terms (words and punctuation symbols in the current study), yielding a representation of the lexical specificity of the underlying text. In a final stage, to define an author profile, we suggest averaging these text representations and then combining the profiles with a distance measure to derive a simple and efficient authorship attribution scheme. To evaluate this algorithm and demonstrate its effectiveness, we conduct two experiments, the first based on 5,408 newspaper articles (Glasgow Herald) written in English by 20 distinct authors and the second on 4,326 newspaper articles (La Stampa) written in Italian by 20 distinct authors. These experiments demonstrate that the suggested classification scheme tends to perform better than the Delta rule method based on the most frequent words, the chi-square distance based on word profiles and punctuation marks, the KLD scheme based on a predefined set of words, and the naïve Bayes approach.
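The pipeline summarized in the abstract (Z-score term weighting, averaged author profiles, nearest profile by distance) can be illustrated with a minimal Python sketch. It rests on several assumptions that the abstract does not confirm: the Z score is taken as the usual standardized binomial form (observed count minus expected count, divided by the binomial standard deviation), the occurrence probability of a term is estimated from its relative frequency in the whole corpus, and the distance is a mean squared difference chosen purely for illustration. All function names (`z_scores`, `author_profile`, `distance`, `attribute`) are hypothetical.

```python
import math
from collections import Counter


def z_scores(text_tokens, corpus_counts, corpus_size, vocabulary):
    """Standardized Z score of each term in one text relative to the corpus.

    Assumption: a term occurs in a text of n tokens following a binomial
    distribution with probability p estimated from the whole corpus.
    """
    tf = Counter(text_tokens)
    n = len(text_tokens)
    scores = {}
    for term in vocabulary:
        p = corpus_counts[term] / corpus_size      # estimated occurrence probability
        expected = n * p                           # binomial mean
        sd = math.sqrt(n * p * (1.0 - p))          # binomial standard deviation
        # A term with zero variance carries no information; score it as 0.
        scores[term] = (tf[term] - expected) / sd if sd > 0 else 0.0
    return scores


def author_profile(text_score_list, vocabulary):
    """Average the Z-score representations of all texts by one author."""
    count = len(text_score_list)
    return {t: sum(s[t] for s in text_score_list) / count for t in vocabulary}


def distance(scores_a, scores_b, vocabulary):
    """Mean squared difference between two Z-score vectors (illustrative choice)."""
    return sum((scores_a[t] - scores_b[t]) ** 2 for t in vocabulary) / len(vocabulary)


def attribute(query_scores, profiles, vocabulary):
    """Return the author whose profile is closest to the query text."""
    return min(profiles, key=lambda a: distance(query_scores, profiles[a], vocabulary))
```

Here `corpus_counts` is expected to be a `collections.Counter` over all tokens in the corpus, so that terms absent from the corpus count as zero. A positive score marks a term the text overuses relative to the corpus, and a negative score marks one it underuses.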

Published In

ACM Transactions on Information Systems Volume 30, Issue 2

May 2012

245 pages

ISSN:1046-8188

EISSN:1558-2868

DOI:10.1145/2180868

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2012

Accepted: 01 March 2012

Revised: 01 November 2011

Received: 01 July 2011

Published in TOIS Volume 30, Issue 2

Qualifiers

Research-article
Research
Refereed
