Authorship Attribution Based on Specific Vocabulary

Published: 01 May 2012

Abstract

In this article we propose a technique for computing a standardized Z score capable of defining the specific vocabulary found in a text (or part thereof) compared to that of an entire corpus. Assuming that the term occurrence follows a binomial distribution, this method is then applied to weight terms (words and punctuation symbols in the current study), representing the lexical specificity of the underlying text. In a final stage, to define an author profile we suggest averaging these text representations and then applying them along with a distance measure to derive a simple and efficient authorship attribution scheme. To evaluate this algorithm and demonstrate its effectiveness, we develop two experiments, the first based on 5,408 newspaper articles (Glasgow Herald) written in English by 20 distinct authors and the second on 4,326 newspaper articles (La Stampa) written in Italian by 20 distinct authors. These experiments demonstrate that the suggested classification scheme tends to perform better than the Delta rule method based on the most frequent words, better than the chi-square distance based on word profiles and punctuation marks, better than the KLD scheme based on a predefined set of words, and better than the naïve Bayes approach.
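The pipeline summarized in the abstract can be sketched as follows. This is a minimal illustration, not the article's implementation: the abstract does not state the formula, so the Z score here is assumed to be the usual standardization of a binomial count, (tf - n·p) / sqrt(n·p·(1 - p)), with p estimated from the term's relative frequency in the whole corpus. The Euclidean distance and all function names (`z_scores`, `author_profile`, `distance`, `attribute`) are likewise assumptions made for the example.

```python
import math
from collections import Counter


def z_scores(text_tokens, corpus_counts, corpus_size):
    """Z score of every corpus term for one text, under a binomial model.

    Assumption: p is the term's relative frequency in the whole corpus,
    and a text of n tokens is treated as n independent draws.
    """
    n = len(text_tokens)
    tf = Counter(text_tokens)
    scores = {}
    for term, cf in corpus_counts.items():
        p = cf / corpus_size                 # estimated occurrence probability
        sd = math.sqrt(n * p * (1.0 - p))    # binomial standard deviation
        if sd > 0:
            scores[term] = (tf[term] - n * p) / sd
    return scores


def author_profile(text_vectors):
    """Average the Z-score vectors of the texts written by one author."""
    terms = set().union(*text_vectors)
    return {
        t: sum(v.get(t, 0.0) for v in text_vectors) / len(text_vectors)
        for t in terms
    }


def distance(u, v):
    """Euclidean distance between two sparse vectors (an assumed choice)."""
    terms = set(u) | set(v)
    return math.sqrt(sum((u.get(t, 0.0) - v.get(t, 0.0)) ** 2 for t in terms))


def attribute(query_vector, profiles):
    """Return the author whose profile is closest to the query text."""
    return min(profiles, key=lambda a: distance(query_vector, profiles[a]))


# Toy usage: tokens are words and punctuation symbols, as in the study.
texts = {
    "A": ["the cat sat , the cat ran".split()],
    "B": ["a dog barked ; a dog slept".split()],
}
corpus = [tok for docs in texts.values() for doc in docs for tok in doc]
counts, size = Counter(corpus), len(corpus)
profiles = {
    author: author_profile([z_scores(doc, counts, size) for doc in docs])
    for author, docs in texts.items()
}
query = z_scores("the cat slept , the cat".split(), counts, size)
print(attribute(query, profiles))
```

A positive score marks a term the text overuses relative to the corpus, a negative score a term it underuses. Averaging the per-text vectors gives one profile per author, and a disputed text is assigned to the nearest profile.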


Published In

ACM Transactions on Information Systems  Volume 30, Issue 2
May 2012
245 pages
ISSN:1046-8188
EISSN:1558-2868
DOI:10.1145/2180868

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 May 2012
Accepted: 01 March 2012
Revised: 01 November 2011
Received: 01 July 2011
Published in TOIS Volume 30, Issue 2

Author Tags

  1. Authorship attribution
  2. lexical statistics
  3. text classification

Qualifiers

  • Research-article
  • Research
  • Refereed


