A Time Series Model of the Writing Process

  • Conference paper
Machine Learning and Data Mining in Pattern Recognition (MLDM 2016)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9729)

Abstract

The need to handle the huge number of anonymous documents on the Internet motivates the study of new methods for authorship recognition. The principal weakness of the methods used in this area is that they assess the similarity of text styles without regard to the surrounding text. This paper proposes a novel mathematical model of the writing process that aims to quantify this dependency. A text is divided into a sequence of consecutive sub-documents, each represented by a term histogram. The proximity of the histograms is estimated with a simple probability distance. To characterize the writing style, a new feature is introduced: the mean distance between the current sub-document and a number of earlier ones. The empirical distribution of this feature over the whole document specifies the writing style. Thus, dissimilarity between such distributions indicates a difference in writing styles, while their coincidence implies that the styles are identical. Numerical experiments demonstrate the high potential of the proposed approach.
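
The pipeline described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not name the distance, the sub-document length, or the number of earlier sub-documents, so the total variation distance and the parameters `chunk_size` and `lookback` are hypothetical choices, and the two-sample Kolmogorov-Smirnov statistic is only one assumed way of comparing the resulting empirical distributions.

```python
from collections import Counter


def split_chunks(words, size):
    """Split a token list into consecutive chunks of `size` tokens (incomplete tail dropped)."""
    return [words[i:i + size] for i in range(0, len(words) - size + 1, size)]


def histogram(chunk):
    """Relative term frequencies of one sub-document."""
    counts = Counter(chunk)
    total = sum(counts.values())
    return {term: c / total for term, c in counts.items()}


def tv_distance(p, q):
    """Total variation distance between two term histograms (assumed choice of distance)."""
    terms = set(p) | set(q)
    return 0.5 * sum(abs(p.get(t, 0.0) - q.get(t, 0.0)) for t in terms)


def mean_distance_series(text, chunk_size=50, lookback=3):
    """Mean distance from each sub-document to its `lookback` predecessors.

    `chunk_size` and `lookback` are illustrative parameters, not values from the paper.
    """
    words = text.lower().split()
    hists = [histogram(c) for c in split_chunks(words, chunk_size)]
    series = []
    for i in range(lookback, len(hists)):
        previous = hists[i - lookback:i]
        series.append(sum(tv_distance(hists[i], h) for h in previous) / lookback)
    return series


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    points = sorted(set(a) | set(b))

    def ecdf(sample, x):
        return sum(1 for v in sample if v <= x) / len(sample)

    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in points)
```

For two texts long enough to yield non-empty series, `ks_statistic(mean_distance_series(text_a), mean_distance_series(text_b))` near 0 suggests similar distributions of the feature, while values near 1 suggest different ones.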

Author information

Corresponding author

Correspondence to Zeev Volkovich.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Volkovich, Z. (2016). A Time Series Model of the Writing Process. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_10

  • DOI: https://doi.org/10.1007/978-3-319-41920-6_10

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-41919-0

  • Online ISBN: 978-3-319-41920-6

  • eBook Packages: Computer Science, Computer Science (R0)
