A Time Series Model of the Writing Process

Volkovich, Zeev

doi:10.1007/978-3-319-41920-6_10

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9729)

Included in the following conference series:

International Conference on Machine Learning and Data Mining in Pattern Recognition

Abstract

The need to handle the vast number of anonymous documents on the Internet motivates the study of new methods for authorship recognition. The principal weakness of existing methods in this area is that they assess the similarity of text styles without regard to the surrounding context. This paper proposes a novel mathematical model of the writing process that aims to quantify this dependency. A text is divided into a series of sequential sub-documents, each represented by a term histogram. The proximity of histograms is estimated with a simple probability distance. To characterize the writing style, a new feature is introduced: the mean distance between the current sub-document and a number of preceding ones. The empirical distribution of this feature over the whole document specifies the writing style. Dissimilarity between such distributions therefore indicates a difference in writing styles, and their coincidence implies that the styles are identical. Numerical experiments demonstrate the high potential of the proposed approach.
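
The pipeline outlined in the abstract can be illustrated with a minimal Python sketch. It is not the author's implementation, and several details are assumptions, because the abstract does not specify them: the tokenizer, the chunk size, the number of preceding sub-documents (`lag`), the use of half the L1 distance between normalized histograms as the "simple probability distance", and the two-sample Kolmogorov-Smirnov statistic as the measure of dissimilarity between the empirical distributions. All function names are hypothetical.

```python
from bisect import bisect_right
from collections import Counter
import re


def chunk_histograms(text, chunk_size=50):
    """Split a text into consecutive sub-documents of `chunk_size` tokens
    and return one term histogram (Counter) per sub-document.
    Tokenizer and chunk size are illustrative assumptions."""
    tokens = re.findall(r"[a-z']+", text.lower())
    last_start = len(tokens) - chunk_size + 1
    return [Counter(tokens[i:i + chunk_size])
            for i in range(0, last_start, chunk_size)]


def l1_distance(h1, h2):
    """Half the L1 distance between two normalized histograms
    (an assumed choice of distance; values lie in [0, 1])."""
    n1, n2 = sum(h1.values()), sum(h2.values())
    terms = set(h1) | set(h2)
    return 0.5 * sum(abs(h1[t] / n1 - h2[t] / n2) for t in terms)


def mean_distance_to_precursors(hists, lag=3):
    """For every sub-document with at least `lag` predecessors, return
    the mean distance to its `lag` immediately preceding sub-documents."""
    return [
        sum(l1_distance(hists[i], hists[j]) for j in range(i - lag, i)) / lag
        for i in range(lag, len(hists))
    ]


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical distribution functions of samples a and b."""
    a, b = sorted(a), sorted(b)
    points = sorted(set(a) | set(b))
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in points)
```

Under these assumptions, two documents would be compared by computing `mean_distance_to_precursors(chunk_histograms(text))` for each and passing the two resulting samples to `ks_statistic`; a small value suggests similar styles, a large one suggests different styles.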

Author information

Authors and Affiliations

Software Engineering Department, ORT Braude College of Engineering, 21982, Karmiel, Israel
Zeev Volkovich

Corresponding author

Correspondence to Zeev Volkovich.

Editor information

Editors and Affiliations

IBaI, Inst of Comp Vision and applied Comp Sci, Leipzig, Sachsen, Germany
Petra Perner

About this paper

Cite this paper

Volkovich, Z. (2016). A Time Series Model of the Writing Process. In: Perner, P. (eds) Machine Learning and Data Mining in Pattern Recognition. MLDM 2016. Lecture Notes in Computer Science, vol 9729. Springer, Cham. https://doi.org/10.1007/978-3-319-41920-6_10

DOI: https://doi.org/10.1007/978-3-319-41920-6_10
Published: 28 June 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-41919-0
Online ISBN: 978-3-319-41920-6
eBook Packages: Computer Science, Computer Science (R0)