Abstract
Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Its methods enable visualization of relational data as graphs and yield descriptive characteristics and predictive graph models. This paper applies network data analysis to the authorship attribution problem. Specifically, we show how representing a text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. Analysis of these models, along with word graph characteristics, provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS to be connected by an edge in the word network, reveals regular structural properties of English grammar.
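The nominal assortativity statistic mentioned in the abstract can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes the standard mixing-matrix form of the assortativity coefficient for a categorical attribute, r = (Σ e_ii − Σ a_i b_i) / (1 − Σ a_i b_i), treats each distinct lowercased word as one vertex with a single POS tag, and uses a tiny hand-tagged corpus. The function names, the tag set, and the example sentences are all hypothetical.

```python
from collections import defaultdict

def word_graph_edges(sentences):
    """Return the set of directed edges (w1, w2) linking adjacent words."""
    edges = set()
    for sent in sentences:
        words = [w.lower() for w, _ in sent]
        edges.update(zip(words, words[1:]))
    return edges

def nominal_assortativity(edges, label):
    """Assortativity coefficient for a categorical vertex attribute.

    e[(i, j)] is the fraction of edges running from category i to category j;
    a[i] and b[i] are the row and column sums of that mixing matrix.
    """
    m = len(edges)
    e = defaultdict(float)
    a = defaultdict(float)
    b = defaultdict(float)
    for u, v in edges:
        i, j = label[u], label[v]
        e[(i, j)] += 1.0 / m
        a[i] += 1.0 / m
        b[j] += 1.0 / m
    categories = set(a) | set(b)
    trace = sum(e.get((c, c), 0.0) for c in categories)
    expected = sum(a.get(c, 0.0) * b.get(c, 0.0) for c in categories)
    if expected == 1.0:
        raise ValueError("coefficient is undefined when all edges share one category")
    return (trace - expected) / (1.0 - expected)

# Hypothetical hand-tagged toy corpus (DT = determiner, NN = noun, VB = verb).
sentences = [
    [("The", "DT"), ("dog", "NN"), ("chased", "VB"), ("the", "DT"), ("cat", "NN")],
    [("A", "DT"), ("cat", "NN"), ("saw", "VB"), ("a", "DT"), ("bird", "NN")],
]
pos_of = {w.lower(): pos for sent in sentences for w, pos in sent}
edges = word_graph_edges(sentences)
r = nominal_assortativity(edges, pos_of)
print(f"{len(edges)} edges, nominal assortativity r = {r:.4f}")
```

In this toy corpus no edge joins two words with the same tag, so the coefficient is negative. A value near 1 would instead indicate that words tend to be adjacent to words of their own POS.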
Notes
1. To calculate vertex degree, words that end a sentence must be connected to a dummy end vertex.
2. Excluding symbols and list item markers.
3. Consider the French use of the articles le and la.
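The role of the dummy end vertex in Note 1 can be sketched as follows. This is an illustration under an assumed reading of the note, namely that the word graph is a directed multigraph in which each token contributes one outgoing edge to the next token, so that a word's out-degree matches its frequency only if sentence-final words also receive an outgoing edge. The vertex name `<END>` and the helper `out_degrees` are hypothetical.

```python
from collections import Counter

END = "<END>"  # hypothetical name for the dummy end vertex

def out_degrees(sentences, use_end_vertex=True):
    """Count outgoing edges per word, one edge per adjacent token pair."""
    degree = Counter()
    for words in sentences:
        path = list(words) + ([END] if use_end_vertex else [])
        for source, _target in zip(path, path[1:]):
            degree[source] += 1
    return degree

sentences = [["the", "dog", "barks"], ["the", "cat", "sleeps"]]
frequency = Counter(word for sent in sentences for word in sent)

with_end = out_degrees(sentences)
without_end = out_degrees(sentences, use_end_vertex=False)

print("matches word frequency with end vertex:", with_end == frequency)
print("out-degree of 'barks' without end vertex:", without_end["barks"])
```

Without the end vertex, sentence-final words such as "barks" have no outgoing edge, so their degree undercounts their occurrences.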
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Leonard, T., Hamel, L., Daniels, N.M., Katenka, N.V. (2018). Assortative Mixture of English Parts of Speech. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds) Complex Networks & Their Applications VI. COMPLEX NETWORKS 2017. Studies in Computational Intelligence, vol 689. Springer, Cham. https://doi.org/10.1007/978-3-319-72150-7_38
DOI: https://doi.org/10.1007/978-3-319-72150-7_38
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72149-1
Online ISBN: 978-3-319-72150-7
eBook Packages: Engineering, Engineering (R0)