Assortative Mixture of English Parts of Speech

Leonard, Timothy; Hamel, Lutz; Daniels, Noah M.; Katenka, Natallia V.

doi:10.1007/978-3-319-72150-7_38

Part of the book series: Studies in Computational Intelligence (SCI, volume 689)

Included in the following conference series:

International Conference on Complex Networks and their Applications

Abstract

Network data analysis is an emerging area of study that applies quantitative analysis to complex data from a variety of application fields. Methods used in network data analysis enable visualization of relational data in the form of graphs and also yield descriptive characteristics and predictive graph models. This paper presents an application of network data analysis to the authorship attribution problem. Specifically, we show how a representation of text as a word graph produces the well-documented feature sets used in authorship attribution tasks, such as the word frequency model and the part-of-speech (POS) bigram model. Analysis of these models, along with word graph characteristics, provides insights into the English language. In particular, analysis of the nominal assortative mixture of parts of speech, a statistic that measures the tendency of words of the same POS in the word network to be connected by an edge, reveals regular structural properties of English grammar.
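The statistic named in the abstract can be illustrated with a minimal sketch. It assumes an undirected, unweighted word graph in which distinct adjacent words share an edge, and it uses Newman's nominal assortativity coefficient, r = (Σ e_ii − Σ a_i²) / (1 − Σ a_i²), where e_ii is the fraction of edge ends joining two words of POS i and a_i is the fraction of edge ends attached to words of POS i. The function names, the toy sentence, and its hand-assigned POS tags are hypothetical; the paper's own graph construction and tagging may differ.

```python
from collections import Counter


def word_graph_edges(tokens):
    """Undirected edges between distinct adjacent words (assumed construction)."""
    edges = set()
    for left, right in zip(tokens, tokens[1:]):
        if left != right:  # skip self-loops such as "very very"
            edges.add(frozenset((left, right)))
    return edges


def nominal_assortativity(edges, category):
    """Newman's assortativity coefficient for a categorical vertex attribute."""
    mixing = Counter()
    for edge in edges:
        u, v = tuple(edge)
        # Count each undirected edge in both directions so the matrix is symmetric.
        mixing[(category[u], category[v])] += 1
        mixing[(category[v], category[u])] += 1
    total = sum(mixing.values())
    labels = {label for pair in mixing for label in pair}
    # Fraction of edge ends that join two vertices of the same category.
    same = sum(mixing[(label, label)] for label in labels) / total
    # Fraction of edge ends attached to each category.
    share = {
        label: sum(n for (a, _), n in mixing.items() if a == label) / total
        for label in labels
    }
    expected = sum(s * s for s in share.values())
    if expected == 1.0:
        raise ValueError("coefficient is undefined when all vertices share one category")
    return (same - expected) / (1.0 - expected)


# Hypothetical toy sentence with hand-assigned Penn Treebank style tags.
tagged = [("the", "DT"), ("dog", "NN"), ("chased", "VBD"), ("a", "DT"), ("cat", "NN")]
tokens = [word for word, _ in tagged]
pos = dict(tagged)  # assumes one tag per word type
r = nominal_assortativity(word_graph_edges(tokens), pos)
print(round(r, 3))  # -0.524
```

In this toy sentence no edge joins two words of the same POS, so the coefficient is negative (disassortative); a value of 1 would mean every edge joins words of the same POS.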

Notes

1. It is necessary for calculating vertex degree to connect words that end a sentence to a dummy end vertex.
2. Excluding symbols and list item markers.
3. Consider the French use of articles le and la.

Author information

Authors and Affiliations

Department of Computer Science and Statistics, University of Rhode Island, Kingston, RI, USA
Timothy Leonard, Lutz Hamel, Noah M. Daniels & Natallia V. Katenka

Corresponding author

Correspondence to Natallia V. Katenka.

Editor information

Editors and Affiliations

University of Lyon 2, Lyon, France
Chantal Cherifi
University of Burgundy, Dijon, France
Hocine Cherifi
École Normale Supérieure de Lyon, Lyon, France
Márton Karsai
University College London, London, United Kingdom
Mirco Musolesi

Cite this paper

Leonard, T., Hamel, L., Daniels, N.M., Katenka, N.V. (2018). Assortative Mixture of English Parts of Speech. In: Cherifi, C., Cherifi, H., Karsai, M., Musolesi, M. (eds) Complex Networks & Their Applications VI. COMPLEX NETWORKS 2017. Studies in Computational Intelligence, vol 689. Springer, Cham. https://doi.org/10.1007/978-3-319-72150-7_38

DOI: https://doi.org/10.1007/978-3-319-72150-7_38
Published: 27 November 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-72149-1
Online ISBN: 978-3-319-72150-7
eBook Packages: Engineering, Engineering (R0)
