Abstract
The amount of textual data available from digitized sources such as free online libraries or social media posts has increased drastically in the last decade. This paper presents the main idea of analyzing authors by their grammatical writing style. In particular, tasks like authorship attribution, plagiarism detection and author profiling are tackled using the presented algorithm, with promising results. All of the presented tasks are ultimately solved by machine learning algorithms.
1 Introduction
One consequence of today's ease of sharing information over the World Wide Web is the high availability of textual data, which is either created by social media users or made publicly available through large literary databases like Project Gutenberg (Note 1). Such data provides a huge source for scientific research in many different areas, including text mining problems like web content mining or sentiment analysis [11], as well as recommender systems based on social media text (e.g., [12]). An important field that has been discussed since the 19th century is authorship attribution [6], which attempts to automatically identify the writer of a text, or information about the writer. Typical metrics to build stylistic fingerprints include lexical features like character n-grams (e.g., [4]), word frequencies (e.g., [2]) or average word/sentence lengths (e.g., [13]), syntactic features like Part-of-Speech (POS) tag frequencies (e.g., [4]), and structural features like average paragraph lengths or indentation usage (e.g., [13]). A related problem emerges from the fact that the vast amount of available text collections makes it easier for a potential plagiarist to find fragments that can be copied. Conversely, it becomes steadily harder for detection systems to find misuse by just comparing text, and thus advanced algorithms have to be developed. This paper gives an overview of our recent grammar-based research in the broad field of author analysis, including authorship attribution, author profiling, plagiarism detection and Bible analysis. All of those applications are based purely on an analysis of the grammatical syntax used by authors and are processed by commonly used machine learning algorithms.
2 Grammar-Based Text Analysis
While constructing sentences, an author has to adhere to the syntactic rules defined by a specific language. Nevertheless, the number of choices is large, which leads to the assumption that writers intuitively reuse preferred patterns to build their sentences. As a consequence, those patterns can be identified and utilized as a style marker. All applications presented in this paper rely on the analysis of sentences without considering the vocabulary used. For each sentence, a parse tree (or syntax tree) is calculated, which consists of structured POS tags and serves as the main processing unit to investigate the style of an author. Figure 1 shows the parse trees of the Einstein quote “Insanity: doing the same thing over and over again and expecting different results” (\(S_1\)) and a slightly modified version (\(S_2\)). It can be seen that the trees differ significantly, although the semantic meaning is the same. To quantify such differences between grammar trees, the concept of pq-grams is used [1]. As a brief simplification, pq-grams can be seen as “n-grams for trees”, as they represent structural parts of the tree. A pq-gram consists of a stem p and a base q, where p defines how many nodes are included vertically and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with \(p=2\) and \(q=3\) starting at the FRAG tag of the tree for \(S_1\) would be [FRAG-S-VP-CC-VP]. In order to obtain all possible pq-grams, the base is additionally shifted left and right, and non-existing nodes are marked with *. Consequently, the pq-grams [FRAG-S-*-*-VP], [FRAG-S-*-VP-CC], [FRAG-S-CC-VP-*] and [FRAG-S-VP-*-*] are also valid. Finally, the pq-gram index contains all possible pq-grams of a grammar tree, starting at each node. Because the presented approaches solely analyze the grammar, the leaves of the trees (i.e., the words) have been omitted. The main procedure is as follows:
1. Clean the document, split it into single sentences, calculate a parse tree for every sentence (Note 2) and compute the corresponding pq-gram index.
2. Create a profile consisting of all (Note 3) occurring pq-grams and transform the profile into a set of features.
3. Use the generated features as input for classifiers in order to, for example, assign authorships or predict the age of a writer.
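The pq-gram extraction in step 1 can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `pq_grams` and the nested-tuple tree representation are hypothetical, and the tree used here is only a simplified fragment around the FRAG node rather than the full parse tree of \(S_1\). The sketch follows the general pq-gram definition from [1], which also pads the stem above the root with *; whether the presented approaches keep those root-padded pq-grams is an assumption.

```python
def pq_grams(tree, p=2, q=3):
    """Return all pq-grams of a tree as strings such as 'FRAG-S-VP-CC-VP'.

    `tree` is a (label, children) pair, where children is a list of
    (label, children) pairs. Words (leaves of the parse tree) are assumed
    to have been removed already, so every label is a POS or phrase tag.
    """
    grams = []

    def visit(node, ancestors):
        label, children = node
        # Stem: the current node plus its p-1 nearest ancestors,
        # padded with '*' where the tree has no ancestor (assumption, see [1]).
        stem = (ancestors + (label,))[-p:]
        stem = ("*",) * (p - len(stem)) + stem
        if not children:
            # A node without children gets a base of q dummy nodes.
            grams.append("-".join(stem + ("*",) * q))
        else:
            # Shift a window of width q over the children, padded on both
            # sides with q-1 dummy nodes.
            labels = [child[0] for child in children]
            padded = ["*"] * (q - 1) + labels + ["*"] * (q - 1)
            for i in range(len(padded) - q + 1):
                grams.append("-".join(stem + tuple(padded[i:i + q])))
        for child in children:
            visit(child, stem)

    visit(tree, ())
    return grams


# Simplified, hypothetical fragment: FRAG -> S -> (VP, CC, VP)
example_tree = ("FRAG", [("S", [("VP", []), ("CC", []), ("VP", [])])])
example_grams = pq_grams(example_tree, p=2, q=3)
for gram in example_grams:
    print(gram)
```

For this fragment, the output contains the five pq-grams with stem FRAG-S that are listed in the text, along with the pq-grams anchored at the root and at the three children of S.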
A profile is calculated by normalizing the number of occurrences of each pq-gram and assigning it a rank by sorting in descending order. Table 1 shows an example using \(p=q=2\). Each profile is then transformed into a set of features which serve as input for machine learning algorithms, whereby each pq-gram results in two features: (a) the pq-gram with its occurrence rate, and (b) the pq-gram with its rank. As an example, the first line of Table 1 would be transformed into the two features {'NP-NN-*-*': 4.07} and {'NP-NN-*-*-RANK': 1}. Depending on the document size, the number of distinct features utilized in the following applications ranges between 1,000 and 15,000. These features have been processed by common classifiers like Naive Bayes or Support Vector Machines (LibSVM), as included in the WEKA framework [5].
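The profile construction and the transformation into features can be sketched as below. This is an illustrative sketch under stated assumptions, not the authors' code: the names `build_profile` and `profile_to_features` are hypothetical, the occurrence rate is assumed to be a percentage of all pq-grams in the document (suggested by the value 4.07 in the example), and ties in frequency are broken alphabetically, which the text does not specify. The optional `max_size` parameter mirrors the restriction mentioned in Note 3.

```python
from collections import Counter


def build_profile(pq_gram_list, max_size=None):
    """Return a list of (pq_gram, rate_in_percent, rank), most frequent first."""
    counts = Counter(pq_gram_list)
    total = sum(counts.values())
    # Sort by descending count; alphabetical tie-break is an assumption.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    if max_size is not None:
        ranked = ranked[:max_size]
    return [
        (gram, 100.0 * count / total, rank)
        for rank, (gram, count) in enumerate(ranked, start=1)
    ]


def profile_to_features(profile):
    """Turn each profile entry into two features: occurrence rate and rank."""
    features = {}
    for gram, rate, rank in profile:
        features[gram] = rate
        features[gram + "-RANK"] = rank
    return features


# Toy input: a hypothetical pq-gram index with p = q = 2
toy_index = ["NP-NN-*-*", "NP-NN-*-*", "NP-NN-*-*", "NP-DT-*-*"]
toy_profile = build_profile(toy_index)
toy_features = profile_to_features(toy_profile)
print(toy_features)
```

The resulting dictionary can be passed to any classifier that accepts named numeric features; the paper itself uses the classifiers shipped with WEKA [5].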
3 Approaches
The presented analysis has been applied to several problem types. It was first used for authorship attribution [8], i.e., it was evaluated whether the author of a document can be predicted by analyzing only the grammatical syntax. Experiments on different datasets reveal promising accuracies between 75 % and 100 %, which are comparable to those of other state-of-the-art approaches. Related to that, several approaches have been developed to reveal potential plagiarism [7]. Using machine-learned classifications of sliding windows, an F-score of up to 40 % (and even 54 % for “short” documents with fewer than 100 sentences) could be achieved, which is a very good value for so-called intrinsic plagiarism detectors. In addition, it could be shown that grammar-based machine learning algorithms can also be used successfully to predict meta-information like the gender or age of an author (accuracy \(\sim \)70 %, [9]), and to attribute authors of Old Hebrew Bible passages [10] with a conformance rate of 80–100 % compared to current knowledge from literary criticism. In summary, grammar analysis in combination with machine learners provides a solid base for tackling the mentioned problems as well as general text analysis problems, as the pq-gram extraction is universally applicable to any written text.
4 Conclusion
This paper gives an overview of the main idea of analyzing authors by investigating the grammatical style used to formulate sentences. The basic principle is to segment a text into sentences, calculate parse trees and extract pq-grams, which represent the structure of the trees. Several approaches in different domains like authorship attribution or profiling reveal promising results by utilizing pq-gram features as input for common classifiers. Future work may focus on fine-tuning the configurations of these classifiers, as currently only the standard settings are used. Although it was shown that the grammatical style is significant, it can additionally be assumed that the existing approaches can be enhanced by incorporating other commonly used features, in particular features which include information about words and vocabulary usage.
Notes
1. https://www.gutenberg.org, visited April 2016.
2. Using the Stanford Parser [3].
3. Depending on the approach, the total maximum number of pq-grams in a profile has been restricted, e.g., to the 200 most frequent pq-grams.
References
1. Augsten, N., Böhlen, M., Gamper, J.: The pq-gram distance between ordered labeled trees. ACM Trans. Database Syst. (TODS) 35(1), 4 (2010)
2. Holmes, D.I.: The evolution of stylometry in humanities scholarship. Literary Linguist. Comput. 13(3), 111–117 (1998)
3. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on ACL, Sapporo, Japan, pp. 423–430 (2003)
4. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
5. Hall, M., et al.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
6. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
7. Tschuggnall, M., Specht, G.: Using grammar-profiles to intrinsically expose plagiarism in text documents. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds.) NLDB 2013. LNCS, vol. 7934, pp. 297–302. Springer, Heidelberg (2013)
8. Tschuggnall, M., Specht, G.: Enhancing authorship attribution by utilizing syntax tree profiles. In: Proceedings of the 14th Conference of the European Chapter of the ACL (EACL), Gothenburg, Sweden, pp. 195–199, April 2014
9. Tschuggnall, M., Specht, G.: On the potential of grammar features for automated author profiling. Adv. Intell. Syst. 8(3&4), 255–265 (2015)
10. Tschuggnall, M., Specht, G., Riepl, C.: Algorithmisch unterstützte Literarkritik. Memorialband Richter, ATSAT 100, St. Ottilien (2016, to appear)
11. Vinodhini, G., Chandrasekaran, R.: Sentiment analysis, opinion mining: a survey. Int. J. 2(6) (2012)
12. Zangerle, E., Gassler, W., Specht, G.: On the impact of text similarity functions on hashtag recommendations in microblogging environments. Soc. Netw. Anal. Min. 3(4), 889–898 (2013)
13. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006)
© 2016 Springer International Publishing AG
Cite this paper
Tschuggnall, M., Specht, G. (2016). From Plagiarism Detection to Bible Analysis: The Potential of Machine Learning for Grammar-Based Text Analysis. In: Berendt, B., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science(), vol 9853. Springer, Cham. https://doi.org/10.1007/978-3-319-46131-1_27
DOI: https://doi.org/10.1007/978-3-319-46131-1_27
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46130-4
Online ISBN: 978-3-319-46131-1
eBook Packages: Computer Science (R0)