1 Introduction

One consequence of how easily information can be shared over the World Wide Web today is the high availability of textual data, which is either created by social media users or made publicly available through large literary databases such as Project Gutenberg. Such data provides a huge source for scientific research in many areas, including text mining problems like web content mining or sentiment analysis [11], as well as recommender systems based on social media text (e.g., [12]). A field that remains important, has been discussed since the 19th century, and attempts to automatically detect the writer of a text (or information about the writer) is authorship attribution [6]. Typical metrics for building stylistic fingerprints include lexical features like character n-grams (e.g., [4]), word frequencies (e.g., [2]) or average word/sentence lengths (e.g., [13]), syntactic features like Part-of-Speech (POS) tag frequencies (e.g., [4]), and structural features like average paragraph lengths or indentation usage (e.g., [13]). A related problem emerges from the fact that the vast amount of available text makes it easier for a potential plagiarist to find fragments that can be copied. Conversely, it becomes steadily harder for detection systems to find misuse by just comparing texts, and thus more advanced algorithms have to be developed. This paper gives an overview of our recent grammar-based research in the broad field of author analysis, including authorship attribution, profiling, plagiarism detection and Bible analysis. All of these applications are based purely on an analysis of the grammar (syntax) used by authors, with the resulting features processed by commonly used machine learning algorithms.

Fig. 1. Parse trees of sentences \(S_1\) and \(S_2\).

2 Grammar-Based Text Analysis

While constructing sentences, an author has to adhere to the syntactic rules defined by a specific language. Nevertheless, the number of choices is large, which leads to the assumption that writers intuitively reuse preferred patterns to build their sentences. As a consequence, those patterns can be identified and utilized as a style marker. All applications presented in this paper rely on the analysis of sentences without considering the vocabulary used. For each sentence, a parse tree (or syntax tree) is computed, which consists of structured POS tags and serves as the main processing unit to investigate the style of an author. Figure 1 shows the parse trees of the Einstein quote “Insanity: doing the same thing over and over again and expecting different results” (\(S_1\)) and a slightly modified version (\(S_2\)). It can be seen that the trees differ significantly, although the semantic meaning is the same. To quantify such differences between grammar trees, the concept of pq-grams is used [1]. As a brief simplification, pq-grams can be seen as “n-grams for trees”, as they represent structural parts of the tree. A pq-gram consists of a stem p and a base q, whereby p defines how many nodes are included vertically, and q defines the number of nodes to be considered horizontally. For example, a valid pq-gram with \(p=2\) and \(q=3\) starting at the FRAG tag of the tree for \(S_1\) would be [FRAG-S-VP-CC-VP]. In order to obtain all possible pq-grams, the base is additionally shifted left and right, marking non-existing nodes with *. Consequently, the pq-grams [FRAG-S-*-*-VP], [FRAG-S-*-VP-CC], [FRAG-S-CC-VP-*] and [FRAG-S-VP-*-*] are also valid. Finally, the pq-gram index contains all possible pq-grams of a grammar tree, starting at each node. Because the presented approaches solely analyze the grammar, the leaves of the trees (i.e., the words) have been omitted. The main procedure is as follows:

  1. Clean the document, split it into single sentences, calculate a parse tree for every sentence and compute the corresponding pq-gram index.

  2. Create a profile consisting of all occurring pq-grams and transform the profile into a set of features.

  3. Use the generated features as input for classifiers in order to, for example, assign authorships or predict the age of a writer.
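The pq-gram index computed in step 1 can be illustrated with a minimal Python sketch. It is not the implementation used in the experiments: the tuple-based tree representation, the function name `pq_gram_index` and the padding of stems above the root with * are assumptions made for illustration, and the toy tree only mirrors the FRAG fragment discussed above rather than the full tree of \(S_1\).

```python
from collections import Counter

def pq_gram_index(tree, p=2, q=3, pad="*"):
    """Count all pq-grams of a tree given as (label, [children]) tuples.

    Words (leaves of the original parse tree) are assumed to be removed already.
    """
    index = Counter()

    def visit(node, ancestors):
        label, children = node
        # Stem: the current node plus its p-1 nearest ancestors,
        # padded at the top when the node is close to the root (assumption).
        stem = (ancestors + [label])[-p:]
        stem = [pad] * (p - len(stem)) + stem
        if children:
            # Base: slide a window of width q over the child labels,
            # padded with q-1 dummy nodes on both sides.
            labels = [pad] * (q - 1) + [c[0] for c in children] + [pad] * (q - 1)
            for i in range(len(labels) - q + 1):
                index["-".join(stem + labels[i:i + q])] += 1
        else:
            # A node without children contributes one pq-gram with an empty base.
            index["-".join(stem + [pad] * q)] += 1
        for child in children:
            visit(child, ancestors + [label])

    visit(tree, [])
    return index

# Hypothetical fragment: FRAG -> S -> (VP, CC, VP)
tree = ("FRAG", [("S", [("VP", []), ("CC", []), ("VP", [])])])
index = pq_gram_index(tree, p=2, q=3)
for gram, count in sorted(index.items()):
    print(gram, count)
```

For the toy fragment, the index contains the five pq-grams listed above for the stem FRAG-S, plus those anchored at the root and at the childless VP and CC nodes.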

Table 1. Example of a pq-gram Profile.

A profile is calculated by normalizing the occurrence count of each pq-gram and assigning it a rank by sorting in descending order. Table 1 shows an example using \(p=q=2\). Each profile is then transformed into a set of features which serve as input for machine learning algorithms, whereby each pq-gram results in two features: (a) the pq-gram with its occurrence rate, and (b) the pq-gram with its rank. As an example, the first line of Table 1 would be transformed into the two features {'NP-NN-*-*': 4.07} and {'NP-NN-*-*-RANK': 1}. Depending on the document size, the number of distinct features utilized in the following applications ranges between 1,000 and 15,000. These features have been processed by common classifiers like Naive Bayes or Support Vector Machines (LibSVM), as included in the WEKA framework [5].
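The profile and feature construction can be sketched as follows. This is an illustrative sketch only: the function names `build_profile` and `to_features`, the interpretation of the occurrence rate as a percentage of all pq-grams, the alphabetical tie-breaking and the example counts are assumptions, not details given in the text.

```python
from collections import Counter

def build_profile(counts):
    """Return (pq-gram, occurrence rate in %, rank) triples, most frequent first."""
    total = sum(counts.values())
    # Sort by descending count; ties are broken alphabetically (assumption).
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [
        (gram, 100.0 * count / total, rank)
        for rank, (gram, count) in enumerate(ordered, start=1)
    ]

def to_features(profile):
    """Turn each profile entry into two features: occurrence rate and rank."""
    features = {}
    for gram, rate, rank in profile:
        features[gram] = rate
        features[gram + "-RANK"] = rank
    return features

# Hypothetical pq-gram counts for p = q = 2
counts = Counter({"NP-NN-*-*": 5, "PP-IN-*-*": 3, "NP-DT-*-*": 2})
profile = build_profile(counts)
features = to_features(profile)
print(features)
```

The resulting dictionary can be written out as one feature vector per document, for example as a row in an input file for a classifier.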

3 Approaches

The presented analysis has been applied to several problem types. It was first used for authorship attribution [8], i.e., it was evaluated whether the author of a document can be predicted by analyzing only the grammar syntax. Experiments on different datasets reveal promising accuracies between 75 % and 100 %, which are comparable to other state-of-the-art approaches. Related to that, several approaches have been developed to reveal potential plagiarism [7]. Using machine-learned classifications of sliding windows, an F-score of up to 40 % (for “short” documents with fewer than 100 sentences even 54 %) could be achieved, which is a very good value for so-called intrinsic plagiarism detectors. In addition, it could be shown that grammar-based machine learning algorithms can also be used successfully to predict meta-information like the gender or age of an author (accuracy of about 70 %, [9]), and to attribute authors of Old Hebrew Bible passages [10] with a conformance rate of 80–100 % compared to current literary criticism knowledge. In summary, grammar analysis in combination with machine learners provides a solid base for tackling the mentioned problems as well as general text analysis problems, as the pq-gram extraction is universally applicable to any written text.

4 Conclusion

This paper gives an overview of the main idea of analyzing authors by investigating the grammar style used to formulate sentences. The basic principle is to segment a text into sentences, calculate parse trees and extract pq-grams, which represent the structure of the trees. Several approaches in different domains like authorship attribution or profiling reveal promising results by utilizing pq-gram features as input for common classifiers. Future work may focus on fine-tuning the configurations of these classifiers, as currently only the standard settings are used. Although it was shown that the grammar style is significant, it can additionally be assumed that the existing approaches can be enhanced by incorporating other commonly used features, in particular features which include information about words and vocabulary usage.