From Plagiarism Detection to Bible Analysis: The Potential of Machine Learning for Grammar-Based Text Analysis

Tschuggnall, Michael; Specht, Günther

doi:10.1007/978-3-319-46131-1_27

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 9853)

Included in the following conference series:

Joint European Conference on Machine Learning and Knowledge Discovery in Databases

Abstract

The amount of textual data available from digitized sources such as free online libraries or social media posts has increased drastically in the last decade. This paper presents the main idea of analyzing authors by their grammatical writing style. In particular, tasks like authorship attribution, plagiarism detection and author profiling are tackled using the presented algorithm, with promising results. All of the presented approaches ultimately rely on machine learning algorithms.

1 Introduction

One consequence of today's ease of sharing information over the World Wide Web is the high availability of textual data, which is either created by social media users or made publicly available through large literary databases like Project Gutenberg (see Note 1). Such data provides a rich source for scientific research in many areas, including text mining problems like web content mining or sentiment analysis [11], as well as social media text-based recommender systems (e.g., [12]). An important field that has been discussed since the 19th century is authorship attribution [6], which attempts to automatically identify the writer of a text, or information about the writer. Typical metrics used to build stylistic fingerprints include lexical features like character n-grams (e.g., [4]), word frequencies (e.g., [2]) or average word/sentence lengths (e.g., [13]), syntactic features like Part-of-Speech (POS) tag frequencies (e.g., [4]), and structural features like average paragraph lengths or indentation usage (e.g., [13]).

A related problem emerges from the fact that the vast amount of available text collections makes it easier for a potential plagiarist to find fragments that can be copied. Conversely, it becomes steadily harder for detection systems to find misuse by merely comparing texts, and thus more advanced algorithms have to be developed.

This paper gives an overview of our recent grammar-based research in the broad field of author analysis, including authorship attribution, profiling, plagiarism detection and Bible analysis. All of these applications are based purely on an analysis of the grammatical syntax used by authors, processed by commonly used machine learning algorithms.

2 Grammar-Based Text Analysis

While constructing sentences, an author has to adhere to the syntactic rules defined by a specific language. Nevertheless, the number of choices is large, which leads to the assumption that writers intuitively reuse preferred patterns to build their sentences. As a consequence, those patterns can be identified and utilized as a style marker. All applications presented in this paper rely on the analysis of sentences without considering the vocabulary used. To this end, a parse tree (or syntax tree) is calculated for each sentence. It consists of structured POS tags and serves as the main processing unit to investigate the style of an author. Figure 1 shows the parse trees of the Einstein quote “Insanity: doing the same thing over and over again and expecting different results” (\(S_1\)) and a slightly modified version (\(S_2\)). It can be seen that the trees differ significantly, although the semantic meaning is the same.

To quantify such differences between grammar trees, the concept of pq-grams is used [1]. As a brief simplification, pq-grams can be seen as “n-grams for trees”, as they represent structural parts of the tree. A pq-gram consists of a stem p and a base q, whereby p defines how many nodes are included vertically (a node together with its ancestors), and q defines the number of nodes considered horizontally (consecutive children of the lowest stem node). For example, a valid pq-gram with \(p=2\) and \(q=3\) starting at the FRAG tag of the tree for \(S_1\) would be [FRAG-S-VP-CC-VP]. In order to obtain all possible pq-grams, the base is additionally shifted left and right, with non-existing nodes marked by *. Consequently, the pq-grams [FRAG-S-*-*-VP], [FRAG-S-*-VP-CC], [FRAG-S-CC-VP-*] and [FRAG-S-VP-*-*] are valid as well. Finally, the pq-gram index contains all possible pq-grams of a grammar tree, starting at each node. Because the presented approaches solely analyze the grammar, the leaves of the trees (i.e., the words) have been omitted. The main procedure is as follows:

1. Clean the document, split it into single sentences, calculate a parse tree for every sentence (see Note 2) and compute the corresponding pq-gram index.
2. Create a profile consisting of all (see Note 3) occurring pq-grams and transform the profile into a set of features.
3. Use the generated features as input for classifiers in order to, for example, assign authorships or predict the age of a writer.
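The pq-gram extraction in step 1 can be illustrated with a minimal Python sketch. It is not the original implementation: the function name `pq_gram_index`, the nested-tuple tree representation and the small example tree are assumptions made for illustration only. The example tree is a hand-built fragment containing just the FRAG, S, VP, CC and VP nodes mentioned above, not the full parse tree of \(S_1\), and no parser is invoked.

```python
from typing import List, Tuple

# Hypothetical tree representation: (label, [children]); words are already omitted.
Tree = Tuple[str, list]


def pq_gram_index(tree: Tree, p: int = 2, q: int = 3) -> List[str]:
    """Collect all pq-grams of a tree, using '*' for non-existing nodes."""
    grams: List[str] = []

    def visit(node: Tree, ancestors: List[str]) -> None:
        label, children = node
        # Stem: the current node preceded by its p-1 closest ancestors.
        stem = (ancestors + [label])[-p:]
        if not children:
            # A node without children has a base consisting only of '*'.
            grams.append("-".join(stem + ["*"] * q))
            return
        base = ["*"] * q
        for child in children:
            # Shift the base window one child to the right.
            base = base[1:] + [child[0]]
            grams.append("-".join(stem + base))
            visit(child, stem)
        for _ in range(q - 1):
            # Keep shifting until only the last child remains in the window.
            base = base[1:] + ["*"]
            grams.append("-".join(stem + base))

    visit(tree, ["*"] * p)
    return grams


# Assumed fragment of a parse tree (illustration only).
example_tree: Tree = ("FRAG", [("S", [("VP", []), ("CC", []), ("VP", [])])])
index = pq_gram_index(example_tree, p=2, q=3)
stem_frag_s = [g for g in index if g.startswith("FRAG-S-")]
```

For this fragment, `stem_frag_s` contains exactly the five pq-grams with stem FRAG-S listed above, i.e. the unshifted pq-gram and its four shifted variants.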

Table 1. Example of a pq-gram Profile.

A profile is calculated by normalizing the number of occurrences of each pq-gram and assigning each pq-gram a rank by sorting in descending order of frequency. Table 1 shows an example using \(p=q=2\). Each profile is then transformed into a set of features which serve as input for machine learning algorithms, whereby each pq-gram results in two features: (a) the pq-gram with its occurrence rate, and (b) the pq-gram with its rank. As an example, the first line of Table 1 would be transformed into the two features {'NP-NN-*-*': 4.07} and {'NP-NN-*-*-RANK': 1}. Depending on the document size, the number of distinct features utilized in the following applications ranges between 1,000 and 15,000. These features have been processed by common classifiers like Naive Bayes or Support Vector Machines (LibSVM), which are included in the WEKA framework [5].
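The transformation of a pq-gram index into a profile and a feature set can be sketched as follows. This is an illustrative sketch rather than the original implementation: the function names `build_profile` and `to_features` are hypothetical, the occurrence rate is assumed to be a percentage of all pq-grams in the document, ties are assumed to be broken alphabetically with distinct ranks, and the optional `max_size` cut-off mirrors the restriction mentioned in Note 3.

```python
from collections import Counter
from typing import Dict, List, Optional, Tuple


def build_profile(
    pq_grams: List[str], max_size: Optional[int] = None
) -> List[Tuple[str, float, int]]:
    """Return (pq-gram, occurrence rate in percent, rank) sorted by frequency."""
    counts = Counter(pq_grams)
    total = sum(counts.values())
    # Assumption: sort by descending count, ties broken alphabetically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    if max_size is not None:
        # Assumption: keep only the most frequent pq-grams.
        ranked = ranked[:max_size]
    return [
        (gram, 100.0 * n / total, rank)
        for rank, (gram, n) in enumerate(ranked, start=1)
    ]


def to_features(profile: List[Tuple[str, float, int]]) -> Dict[str, float]:
    """Each pq-gram yields two features: its occurrence rate and its rank."""
    features: Dict[str, float] = {}
    for gram, rate, rank in profile:
        features[gram] = rate
        features[gram + "-RANK"] = rank
    return features


# Toy input (made up for illustration, not data from the paper).
toy_index = ["NP-NN-*-*"] * 3 + ["PP-IN-*-*"]
features = to_features(build_profile(toy_index))
```

In this toy input, 'NP-NN-*-*' accounts for three of four pq-grams, so it receives an occurrence rate of 75.0 and rank 1. The resulting dictionary could then be passed to any classifier that accepts named numeric features.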

3 Approaches

The presented analysis has been applied to several problem types. It was first used for authorship attribution [8], i.e., it was evaluated whether the author of a document can be predicted by analyzing only the grammar syntax. Experiments on different datasets reveal promising accuracies between 75 % and 100 %, which is comparable to other state-of-the-art approaches. Related to that, several approaches have been developed to reveal potential plagiarism [7]. Using machine-learned classifications of sliding windows, an accuracy (F-score) of up to 40 % could be achieved (and even 54 % for “short” documents with fewer than 100 sentences), which is a very good value for so-called intrinsic plagiarism detectors, i.e., detectors that inspect only the suspicious document itself without comparing it to external sources. In addition, it could be shown that grammar-based machine learning algorithms can also be used successfully to predict meta-information like the gender or age of an author (accuracy \(\sim\)70 %, [9]), and to attribute authors of Old Hebrew Bible passages [10] with a conformance rate of 80–100 % compared to current knowledge in literary criticism. In summary, grammar analysis in combination with machine learners provides a solid base for tackling the mentioned problems as well as general text analysis problems, as the pq-gram extraction is universally applicable to any written text.

4 Conclusion

This paper gives an overview of the main idea of analyzing authors by investigating the grammar style used to formulate sentences. The basic principle is to segment a text into sentences, calculate parse trees and extract pq-grams, which represent the structure of the trees. Several approaches in different domains like authorship attribution or profiling reveal promising results by utilizing pq-gram features as input for common classifiers. Future work may focus on fine-tuning the configurations of these classifiers, as currently only the standard settings are used. Although it was shown that the grammar style is significant, it can additionally be assumed that the existing approaches can be enhanced by incorporating other commonly used features, in particular features which include information about words and vocabulary usage.

Notes

1. https://www.gutenberg.org, visited April 2016.
2. Using the Stanford Parser [3].
3. Depending on the approach, the total maximum number of pq-grams in a profile has been restricted, e.g., to the 200 most frequent pq-grams.

References

1. Augsten, N., Böhlen, M., Gamper, J.: The pq-gram distance between ordered labeled trees. ACM Trans. Database Syst. (TODS) 35(1), 4 (2010)
2. Holmes, D.I.: The evolution of stylometry in humanities scholarship. Literary Linguist. Comput. 13(3), 111–117 (1998)
3. Klein, D., Manning, C.D.: Accurate unlexicalized parsing. In: Proceedings of the 41st Annual Meeting on ACL, Sapporo, Japan, pp. 423–430 (2003)
4. Koppel, M., Schler, J., Argamon, S.: Computational methods in authorship attribution. J. Am. Soc. Inf. Sci. Technol. 60(1), 9–26 (2009)
5. Hall, M., et al.: The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 11(1), 10–18 (2009)
6. Stamatatos, E.: A survey of modern authorship attribution methods. J. Am. Soc. Inf. Sci. Technol. 60(3), 538–556 (2009)
7. Tschuggnall, M., Specht, G.: Using grammar-profiles to intrinsically expose plagiarism in text documents. In: Métais, E., Meziane, F., Saraee, M., Sugumaran, V., Vadera, S. (eds.) NLDB 2013. LNCS, vol. 7934, pp. 297–302. Springer, Heidelberg (2013)
8. Tschuggnall, M., Specht, G.: Enhancing authorship attribution by utilizing syntax tree profiles. In: Proceedings of the 14th Conference of the European Chapter of the ACL (EACL), Gothenburg, Sweden, pp. 195–199, April 2014
9. Tschuggnall, M., Specht, G.: On the potential of grammar features for automated author profiling. Adv. Intell. Syst. 8(3&4), 255–265 (2015)
10. Tschuggnall, M., Specht, G., Riepl, C.: Algorithmisch unterstützte Literarkritik. Memorialband Richter, ATSAT 100, St. Ottilien (2016, to appear)
11. Vinodhini, G., Chandrasekaran, R.: Sentiment analysis, opinion mining: a survey. Int. J. 2(6) (2012)
12. Zangerle, E., Gassler, W., Specht, G.: On the impact of text similarity functions on hashtag recommendations in microblogging environments. Soc. Netw. Anal. Min. 3(4), 889–898 (2013)
13. Zheng, R., Li, J., Chen, H., Huang, Z.: A framework for authorship identification of online messages: writing-style features and classification techniques. J. Am. Soc. Inf. Sci. Technol. 57(3), 378–393 (2006)

Author information

Authors and Affiliations

Department of Computer Science, University of Innsbruck, Innsbruck, Austria
Michael Tschuggnall & Günther Specht

Corresponding author

Correspondence to Michael Tschuggnall.

Editor information

Editors and Affiliations

Department of Computer Science, KU Leuven, Leuven, Belgium
Bettina Berendt
Deloitte GmbH, München, Germany
Björn Bringmann
Laboratoire Hubert Curien, Jean Monnet University, Saint-Etienne, France
Élisa Fromont
Allianz SE, Munich, Germany
Gemma Garriga
Max-Planck-Institute for Informatics, Saarbrücken, Germany
Pauli Miettinen
Aalto University School of Science, Espoo, Finland
Nikolaj Tatti
Siemens AG & Lud. Max. Univ. of Munich, Munich, Germany
Volker Tresp

About this paper

Cite this paper

Tschuggnall, M., Specht, G. (2016). From Plagiarism Detection to Bible Analysis: The Potential of Machine Learning for Grammar-Based Text Analysis. In: Berendt, B., et al. Machine Learning and Knowledge Discovery in Databases. ECML PKDD 2016. Lecture Notes in Computer Science, vol 9853. Springer, Cham. https://doi.org/10.1007/978-3-319-46131-1_27

DOI: https://doi.org/10.1007/978-3-319-46131-1_27
Published: 03 September 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-46130-4
Online ISBN: 978-3-319-46131-1
eBook Packages: Computer Science, Computer Science (R0)
