An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

  • Original Paper
  • Published in: Evolving Systems

Abstract

The ever-increasing volume of information, driven by the widespread use of computers and the web, has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that seem to differ significantly from it. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, data imbalance as a crucial aspect of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced-dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared with the best-performing detection systems on these datasets and achieves the best resulting scores.
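
To make the idea of a balancing technique concrete, the following is a minimal sketch of random oversampling, one simple way to even out class frequencies before training a classifier. It is an illustration only: the abstract does not say which balancing techniques the authors used, and the function name `random_oversample`, the toy feature values, and the labels (0 for original passages, 1 for plagiarized ones) are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class rows until every class
    has as many rows as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        # All original rows that belong to this class.
        pool = [x for x, y in zip(features, labels) if y == cls]
        # Add (target - n) resampled copies; zero for the majority class.
        for _ in range(target - n):
            out_x.append(rng.choice(pool))
            out_y.append(cls)
    return out_x, out_y


# Toy data (made up): 8 "original" passages (label 0) and 2 "plagiarized"
# passages (label 1), each described by two invented stylistic features.
X = [[0.1 * i, 1.0] for i in range(8)] + [[5.0, 2.0], [6.0, 2.5]]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
balanced_counts = Counter(y_bal)
print(balanced_counts)
```

In a real experiment, resampling of this kind would be applied to the training split only, so that the evaluation data keeps its natural class distribution.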

Notes

  1. http://pan.webis.de/.

  2. http://www.gutenberg.org.

  3. https://opennlp.apache.org/

  4. After experimentation, we consider a word frequent if its word frequency class value is lower than 1.8, with respect to Eq. 2.

  5. http://scikit-learn.org


Acknowledgements

Many thanks to Panagiotis Christou, whose comments were essential in overcoming some problems during the design of this system.

Author information

Corresponding author

Correspondence to Andrianna Polydouri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Cite this article

Polydouri, A., Vathi, E., Siolas, G. et al. An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection. Evolving Systems 11, 503–515 (2020). https://doi.org/10.1007/s12530-018-9232-1

  • DOI: https://doi.org/10.1007/s12530-018-9232-1
