An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection

Polydouri, Andrianna; Vathi, Eleni; Siolas, Georgios; Stafylopatis, Andreas

doi:10.1007/s12530-018-9232-1

Original Paper
Published: 13 July 2018

Evolving Systems, Volume 11, pages 503–515 (2020)

Andrianna Polydouri¹ (ORCID: orcid.org/0000-0003-0715-4355), Eleni Vathi¹, Georgios Siolas¹ & Andreas Stafylopatis¹

Abstract

The ever-increasing volume of information brought about by the widespread use of computers and the web has made effective plagiarism detection methods a necessity. Plagiarism occurs in many settings and forms: in literature, in academic papers, even in programming code. Intrinsic plagiarism detection is the task of discovering plagiarized passages in a text document by identifying stylistic changes and inconsistencies within the document itself, given that no reference corpus is available. The main idea is to profile the style of the original author and mark the passages that differ significantly from it. In this work, we follow a supervised machine learning classification approach. We consider, for the first time, the imbalance of the data as a crucial parameter of the problem and experiment with various balancing techniques. In addition, we propose some novel stylistic features. We combine our features and imbalanced dataset treatment with various classification methods. Our detection system is tested on the data corpora of the PAN Webis intrinsic plagiarism detection shared tasks. It is compared to the best-performing detection systems on these datasets and achieves the best resulting scores.
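To make the idea of balancing concrete, here is a minimal sketch of one common technique, random oversampling of the minority class. The abstract does not say which balancing techniques were tried, so this is an illustration only and not necessarily the authors' method. The function name `random_oversample`, the toy feature vectors, and the label convention (0 for original passages, 1 for plagiarized ones) are all assumptions made for this example.

```python
import random
from collections import Counter


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen samples of the smaller classes
    until every class has as many samples as the largest one."""
    rng = random.Random(seed)

    # Group feature vectors by class label.
    by_class = {}
    for x, y in zip(features, labels):
        by_class.setdefault(y, []).append(x)

    target = max(len(rows) for rows in by_class.values())

    out_x, out_y = [], []
    for y, rows in by_class.items():
        # Draw the missing samples with replacement from the same class.
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        for x in rows + extra:
            out_x.append(x)
            out_y.append(y)
    return out_x, out_y


# Toy data (hypothetical): 8 original passages (label 0) and
# 2 plagiarized passages (label 1), each with two made-up style features.
X = [[0.1 * i, 1.0] for i in range(8)] + [[5.0, 0.2], [5.5, 0.1]]
y = [0] * 8 + [1] * 2

X_bal, y_bal = random_oversample(X, y)
print(Counter(y), "->", Counter(y_bal))
```

Balancing of this kind would normally be applied to the training split only, so that evaluation still reflects the original class distribution.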

Notes

http://pan.webis.de/.
http://www.gutenberg.org.
https://opennlp.apache.org/
After experiments, we denote a word as frequent if its word frequency class value is lower than 1.8, with respect to Eq. 2.
http://scikit-learn.org
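The frequent-word rule in the note above can be sketched as follows. Eq. 2 is not reproduced on this page, so the formula here is an assumption: it uses the common definition of word frequency class, log2(f(w*) / f(w)), where w* is the most frequent word, and leaves the value unrounded so that a threshold of 1.8 is meaningful. The function name `word_frequency_classes` is hypothetical. The toy example counts words in a single sentence, whereas in practice the frequencies would usually come from a large reference corpus.

```python
import math
from collections import Counter


def word_frequency_classes(tokens):
    """Map each word to log2(f(w*) / f(w)), where f is the word count
    and w* is the most frequent word (assumed form of Eq. 2)."""
    counts = Counter(tokens)
    top = max(counts.values())
    return {word: math.log2(top / count) for word, count in counts.items()}


FREQUENT_THRESHOLD = 1.8  # threshold value stated in the note

tokens = "the cat and the dog and the bird saw the fox".split()
classes = word_frequency_classes(tokens)

# A word counts as frequent when its class value is below the threshold.
frequent = sorted(w for w, c in classes.items() if c < FREQUENT_THRESHOLD)
print(frequent)
```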

Acknowledgements

Many thanks to Panagiotis Christou, whose comments helped substantially in overcoming some problems during the design of this system.

Author information

Authors and Affiliations

Intelligent Systems, Content and Interaction Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens, Athens, Greece
Andrianna Polydouri, Eleni Vathi, Georgios Siolas & Andreas Stafylopatis

Corresponding author

Correspondence to Andrianna Polydouri.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Polydouri, A., Vathi, E., Siolas, G. et al. An efficient classification approach in imbalanced datasets for intrinsic plagiarism detection. Evolving Systems 11, 503–515 (2020). https://doi.org/10.1007/s12530-018-9232-1

Received: 05 January 2018
Accepted: 13 May 2018
Published: 13 July 2018
Issue Date: September 2020
DOI: https://doi.org/10.1007/s12530-018-9232-1
