Paraphrase plagiarism identification with character-level features

Sánchez-Vega, Fernando; Villatoro-Tello, Esaú; Montes-y-Gómez, Manuel; Rosso, Paolo; Stamatatos, Efstathios; Villaseñor-Pineda, Luis

doi:10.1007/s10044-017-0674-z

Theoretical Advances
Published: 21 December 2017

Volume 22, pages 669–681 (2019)
Pattern Analysis and Applications

Fernando Sánchez-Vega¹,
Esaú Villatoro-Tello² (ORCID: orcid.org/0000-0002-1322-0358),
Manuel Montes-y-Gómez¹,
Paolo Rosso³,
Efstathios Stamatatos⁴ &
Luis Villaseñor-Pineda¹

Abstract

Several methods have been proposed for determining plagiarism between pairs of sentences, passages or even full documents. However, most of these methods fail to reliably detect paraphrase plagiarism because the task is highly complex, even for human beings. Paraphrase plagiarism identification consists of automatically recognizing document fragments that contain reused text which has been intentionally hidden by rewording practices such as semantic equivalences, discursive changes, and morphological or lexical substitutions. Our main hypothesis is that the original author's writing style fingerprint persists in the plagiarized text even when paraphrases occur. Thus, in this paper we propose a novel text representation scheme that captures both content and style characteristics of texts by means of character-level features. As an additional contribution, we describe the methodology followed to construct an appropriate corpus for the task of paraphrase plagiarism identification, which represents a valuable new resource for the NLP community for future research in this field.
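
To make the idea of character-level features concrete, the following is a minimal sketch of comparing two passages through character n-gram counts and cosine similarity. It is a generic illustration, not the representation scheme proposed in the paper, whose details are not given in this abstract. The function names `char_ngrams` and `cosine_similarity`, the choice of n = 3, the lowercasing and whitespace normalization, and the example sentences are all assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def char_ngrams(text: str, n: int = 3) -> Counter:
    """Count overlapping character n-grams (hypothetical helper).

    Assumption: text is lowercased and whitespace runs are collapsed
    before extraction; the paper may normalize differently.
    """
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse n-gram count vectors."""
    dot = sum(count * b[gram] for gram, count in a.items())
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile shares nothing with any other profile
    return dot / (norm_a * norm_b)


# Illustrative sentences (invented for this sketch, not taken from any corpus).
source = "The committee rejected the proposal because it was too expensive."
paraphrase = "Because the proposal was too costly, the committee rejected it."
unrelated = "Heavy rain is expected across the northern coast tomorrow."

sim_paraphrase = cosine_similarity(char_ngrams(source), char_ngrams(paraphrase))
sim_unrelated = cosine_similarity(char_ngrams(source), char_ngrams(unrelated))

print(f"source vs paraphrase: {sim_paraphrase:.3f}")
print(f"source vs unrelated:  {sim_unrelated:.3f}")
```

Because character n-grams span word fragments, punctuation and function words, a reordered paraphrase can still share many of them with its source, which is the kind of signal a character-level representation could exploit. A real system would need a decision threshold or a trained classifier on top of such similarities.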

Notes

The PAN competition (http://pan.webis.de)
The P4P corpus and guidelines used for its annotation are available at http://clic.ub.edu/corpus/en/paraphrases-en
A large lexical database of English (https://wordnet.princeton.edu/).
http://www.uni-weimar.de/en/media/chairs/webis/corpora/pan-pc-10/#webis-download
Workers from the Amazon Mechanical Turk (https://www.mturk.com/mturk/welcome) perform simple tasks in exchange for a monetary reward.
The second categorization level consists of four subclasses and two classes without subclasses.
http://ccc.inaoep.mx/~mmontesg/resources/corpusP4PIN.zip
We used the implementation provided by Weka (http://www.cs.waikato.ac.nz/ml/weka/).
For this experiment, we keep the best configuration obtained using the Brown corpus.
The LCS performance is constant since the n parameter is not applicable for it.

Acknowledgements

This work is the result of a collaboration within the framework of the CONACyT Thematic Networks program (RedTTL Language Technologies Network) and the WIQ-EI IRSES project (Grant No. 269180) within the FP7 Marie Curie action. The first author was supported by CONACyT (Scholarship 258345/224483). The second, third, and sixth authors were partially supported by CONACyT (Project Grants 258588 and 2410). The work of the fourth author was partially supported by the SomEMBED TIN2015-71147-C2-1-P MINECO research project and by the Generalitat Valenciana under the Grant ALMAMATER (PrometeoII/2014/030).

Author information

Authors and Affiliations

Computer Science Department, Instituto Nacional de Astrofísica, Óptica y Electrónica, Puebla, Mexico
Fernando Sánchez-Vega, Manuel Montes-y-Gómez & Luis Villaseñor-Pineda
Information Technologies Department, Universidad Autónoma Metropolitana (UAM) Unidad Cuajimalpa, Ciudad de México, Mexico
Esaú Villatoro-Tello
PRHLT Research Center, Universitat Politècnica de València, València, Spain
Paolo Rosso
Department of Information and Communication Systems Engineering, University of the Aegean, Samos, Greece
Efstathios Stamatatos

Corresponding author

Correspondence to Esaú Villatoro-Tello.

About this article

Cite this article

Sánchez-Vega, F., Villatoro-Tello, E., Montes-y-Gómez, M. et al. Paraphrase plagiarism identification with character-level features. Pattern Anal Applic 22, 669–681 (2019). https://doi.org/10.1007/s10044-017-0674-z

Received: 10 March 2017
Accepted: 10 December 2017
Published: 21 December 2017
Issue Date: 01 May 2019
DOI: https://doi.org/10.1007/s10044-017-0674-z
