Reuse and plagiarism in Speech and Natural Language Processing publications

International Journal on Digital Libraries

Abstract

The aim of this experiment is to present an easy way to compare fragments of text in order to detect the (supposed) results of copy-and-paste operations between articles in the domain of Natural Language Processing (NLP), including Speech Processing. The search space for the comparisons is a corpus named NLP4NLP, which covers 34 different conferences and journals and gathers a large part of NLP activity over the past 50 years. The study considers the similarity between the papers of each individual event and the complete set of papers in the whole corpus, according to four types of relationship (self-reuse, self-plagiarism, reuse and plagiarism) and in both directions: a paper borrowing a fragment of text from another paper of the corpus (which we will call the source paper), or, in the reverse direction, fragments of text from the source paper being borrowed and inserted into another paper of the corpus. The results show that self-reuse is a fairly common practice, that plagiarism seems to be very unusual, and that both stay within legal and ethical limits.
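As an illustration of what a directional comparison of text fragments can look like, here is a minimal Python sketch. It is not the similarity measure used in the article, which is not described on this page. The function names `word_ngrams` and `borrowed_ratio`, the window length `n`, and the whitespace tokenization are all assumptions made for the example.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 7) -> Set[Tuple[str, ...]]:
    """Return the set of word n-grams of a text (hypothetical helper).

    Tokenization is a plain lowercase whitespace split, an assumption
    chosen for brevity rather than the article's preprocessing.
    """
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def borrowed_ratio(borrowing: str, source: str, n: int = 7) -> float:
    """Share of the borrowing text's n-grams that also occur in the source.

    The measure is asymmetric: swapping the arguments gives the
    comparison in the reverse direction.
    """
    borrowing_grams = word_ngrams(borrowing, n)
    if not borrowing_grams:
        return 0.0  # text shorter than n words: nothing to compare
    source_grams = word_ngrams(source, n)
    return len(borrowing_grams & source_grams) / len(borrowing_grams)
```

Because the score is normalized by the n-grams of the first argument only, calling `borrowed_ratio(paper, source)` and `borrowed_ratio(source, paper)` generally gives different values, which mirrors the two directions of comparison mentioned in the abstract.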

Notes

  1. http://pan.webis.de.

  2. www.nlp4nlp.org.

  3. The total number of papers is 67,937, but in the case of a joint conference, the papers are counted twice. This number reduces to 65,003, if we count duplicate papers only once. Similarly, the number of venues is 577 when all venues are counted, but this number reduces to 558 when the 19 joint conferences are counted only once.

  4. http://aclweb.org/anthology.

  5. www.isca-speech.org/iscaweb/index.php/archive/online-archive.

  6. www.ieee.org/index.html.

  7. http://pdfbox.apache.org.

  8. www.bouncycastle.org/.

  9. http://code.google.com/p/tesseract-ocr.

  10. TagParser is a tool created and distributed by Tagmatica (see www.tagmatica.com).

  11. http://github.com/knmnyn/ParsCit.

  12. http://github.com/kermitt2/grobid.

  13. Also called “n-grams” in some NLP papers.

  14. On this specific problem, for instance, PACLIC and COLING, which use a one-column format, give much better extraction quality than LREC and ACL, which use a two-column format.

  15. It takes 69 h instead of 44 h on a mid-range mono-processor Xeon E3-1270 V2 with 32 GB of RAM.

  16. Space limitations, however, do not allow us to present these results in detail. Furthermore, we do not want to display personal results.

  17. http://en.wikipedia.org/wiki/Right_to_quote.

  18. http://en.wikipedia.org/wiki/Rogeting.

  19. In this regard, the reader will find a certain degree of overlap between this paper and the one we published at LREC 2016 on reuse and plagiarism limited to the LREC papers, regarding the description of the NLP4NLP corpus and of the similarity measure algorithm.

References

  1. Barron-Cedeno, A., Potthast, M., Rosso, P., Stein, B., Eiselt, A.: Corpus and evaluation measures for automatic plagiarism detection. In: Proceedings of LREC 2010, pp. 771–774. Valletta (2010)

  2. Barron-Cedeno, A., Vila, M., Marti, M.A., Rosso, P.: Plagiarism meets paraphrasing: insights for the next generation in automatic plagiarism detection. Comput. Linguist. 39(4), 917–947 (2013)

  3. Bensalem, I., Rosso, P., Chikhi, S.: Intrinsic plagiarism detection using n-gram classes. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing 2014, pp. 1459–1464. Doha (2014)

  4. Berne Convention for the Protection of Literary and Artistic Works (as amended on Sept. 28, 1979). http://www.wipo.int/wipolex/en/treaties/text.jsp?file_id=283693

  5. Bird, S., Dale, R., Dorr, B.J., Gibson, B., Joseph, M.T., Kan, M.-Y., Lee, D., Powley, B., Radev, D.R., Tan, Y.F.: The ACL anthology reference corpus: a reference dataset for bibliographic research in computational linguistics. In: Proceedings of LREC 2008, pp. 1755–1759. Marrakesh (2008)

  6. Calzolari, N., Del Gratta, R., Francopoulo, G., Mariani, J., Rubino, F., Russo, I., Soria, C.: The LRE map. Harmonising community descriptions of resources. In: Proceedings of LREC 2012, pp. 1084–1089. Istanbul (2012)

  7. Ceska, Z., Fox, C.: The influence of text pre-processing on plagiarism detection. In: Proceedings of the Recent Advances in Natural Language Processing Conference 2009, pp. 55–59. Borovets (2009)

  8. Chong, M., Specia, L.: Lexical generalisation for word-level matching in plagiarism detection. In: Proceedings of the Recent Advances in Natural Language Processing Conference 2011, pp. 704–709. Hissar (2011)

  9. Citron, D.T., Ginsparg, P.: Patterns of text reuse in a scientific corpus. Proc. Natl. Acad. Sci. 112(1), 25–30 (2014). doi:10.1073/pnas.1415135111

  10. Clough, P., Gaizauskas, R., Piao, S.S.L., Wilks, Y.: Measuring text reuse. In: Proceedings of ACL’2002, pp. 152–159. Philadelphia (2002)

  11. Clough, P., Gaizauskas, R., Piao, S.S.L.: Building and annotating a corpus for the study of journalistic text reuse. In: Proceedings of LREC 2002, pp. 1678–1691. Las Palmas (2002)

  12. Clough, P., Stevenson, M.: Developing a corpus of plagiarised short answers. Lang. Resour. Eval. 45(1), 5–24 (2011)

  13. Councill, I.G., Giles, C.L., Kan, M.-Y.: ParsCit: an open-source CRF reference string parsing package. In: Proceedings of LREC 2008, pp. 661–667. Marrakesh (2008)

  14. Francopoulo, G.: TagParser: well on the way to ISO-TC37 conformance. In: Proceedings of ICGL (International Conference on Global Interoperability for Language Resources) 2008. Hong Kong (2008)

  15. Francopoulo, G., Marcoul, F., Causse, D., Piparo, G.: Global atlas: proper nouns, from Wikipedia to LMF. In: Francopoulo, G. (ed) LMF Lexical Markup Framework. ISTE Wiley (2013)

  16. Francopoulo, G., Mariani, J., Paroubek, P.: NLP4NLP: the cobbler’s children won’t go unshod. D-Lib Mag. 21(11/12). www.dlib.org/dlib/november15/francopoulo/11francopoulo.html (2015)

  17. Francopoulo, G., Mariani, J., Paroubek, P.: A study of reuse and plagiarism in LREC papers. In: Proceedings of LREC 2016, pp. 72–83. Portorož (2016)

  18. Frey, M., Kern, R.: Efficient table annotation for digital articles. D-Lib Mag. 21(11/12). www.dlib.org/dlib/november15/frey/11frey.html (2015)

  19. Gaizauskas, R., Foster, J., Wilks, Y., Arundel, J., Clough, P., Piao, S.S.L.: The METER corpus: a corpus for analysing journalistic text reuse. In: Proceedings of the Corpus Linguistics Conference 2001, pp. 214–223. Lancaster (2001)

  20. Grove, J.: Sinister buttocks? Roget would blush at the crafty cheek. Middlesex lecturer gets to the bottom of meaningless phrases found while marking essays. Times Higher Education, 7 August (2014). https://www.timeshighereducation.com/news/sinister-buttocks-roget-would-blush-at-the-crafty-cheek/2015027.article

  21. Guo, Y., Che, W., Liu, T., Li, S.: A graph-based method for entity linking. In: Proceedings of the International Joint Conference on NLP 2011, pp. 1010–1018. Chiang Mai (2011)

  22. Gupta, P., Rosso, P.: Text reuse with ACL: (upward) trends. In: Proceedings ACL’2012 Special Workshop on Rediscovering 50 Years of Discoveries, pp. 76–82. Jeju (2012)

  23. Hoad, T.C., Zobel, J.: Methods for identifying versioned and plagiarised documents. J. Am. Soc. Inf. Sci. Technol. 54(3), 203–215 (2003)

  24. HaCohen-Kerner, Y., Tayeb, A., Ben-Dror, N.: Detection of simple plagiarism in computer science papers. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 421–429. Beijing (2010)

  25. Kasprzak, J., Brandejs, M.: Improving the reliability of the plagiarism detection system lab. In: Proceedings of the Uncovering Plagiarism, Authorship and Social Software Misuse (PAN) at CLEF’2010. Padua (2010)

  26. Lyon, C., Malcolm, J., Dickerson, B.: Detecting short passages of similar text in large document collections. In: Proceedings of the Empirical Methods in Natural Language Processing Conference 2001, pp. 118–125. Pittsburgh (2001)

  27. Mariani, J., Paroubek, P., Francopoulo, G., Delaborde, M.: Rediscovering 25 years of discoveries in spoken language processing: a preliminary ISCA archive analysis. In: Proceedings of Interspeech 2013, pp. 4632–4669. Lyon (2013)

  28. Moro, A., Raganato, A., Navigli, R.: Entity linking meets word sense disambiguation: a unified approach. Trans. Assoc. Comput. Linguist. 2, 231–244 (2014)

  29. Nawab, R.M.A., Stevenson, M., Clough, P.: Detecting text reuse with modified and weighted n-grams. In: Proceedings of the First Joint Conference on Lexical and Computational Semantics, pp. 54–58. Montréal (2012)

  30. Potthast, M., Stein, B., Barron-Cedeno, A., Rosso, P.: An evaluation framework for plagiarism detection. In: Proceedings of the 23rd International Conference on Computational Linguistics (COLING), pp. 997–1005. Beijing (2010)

  31. Radev, D.R., Muthukrishnan, P., Qazvinian, V., Abu-Jbara, A.: The ACL anthology network corpus. Lang. Resour. Eval. 47(4), 919–944 (2013)

  32. Samuelson, P.: Self-plagiarism or fair use? Commun. ACM 37(8), 21–25 (1994)

  33. Stamatatos, E., Koppel, M.: Plagiarism and authorship analysis: introduction to the special issue. Lang. Resour. Eval. 45(1), 1–5 (2011)

  34. Stamatatos, E.: Plagiarism detection using stopword n-grams. J. Am. Soc. Inf. Sci. Technol. 62(12), 2512–2527 (2011)

  35. Stein, B., Lipka, N., Prettenhofer, P.: Intrinsic plagiarism analysis. Lang. Resour. Eval. 45(1), 63–82 (2011)

  36. Vilnat, A., Paroubek, P., de la Clergerie, E.V., Francopoulo, G., Guénot, M.-L.: PASSAGE syntactic representation: a minimal common ground for evaluation. In: Proceedings of LREC 2010, pp. 2478–2485. Valletta (2010)

Corresponding author

Correspondence to Joseph Mariani.

Cite this article

Mariani, J., Francopoulo, G. & Paroubek, P. Reuse and plagiarism in Speech and Natural Language Processing publications. Int J Digit Libr 19, 113–126 (2018). https://doi.org/10.1007/s00799-017-0211-0

  • DOI: https://doi.org/10.1007/s00799-017-0211-0
