A crowdsourcing approach to construct mono-lingual plagiarism detection corpus

International Journal on Digital Libraries

Abstract

Plagiarism detection deals with detecting plagiarized fragments among textual documents. The availability of digital documents in online libraries makes plagiarism easier, but it also makes plagiarism easier to detect with automatic plagiarism detection systems. Large-scale plagiarism corpora with a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages, and such corpora play an important role in evaluating and tuning plagiarism detection systems. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, manually paraphrased texts are used to compile the corpus. To obtain the manual plagiarism cases, a crowdsourcing platform is developed and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the proposed corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the PAN state-of-the-art English plagiarism detection corpus.

Notes

  1. http://pan.webis.de/

  2. http://ictrc.ac.ir/plagdet

  3. http://ictrc.ac.ir/plagdet

Author information

Corresponding author

Correspondence to Omid Fatemi.

Cite this article

Asghari, H., Fatemi, O., Mohtaj, S. et al. A crowdsourcing approach to construct mono-lingual plagiarism detection corpus. Int J Digit Libr 22, 49–61 (2021). https://doi.org/10.1007/s00799-020-00294-4

  • DOI: https://doi.org/10.1007/s00799-020-00294-4
