Abstract
Plagiarism detection is the task of identifying plagiarized fragments in textual documents. The availability of digital documents in online libraries makes plagiarism easier to commit, but also easier to detect with automatic plagiarism detection systems. Plagiarism detection corpora play an important role in evaluating and tuning such systems, and large-scale corpora covering a wide variety of plagiarism cases are needed to evaluate different detection methods in different languages. Despite their importance, few corpora have been developed for low-resource languages. In this paper, we propose HAMTA, a Persian plagiarism detection corpus. To simulate real cases of plagiarism, the corpus is compiled from manually paraphrased text: a crowdsourcing platform is developed, and crowd workers are asked to paraphrase fragments of text. Moreover, artificial methods are used to scale up the corpus by automatically generating cases of text re-use. The evaluation results indicate a high correlation between the proposed corpus and the state-of-the-art PAN English plagiarism detection corpus.







Cite this article
Asghari, H., Fatemi, O., Mohtaj, S. et al. A crowdsourcing approach to construct mono-lingual plagiarism detection corpus. Int J Digit Libr 22, 49–61 (2021). https://doi.org/10.1007/s00799-020-00294-4
DOI: https://doi.org/10.1007/s00799-020-00294-4