Abstract
The rapid growth in the number of documents and the increased accessibility of electronic documents have created a need for effective tools for detecting plagiarised text. Plagiarism detection entails two main subtasks: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second subtask. We report on our monolingual plagiarism detection system, which is used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passages, we present a pairwise plagiarism detection algorithm based on a vector space model that takes the proximity of terms into account. The proposed framework is applicable to any language and could also be adapted to the cross-language domain. We evaluate performance in terms of precision, recall, granularity, and Plagdet.
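As a point of reference for the evaluation measures named above, Plagdet is commonly defined in the PAN evaluation framework as the F1 score of precision and recall divided by log2(1 + granularity). The sketch below is a minimal illustration of that formula only. The function name `plagdet` and the choice to take the three inputs as precomputed numbers are assumptions made for this example; it does not reproduce the character-level computation of precision, recall, or granularity, and it is not the authors' evaluation code.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall and granularity into one score.

    Assumes the three inputs were already computed elsewhere.
    Granularity is 1 when every plagiarised passage is reported
    exactly once and grows when a passage is reported in pieces.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        return 0.0  # avoid division by zero when nothing is detected
    f1 = 2 * precision * recall / (precision + recall)
    # log2(1 + 1) == 1, so a granularity of 1 leaves F1 unchanged.
    return f1 / math.log2(1 + granularity)
```

With a granularity of 1 the score equals F1. Reporting each passage in several fragments raises granularity and lowers the score, even if precision and recall stay the same.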
Copyright information
© 2018 Springer International Publishing AG
About this paper
Cite this paper
Ehsan, N., Shakery, A. (2018). Using Local Text Similarity in Pairwise Document Analysis for Monolingual Plagiarism Detection. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_9
DOI: https://doi.org/10.1007/978-3-319-73606-8_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)