Using Local Text Similarity in Pairwise Document Analysis for Monolingual Plagiarism Detection

Ehsan, Nava; Shakery, Azadeh

doi:10.1007/978-3-319-73606-8_9


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Included in the following conference series:

Forum for Information Retrieval Evaluation

Abstract

The rapid growth in the number of documents and the increased accessibility of electronic documents have created a need for effective tools for detecting plagiarised texts. The task of plagiarism detection entails two main subtasks: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second subtask. We report our monolingual plagiarism detection system, which was used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passages, this paper presents a pairwise plagiarism detection algorithm based on a vector space model that considers the proximity of terms. The proposed framework is applicable to any language, and it could also be adapted to the cross-language domain. We evaluate its performance in terms of precision, recall, granularity and Plagdet metrics.
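As an illustration of how the four evaluation measures named above relate, the following is a minimal sketch of the Plagdet score as it is commonly defined in the PAN evaluation framework: the harmonic mean of precision and recall, divided by the base-2 logarithm of one plus granularity. The function name `plagdet` and the input values are assumptions chosen for illustration; they are not taken from the paper, and the precision, recall and granularity values are assumed to have been computed elsewhere.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall and granularity into a single score.

    Assumes the usual PAN definition: F1 / log2(1 + granularity).
    Granularity is at least 1; a value of 1 means every plagiarised
    passage was reported as exactly one detection.
    """
    if granularity < 1:
        raise ValueError("granularity must be at least 1")
    if precision + recall == 0:
        # F1 is defined as 0 when both precision and recall are 0.
        return 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Hypothetical inputs for illustration only, not results from the paper.
score = plagdet(precision=0.8, recall=0.8, granularity=1.0)
print(round(score, 4))
```

With granularity equal to 1 the denominator is 1, so the score reduces to the plain F1 measure; reporting the same passage in several fragments raises granularity and lowers the score.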

Author information

Authors and Affiliations

School of Electrical and Computer Engineering, College of Engineering, University of Tehran, Tehran, Iran
Nava Ehsan & Azadeh Shakery

Corresponding author

Correspondence to Nava Ehsan.

Editor information

Editors and Affiliations

DAIICT, Gujarat, India
Prasenjit Majumder
Indian Statistical Institute, Kolkata, India
Mandar Mitra
DAIICT, Gujarat, India
Parth Mehta
DAIICT, Gujarat, India
Jainisha Sankhavara

Cite this paper

Ehsan, N., Shakery, A. (2018). Using Local Text Similarity in Pairwise Document Analysis for Monolingual Plagiarism Detection. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_9

DOI: https://doi.org/10.1007/978-3-319-73606-8_9
Published: 04 January 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-73605-1
Online ISBN: 978-3-319-73606-8
eBook Packages: Computer Science, Computer Science (R0)
