Using Local Text Similarity in Pairwise Document Analysis for Monolingual Plagiarism Detection

  • Conference paper
Text Processing (FIRE 2016)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 10478)

Abstract

The rapid growth in the number of documents and the increased accessibility of electronic texts create a need for effective tools that detect plagiarised text. Plagiarism detection entails two main subtasks: suspicious candidate retrieval and pairwise document similarity analysis, also called detailed analysis. In this paper we focus on the second subtask. We report on our monolingual plagiarism detection system, which was used to process the Persian plagiarism corpus for the task of pairwise document similarity. To retrieve plagiarised passages, we present a pairwise plagiarism detection algorithm based on a vector space model that takes the proximity of terms into account. The proposed framework is applicable to any language and could also be adapted to the cross-language domain. We evaluate performance in terms of precision, recall, granularity and PlagDet.
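For readers unfamiliar with the evaluation measures named in the abstract, the following minimal sketch shows how PlagDet is commonly defined in the PAN evaluation framework: the harmonic mean of precision and recall, divided by log2(1 + granularity). The function name `plagdet` and the input values are illustrative assumptions; they are not taken from the paper, and the sketch expects precision, recall and granularity to have been computed beforehand.

```python
import math


def plagdet(precision: float, recall: float, granularity: float) -> float:
    """Combine precision, recall and granularity into a single PlagDet score.

    Assumes the commonly used PAN definition: F1 / log2(1 + granularity).
    Granularity is at least 1; a value of 1 means every plagiarised passage
    was reported as exactly one detection.
    """
    if granularity < 1:
        raise ValueError("granularity must be >= 1")
    if precision + recall == 0:
        return 0.0  # no correct detections at all
    f1 = 2 * precision * recall / (precision + recall)
    return f1 / math.log2(1 + granularity)


# Hypothetical values for illustration only, not results from the paper.
score_single = plagdet(0.8, 0.8, 1.0)      # one detection per passage
score_fragmented = plagdet(0.8, 0.8, 3.0)  # passages reported in pieces
```

With a granularity of 1 the score equals F1, while fragmented detections (granularity above 1) lower the score even when precision and recall are unchanged.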

Author information

Corresponding author

Correspondence to Nava Ehsan.

Copyright information

© 2018 Springer International Publishing AG

About this paper

Cite this paper

Ehsan, N., Shakery, A. (2018). Using Local Text Similarity in Pairwise Document Analysis for Monolingual Plagiarism Detection. In: Majumder, P., Mitra, M., Mehta, P., Sankhavara, J. (eds) Text Processing. FIRE 2016. Lecture Notes in Computer Science, vol 10478. Springer, Cham. https://doi.org/10.1007/978-3-319-73606-8_9

  • DOI: https://doi.org/10.1007/978-3-319-73606-8_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-73605-1

  • Online ISBN: 978-3-319-73606-8

  • eBook Packages: Computer Science, Computer Science (R0)
