A machine learning approach to query generation in plagiarism source retrieval

Kong, Lei-lei; Lu, Zhi-mao; Qi, Hao-liang; Han, Zhong-yuan

doi:10.1631/FITEE.1601344

Published: 15 December 2017

Frontiers of Information Technology & Electronic Engineering, Volume 18, pages 1556–1572 (2017)

Lei-lei Kong^1,2 (ORCID: orcid.org/0000-0002-4636-3507), Zhi-mao Lu^1, Hao-liang Qi^2,3 & Zhong-yuan Han^2

Abstract

Plagiarism source retrieval is the core task of plagiarism detection. The standard approach is to extract queries from a suspicious document and use them to retrieve the plagiarism sources, which makes query generation one of the most important steps in source retrieval. Heuristic-based query generation methods are widely used in current research. Each heuristic has its own advantages, and none statistically outperforms the others on all suspicious document segments. Further improvements to heuristic methods rely mainly on expert experience, which makes it difficult to propose new heuristics that overcome the shortcomings of existing ones. This paper proposes a statistical machine learning approach that selects the best queries from a set of candidates. Query generation for source retrieval is formulated as a ranking framework that aims to achieve the optimal source retrieval performance for each suspicious document segment, and the proposed method uses learning to rank to generate queries from the candidates. To our knowledge, this is the first work to apply machine learning methods to query generation for source retrieval. Because no training data exist for learning to rank in this setting, we also construct training samples for source retrieval. We rigorously evaluate various aspects of the proposed method on the publicly available PAN source retrieval corpus. The experimental results show that the proposed machine-learning-based query generation method yields statistically significant improvements over the established baselines in source retrieval effectiveness.
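
To make the ranking formulation concrete, the following is a minimal sketch of pairwise learning to rank over candidate queries. It is not the authors' implementation: the abstract does not specify the ranking model or the features, so the linear scorer, the margin-perceptron update, the two toy features, and the function names `train_pairwise_ranker` and `select_query` are all assumptions made for illustration.

```python
from itertools import combinations


def dot(w, x):
    """Inner product of two equal-length lists of floats."""
    return sum(wi * xi for wi, xi in zip(w, x))


def train_pairwise_ranker(segments, epochs=50, lr=0.1):
    """Learn a linear scoring function from per-segment candidate queries.

    segments: one list per suspicious document segment; each entry is
    (feature_vector, retrieval_score) for one candidate query.
    Both the features and the scores are hypothetical toy values here.
    """
    dim = len(segments[0][0][0])
    w = [0.0] * dim
    for _ in range(epochs):
        for candidates in segments:
            # Compare candidates only within the same segment.
            for (xa, ya), (xb, yb) in combinations(candidates, 2):
                if ya == yb:
                    continue  # ties carry no preference information
                better, worse = (xa, xb) if ya > yb else (xb, xa)
                diff = [p - q for p, q in zip(better, worse)]
                # Margin-perceptron update: move w only when the better
                # candidate does not outscore the worse one by at least 1.
                if dot(w, diff) < 1.0:
                    w = [wi + lr * di for wi, di in zip(w, diff)]
    return w


def select_query(w, candidates):
    """Return the index of the candidate with the highest learned score."""
    scores = [dot(w, x) for x in candidates]
    return max(range(len(candidates)), key=scores.__getitem__)


# Toy training data (assumed): two segments, two candidate queries each.
# Hypothetical features: [mean term weight, fraction of stop words].
training_segments = [
    [([0.9, 0.2], 1.0), ([0.3, 0.8], 0.0)],
    [([0.8, 0.1], 1.0), ([0.2, 0.5], 0.0)],
]
weights = train_pairwise_ranker(training_segments)

# Pick the best candidate for a new segment.
new_candidates = [[0.1, 0.9], [0.7, 0.3]]
best = select_query(weights, new_candidates)
```

In this sketch the training labels stand in for the source retrieval performance each candidate query achieves on its segment, which is the kind of training sample the abstract says has to be constructed.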

Author information

Authors and Affiliations

College of Information and Communication Engineering, Harbin Engineering University, Harbin, 150001, China
Lei-lei Kong & Zhi-mao Lu
School of Computer Science and Technology, Heilongjiang Institute of Technology, Harbin, 150050, China
Lei-lei Kong, Hao-liang Qi & Zhong-yuan Han
State Key Laboratory of Digital Publishing Technology, Beijing 100871, China
Hao-liang Qi

Corresponding authors

Correspondence to Lei-lei Kong or Hao-liang Qi.

Additional information

Project supported by the National Social Science Foundation of China (No. 14CTQ032) and the National Natural Science Foundation of China (No. 61370170)

About this article

Cite this article

Kong, Ll., Lu, Zm., Qi, Hl. et al. A machine learning approach to query generation in plagiarism source retrieval. Frontiers Inf Technol Electronic Eng 18, 1556–1572 (2017). https://doi.org/10.1631/FITEE.1601344

Received: 17 June 2016
Accepted: 12 December 2016
Published: 15 December 2017
Issue Date: October 2017
DOI: https://doi.org/10.1631/FITEE.1601344

Keywords

CLC number

TP391.3
