Hierarchical and Pairwise Document Embedding for Plagiarism Detection

Zhang, Ruitong; Liu, Lianzhong; Zhang, Jiaofu; Huang, Zihang; Yang, Caiwei; Zhao, Liangxuan; Xu, Tongge

doi:10.1007/978-3-030-65390-3_12

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12447)

Included in the following conference series:

International Conference on Advanced Data Mining and Applications

Abstract

The rapid development of the Internet, and especially the spread of search engines and machine translation, has made it easier to copy text. Most existing text plagiarism detection methods cannot cope with the growing number of plagiarism sources or with increasingly ambiguous plagiarized text. In this paper, we focus on large-scale text deduplication and propose a multi-level distributed text computing model that speeds up checking through multi-level latent semantic analysis and incorporates BERT to judge plagiarized text more accurately. To further verify the model, we also use recent fuzzy plagiarism techniques to construct a three-level data set. The experimental results show that our model performs well as both the volume of plagiarism data and the ambiguity of the plagiarism increase.
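
The abstract describes a coarse-to-fine design: a cheap stage narrows down the candidate sources, and a more expensive pairwise stage judges each remaining candidate. The following is a minimal sketch of that general pattern only, not the authors' model. To stay dependency-free it swaps in TF-IDF cosine similarity where the paper uses multi-level latent semantic analysis, and a token-level `difflib.SequenceMatcher` ratio where the paper uses BERT. The function names, `top_k`, and `threshold` are all hypothetical choices made for illustration.

```python
import math
from collections import Counter
from difflib import SequenceMatcher


def tokenize(text):
    """Lowercase whitespace tokenization (a deliberately simple assumption)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight dict) per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for toks in tokenized for term in set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Smoothed IDF so that no weight is ever zero or undefined.
        vectors.append(
            {t: c * (math.log((1 + n) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def detect(query, sources, top_k=2, threshold=0.6):
    """Two-stage check: coarse candidate filtering, then pairwise scoring.

    Stage 1 stands in for the paper's latent semantic analysis stage,
    stage 2 stands in for its BERT-based judgment. Both are placeholders.
    """
    vectors = tfidf_vectors(sources + [query])
    query_vec = vectors[-1]

    # Stage 1: keep only the top_k sources most similar to the query.
    ranked = sorted(
        range(len(sources)),
        key=lambda i: cosine(query_vec, vectors[i]),
        reverse=True,
    )
    candidates = ranked[:top_k]

    # Stage 2: score each surviving (query, source) pair in more detail.
    query_tokens = tokenize(query)
    results = []
    for i in candidates:
        score = SequenceMatcher(None, query_tokens, tokenize(sources[i])).ratio()
        if score >= threshold:
            results.append((i, score))
    return sorted(results, key=lambda r: r[1], reverse=True)


if __name__ == "__main__":
    sources = [
        "the quick brown fox jumps over the lazy dog",
        "stock markets fell sharply on monday",
        "a recipe for tomato soup with basil",
    ]
    query = "the quick brown fox leaps over the lazy dog"
    for index, score in detect(query, sources):
        print(f"source {index} flagged with pairwise score {score:.2f}")
```

The point of the split is cost: the pairwise scorer runs only on `top_k` candidates rather than on every source, which is what keeps such a pipeline tractable as the number of sources grows.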

Acknowledgments

This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFC0830804.

Author information

Authors and Affiliations

School of Cyber Science and Technology, Beihang University, Beijing, China
Ruitong Zhang, Lianzhong Liu, Jiaofu Zhang, Zihang Huang, Caiwei Yang, Liangxuan Zhao & Tongge Xu

Corresponding author

Correspondence to Tongge Xu.

Editor information

Editors and Affiliations

Northeastern University, Shenyang, China
Xiaochun Yang
School of Data and Computer Science, Guangzhou, China
Chang-Dong Wang
Griffith University, Southport, QLD, Australia
Md. Saiful Islam
School of Computer Science and Technology, Shenzhen, China
Zheng Zhang

About this paper

Cite this paper

Zhang, R. et al. (2020). Hierarchical and Pairwise Document Embedding for Plagiarism Detection. In: Yang, X., Wang, CD., Islam, M.S., Zhang, Z. (eds) Advanced Data Mining and Applications. ADMA 2020. Lecture Notes in Computer Science, vol 12447. Springer, Cham. https://doi.org/10.1007/978-3-030-65390-3_12

DOI: https://doi.org/10.1007/978-3-030-65390-3_12
Published: 06 January 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65389-7
Online ISBN: 978-3-030-65390-3
eBook Packages: Computer Science, Computer Science (R0)
