Abstract
The rapid development of the Internet, and in particular the widespread use of search engines and machine translation, has made it easier to copy text. Most existing plagiarism detection methods cannot cope with the growing number of plagiarism sources or with increasingly ambiguous plagiarized text. In this paper, we address the task of large-scale text deduplication and propose a multi-level distributed text computing model. The model speeds up checking through multi-level latent semantic analysis and uses BERT to judge plagiarized text more accurately. To further verify the model, we also construct a three-level dataset using recent fuzzy plagiarism techniques. The experimental results show that our model performs well as both the amount of plagiarism data and the ambiguity of the plagiarized text increase.
Acknowledgments
This work is supported by the National Key Research and Development Program of China under Grant No. 2018YFC0830804.
Copyright information
© 2020 Springer Nature Switzerland AG
About this paper
Cite this paper
Zhang, R. et al. (2020). Hierarchical and Pairwise Document Embedding for Plagiarism Detection. In: Yang, X., Wang, CD., Islam, M.S., Zhang, Z. (eds) Advanced Data Mining and Applications. ADMA 2020. Lecture Notes in Computer Science, vol 12447. Springer, Cham. https://doi.org/10.1007/978-3-030-65390-3_12
DOI: https://doi.org/10.1007/978-3-030-65390-3_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-65389-7
Online ISBN: 978-3-030-65390-3
eBook Packages: Computer Science, Computer Science (R0)