Finding Plagiarism Based on Common Semantic Sequence Model

Bao, Jun-Peng; Shen, Jun-Yi; Liu, Xiao-Dong; Liu, Hai-Yan; Zhang, Xiao-Di

doi:10.1007/978-3-540-27772-9_66

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Finding document features is one of the key problems in text mining. The string matching model and the global word frequency model are two common approaches, but the former can hardly resist rewording noise, whereas the latter cannot capture document details. We present the Common Semantic Sequence Model (CSSM) and apply it to document copy detection. CSSM combines the ideas of the two models above and makes a trade-off between a document's global features and its local features. It computes a plagiarism score from the proportion of common words between the semantic sequences of two documents. A semantic sequence is a continuous word sequence that remains after low-density words are omitted. Using the collection of the two documents' semantic sequences, we can detect plagiarism at a fine granularity. We test CSSM on several common copy types. The results show that CSSM is excellent at detecting non-rewording plagiarism and remains valid even when documents are reworded to some extent.
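
The idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define word density, how sequences are delimited, or the exact scoring formula, so everything below is an assumption made for illustration. In particular, a word is treated as low-density when it occurs fewer than a hypothetical `min_count` times in the document, a new sequence starts when more than a hypothetical `max_gap` positions separate two kept words, and the score is a multiset overlap ratio between the two collections of sequences.

```python
import re
from collections import Counter


def semantic_sequences(text, min_count=2, max_gap=3):
    """Split a document into sequences of 'dense' words.

    Assumption: a word is low-density when it occurs fewer than
    `min_count` times in the document. The paper's own density
    measure is not given in the abstract.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    counts = Counter(words)
    sequences, current, last_pos = [], [], None
    for pos, word in enumerate(words):
        if counts[word] < min_count:
            continue  # omit low-density words
        # Assumption: start a new sequence when the distance to the
        # previously kept word exceeds max_gap positions.
        if last_pos is not None and pos - last_pos > max_gap:
            sequences.append(current)
            current = []
        current.append(word)
        last_pos = pos
    if current:
        sequences.append(current)
    return sequences


def common_word_proportion(seqs_a, seqs_b):
    """Share of word occurrences common to both collections, in [0, 1]."""
    bag_a = Counter(w for seq in seqs_a for w in seq)
    bag_b = Counter(w for seq in seqs_b for w in seq)
    total = sum(bag_a.values()) + sum(bag_b.values())
    if total == 0:
        return 0.0
    common = sum((bag_a & bag_b).values())  # multiset intersection
    return 2.0 * common / total


def plagiarism_score(doc_a, doc_b, min_count=2, max_gap=3):
    """Illustrative score: higher means more shared dense words."""
    return common_word_proportion(
        semantic_sequences(doc_a, min_count, max_gap),
        semantic_sequences(doc_b, min_count, max_gap),
    )
```

Under these assumptions, two identical documents that contain at least one repeated word score 1.0, and documents with no dense words in common score 0.0. A finer-grained variant would compare sequences pairwise rather than pooling them, which is closer to the fine-granularity detection the abstract mentions.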

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Xi’an Jiaotong University, Xi’an, 710049, People’s Republic of China
Jun-Peng Bao, Jun-Yi Shen, Xiao-Dong Liu, Hai-Yan Liu & Xiao-Di Zhang

Editor information

Editors and Affiliations

Department of Computer Science, City University of Hong Kong, 83 Tat Chee Avenue, Kowloon, Hong Kong, China
Qing Li
Shenyang Liaoning, Northeastern University, 110004, China
Guoren Wang
Dept. of Computer Science & Technology, Tsinghua University, Beijing, China
Ling Feng

About this paper

Cite this paper

Bao, JP., Shen, JY., Liu, XD., Liu, HY., Zhang, XD. (2004). Finding Plagiarism Based on Common Semantic Sequence Model. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_66

DOI: https://doi.org/10.1007/978-3-540-27772-9_66
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-22418-1
Online ISBN: 978-3-540-27772-9
eBook Packages: Springer Book Archive
