Finding Plagiarism Based on Common Semantic Sequence Model

  • Conference paper
Advances in Web-Age Information Management (WAIM 2004)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 3129)

Abstract

Finding document features is one of the key problems in Text Mining. The string matching model and the global word frequency model are two common models. However, the former can hardly resist rewording noise, whereas the latter cannot capture document details. We present the Common Semantic Sequence Model (CSSM) and apply it to Document Copy Detection. CSSM combines the ideas of the two models above, making a trade-off between a document's global features and its local features. CSSM calculates the proportion of common words between two documents' semantic sequences to produce a plagiarism score. A semantic sequence is a continual word sequence that remains after the low-density words are omitted. With the collection of the two documents' semantic sequences, we can detect plagiarism at a fine granularity. We test CSSM on several common copy types. The results show that CSSM is excellent at detecting non-rewording plagiarism and remains valid even if documents are reworded to some extent.
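
The abstract does not give the density measure or the scoring formula, so the following Python sketch is only one possible reading of the idea, not the authors' method. It assumes that a word is "low-density" when it occurs fewer than `min_count` times in its document, that low-density words break the text into sequences, and that the score is a Dice-style proportion of shared words. The names `semantic_sequences`, `common_word_proportion`, `plagiarism_score`, `min_count`, and `min_len` are all hypothetical.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def semantic_sequences(text, min_count=2, min_len=2):
    """Return runs of consecutive dense words from a document.

    Assumption: a word is low-density if it occurs fewer than
    min_count times in the document. The paper's real density
    measure is not stated in the abstract.
    """
    words = tokenize(text)
    counts = Counter(words)
    sequences, current = [], []
    for word in words:
        if counts[word] >= min_count:
            current.append(word)
            continue
        # A low-density word is omitted and closes the current run.
        if len(current) >= min_len:
            sequences.append(current)
        current = []
    if len(current) >= min_len:
        sequences.append(current)
    return sequences


def common_word_proportion(seq_a, seq_b):
    """Dice-style share of words common to two sequences (assumed formula)."""
    shared = sum((Counter(seq_a) & Counter(seq_b)).values())
    return 2 * shared / (len(seq_a) + len(seq_b))


def plagiarism_score(doc_a, doc_b, min_count=2, min_len=2):
    """Best common-word proportion over all pairs of semantic sequences."""
    seqs_a = semantic_sequences(doc_a, min_count, min_len)
    seqs_b = semantic_sequences(doc_b, min_count, min_len)
    if not seqs_a or not seqs_b:
        return 0.0
    return max(common_word_proportion(a, b) for a in seqs_a for b in seqs_b)
```

Because the score is taken per pair of sequences rather than over whole documents, a copied passage can stand out even when the rest of the two documents differs, which is the fine-granularity behaviour the abstract describes. Taking the maximum over pairs is a simplification chosen for this sketch.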

Copyright information

© 2004 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Bao, JP., Shen, JY., Liu, XD., Liu, HY., Zhang, XD. (2004). Finding Plagiarism Based on Common Semantic Sequence Model. In: Li, Q., Wang, G., Feng, L. (eds) Advances in Web-Age Information Management. WAIM 2004. Lecture Notes in Computer Science, vol 3129. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-27772-9_66

  • DOI: https://doi.org/10.1007/978-3-540-27772-9_66

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-22418-1

  • Online ISBN: 978-3-540-27772-9

  • eBook Packages: Springer Book Archive
