Near Similarity Search and Plagiarism Analysis

Stein, Benno; zu Eissen, Sven Meyer

doi:10.1007/3-540-31314-1_52

Part of the book series: Studies in Classification, Data Analysis, and Knowledge Organization (STUDIES CLASS)

Abstract

Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded by an integer number. Together these numbers form a document’s signature or fingerprint. An overlap of two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) it is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which additionally increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage.
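
The chunk-and-hash scheme described above can be illustrated with a minimal sketch. It is not the implementation evaluated in the paper: the chunking strategy (overlapping word 3-grams), the truncation of each MD5 digest to 64 bits, and the function names `chunks`, `fingerprint`, and `overlap` are all assumptions made for illustration.

```python
import hashlib
import re


def chunks(text, size=3):
    """Split text into overlapping word n-grams (an assumed chunking strategy)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def fingerprint(text, size=3):
    """Encode each chunk as an integer taken from the first 8 bytes of its MD5 digest."""
    return {
        int.from_bytes(hashlib.md5(c.encode("utf-8")).digest()[:8], "big")
        for c in chunks(text, size)
    }


def overlap(fp_a, fp_b):
    """Fraction of the smaller fingerprint that also occurs in the other one."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


original = "the quick brown fox jumps over the lazy dog"
suspect = "yesterday the quick brown fox jumps over a fence"
shared = overlap(fingerprint(original), fingerprint(suspect))
```

In this sketch a small `size` finds short matching passages but produces one hash per word position, which is the growth in computation, comparison, and storage effort that the abstract attributes to small chunk sizes.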

This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and enable one to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints achieves a speed-up by a factor of five or more without compromising recall performance.

Author information

Authors and Affiliations

Faculty of Media, Media Systems, Bauhaus University Weimar, 99421, Weimar, Germany
Benno Stein
Faculty of Computer Science, Paderborn University, 33098, Paderborn, Germany
Sven Meyer zu Eissen

Editor information

Editors and Affiliations

Institut für Technische und Betriebliche Informationssysteme, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Myra Spiliopoulou
Institut für Wissens- und Sprachverarbeitung, Otto-von-Guericke-Universität Magdeburg, Universitätsplatz 2, 39106, Magdeburg, Germany
Rudolf Kruse, Christian Borgelt & Andreas Nürnberger
Institut für Entscheidungstheorie und Unternehmensforschung, Universität Karlsruhe (TH), 76128, Karlsruhe
Wolfgang Gaul

About this paper

Cite this paper

Stein, B., zu Eissen, S.M. (2006). Near Similarity Search and Plagiarism Analysis. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_52

DOI: https://doi.org/10.1007/3-540-31314-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics, Mathematics and Statistics (R0)