Abstract
Existing methods for text plagiarism analysis are mainly based on “chunking”, a process of grouping a text into meaningful units, each of which is encoded as an integer. Together these numbers form a document’s signature or fingerprint. An overlap between two documents’ fingerprints indicates a possibly plagiarized text passage. Most approaches use MD5 hashes to construct fingerprints, which entails two problems: (i) MD5 is computationally expensive, and (ii) a small chunk size must be chosen to identify matching passages, which further increases the effort for fingerprint computation, fingerprint comparison, and fingerprint storage.
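As a minimal sketch of the chunk-based scheme described above, the following Python code hashes overlapping word n-grams with MD5 and compares the resulting fingerprints. The chunking strategy (word 5-grams), the truncation of each digest to 64 bits, and the function names `chunks`, `fingerprint`, and `overlap` are illustrative assumptions, not details taken from the paper.

```python
import hashlib
import re


def chunks(text, size=5):
    """Split text into overlapping word n-grams (an assumed chunking strategy)."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return [" ".join(words)] if words else []
    return [" ".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def fingerprint(text, size=5):
    """Encode each chunk as an integer taken from the first 8 bytes of its MD5 digest."""
    return {
        int.from_bytes(hashlib.md5(c.encode("utf-8")).digest()[:8], "big")
        for c in chunks(text, size)
    }


def overlap(fp_a, fp_b):
    """Fraction of the smaller fingerprint that also occurs in the other one."""
    if not fp_a or not fp_b:
        return 0.0
    return len(fp_a & fp_b) / min(len(fp_a), len(fp_b))


original = "the quick brown fox jumps over the lazy dog near the river bank"
copied = "yesterday the quick brown fox jumps over the lazy dog again"
unrelated = "completely different words appear in this short unrelated sentence here"

print(overlap(fingerprint(original), fingerprint(copied)))     # shared chunks found
print(overlap(fingerprint(original), fingerprint(unrelated)))  # no shared chunks
```

The sketch also makes the cost argument visible: a smaller `size` produces more chunks per document, so more hashes have to be computed, stored, and compared.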
This paper proposes a new class of fingerprints that can be considered an abstraction of the classical vector space model. These fingerprints operationalize the concept of “near similarity” and make it possible to quickly identify candidate passages for plagiarism. Experiments show that a plagiarism analysis based on our fingerprints achieves a speed-up by a factor of five or more without compromising recall.
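The abstract does not specify how these fingerprints are constructed. Purely to illustrate what a coarse abstraction of a term vector might look like, the sketch below maps words to a few classes, quantizes each class's share of the text into a small number of levels, and packs the levels into one integer. The first-letter class scheme, the constants `NUM_CLASSES` and `LEVELS`, the thresholds, and the function names are all hypothetical and should not be read as the authors' method.

```python
import re
from collections import Counter

NUM_CLASSES = 8  # assumed number of term classes (hypothetical)
LEVELS = 3       # assumed number of quantization levels (hypothetical)


def term_class(word):
    """Map a word to a coarse class by its first letter (hypothetical scheme)."""
    return (ord(word[0]) - ord("a")) % NUM_CLASSES


def coarse_fingerprint(text):
    """Pack the quantized class distribution of a text into a single integer."""
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(term_class(w) for w in words)
    total = max(len(words), 1)
    code = 0
    for k in range(NUM_CLASSES):
        # Share of class k relative to a uniform expectation of 1 / NUM_CLASSES.
        ratio = counts[k] / total * NUM_CLASSES
        if ratio < 0.5:
            level = 0  # clearly under-represented
        elif ratio < 1.5:
            level = 1  # roughly as expected
        else:
            level = 2  # clearly over-represented
        code = code * LEVELS + level
    return code


passage = "the quick brown fox jumps over the lazy dog"
reordered = "over the lazy dog the quick brown fox jumps"

# Word order does not affect the class distribution, so both codes are equal.
print(coarse_fingerprint(passage) == coarse_fingerprint(reordered))
```

Under these assumptions a whole passage is reduced to one integer, so candidate passages can be found by looking up equal codes instead of comparing many chunk hashes. A reordered passage still receives the same code, whereas exact chunk hashes over word sequences would change.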
Copyright information
© 2006 Springer Berlin · Heidelberg
About this paper
Cite this paper
Stein, B., zu Eissen, S.M. (2006). Near Similarity Search and Plagiarism Analysis. In: Spiliopoulou, M., Kruse, R., Borgelt, C., Nürnberger, A., Gaul, W. (eds) From Data and Information Analysis to Knowledge Engineering. Studies in Classification, Data Analysis, and Knowledge Organization. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-31314-1_52
DOI: https://doi.org/10.1007/3-540-31314-1_52
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-31313-7
Online ISBN: 978-3-540-31314-4
eBook Packages: Mathematics and Statistics (R0)