Abstract
Local reuse detection is a prerequisite for a multitude of tasks ranging from document management and information retrieval to web search and plagiarism detection. Its results can be used to support authors in creating new learning resources, or learners in finding existing ones, by providing accurate suggestions for related documents. While the detection of local text reuse, i.e. the reuse of parts of documents, is covered by various approaches, reuse detection for object-based documents has hardly been considered so far. In this paper we propose a new fingerprinting technique for local reuse detection in both text-based and object-based documents, which is based on the contiguity of documents. This additional information, which is generally disregarded by existing approaches, allows the creation of shorter and more flexible fingerprints. Evaluations performed on different corpora show that the technique performs better than existing approaches while requiring significantly less storage.
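To make the notion of fingerprint-based local reuse detection concrete, the following is a minimal sketch of a generic word k-gram (shingle) fingerprint with a containment score. It is not the contiguity-based technique proposed in the paper, which the abstract does not specify; the function names `fingerprint` and `containment`, the shingle length `k`, and the choice of a truncated MD5 hash are all assumptions made for illustration.

```python
import hashlib


def fingerprint(text, k=3):
    """Return the set of hashed word k-grams (shingles) of a text.

    Generic baseline sketch, NOT the paper's method: every shingle is kept.
    Practical schemes usually keep only a selected subset of the hashes
    to reduce storage.
    """
    words = text.lower().split()
    hashes = set()
    for i in range(len(words) - k + 1):
        shingle = " ".join(words[i:i + k])
        digest = hashlib.md5(shingle.encode("utf-8")).digest()
        # Use the first 8 bytes of the digest as a 64-bit integer hash.
        hashes.add(int.from_bytes(digest[:8], "big"))
    return hashes


def containment(fp_part, fp_whole):
    """Fraction of fp_part's hashes that also occur in fp_whole."""
    if not fp_part:
        return 0.0
    return len(fp_part & fp_whole) / len(fp_part)


if __name__ == "__main__":
    source = "the quick brown fox jumps over the lazy dog"
    reuser = "yesterday the quick brown fox jumps over a fence"
    score = containment(fingerprint(source), fingerprint(reuser))
    print(f"containment of source in reuser: {score:.2f}")
```

A high containment score for only part of a document is what signals local reuse, as opposed to near-duplicate detection, which compares documents as a whole. Storing every shingle hash, as this sketch does, is the storage cost that shorter fingerprints aim to reduce.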
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Mittelbach, A., Lehmann, L., Rensing, C., Steinmetz, R. (2010). Automatic Detection of Local Reuse. In: Wolpers, M., Kirschner, P.A., Scheffel, M., Lindstaedt, S., Dimitrova, V. (eds) Sustaining TEL: From Innovation to Learning and Practice. EC-TEL 2010. Lecture Notes in Computer Science, vol 6383. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16020-2_16
DOI: https://doi.org/10.1007/978-3-642-16020-2_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-16019-6
Online ISBN: 978-3-642-16020-2
eBook Packages: Computer Science, Computer Science (R0)