Automatic Detection of Local Reuse

  • Conference paper
Sustaining TEL: From Innovation to Learning and Practice (EC-TEL 2010)

Part of the book series: Lecture Notes in Computer Science (LNPSE, volume 6383)

Abstract

Local reuse detection is a prerequisite for a multitude of tasks ranging from document management and information retrieval to web search and plagiarism detection. Its results can support authors in creating new learning resources, and learners in finding existing ones, by providing accurate suggestions for related documents. While the detection of local text reuse, i.e., reuse of parts of documents, is covered by various approaches, reuse detection for object-based documents has hardly been considered so far. In this paper we propose a new fingerprinting technique for local reuse detection in both text-based and object-based documents, which is based on the contiguity of documents. This additional information, which is generally disregarded by existing approaches, allows the creation of shorter and more flexible fingerprints. Evaluations performed on different corpora have shown that the technique performs better than existing approaches while requiring significantly less storage.
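
For orientation, the general idea of fingerprint-based local reuse detection can be sketched as follows. This is a minimal, generic shingle-hashing baseline and not the contiguity-based technique proposed in the paper, whose details the abstract does not give. The function names, the word-shingle length `k`, the sampling modulus `p`, and the choice of SHA-256 are all assumptions made for illustration.

```python
import hashlib


def shingles(text, k=3):
    """Return the overlapping k-word shingles of a text (hypothetical helper)."""
    words = text.lower().split()
    return [" ".join(words[i:i + k]) for i in range(len(words) - k + 1)]


def fingerprint(text, k=3, p=1):
    """Hash every k-word shingle and keep only hashes divisible by p.

    p=1 keeps every hash; a larger p yields a smaller fingerprint
    at the cost of possibly missing short reused passages.
    """
    hashes = set()
    for s in shingles(text, k):
        digest = hashlib.sha256(s.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # truncate to 64 bits
        if h % p == 0:
            hashes.add(h)
    return hashes


def containment(fp_a, fp_b):
    """Fraction of A's fingerprint that also occurs in B's fingerprint."""
    if not fp_a:
        return 0.0
    return len(fp_a & fp_b) / len(fp_a)


source = "the quick brown fox jumps over the lazy dog"
reuser = "a report says the quick brown fox jumps over a fence"
fp_source = fingerprint(source)
fp_reuser = fingerprint(reuser)
print(containment(fp_source, fp_reuser))
```

In this toy example, four of the seven three-word shingles of `source` reappear in `reuser`, so the containment score flags a partial (local) overlap even though the documents as a whole differ.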

Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Mittelbach, A., Lehmann, L., Rensing, C., Steinmetz, R. (2010). Automatic Detection of Local Reuse. In: Wolpers, M., Kirschner, P.A., Scheffel, M., Lindstaedt, S., Dimitrova, V. (eds) Sustaining TEL: From Innovation to Learning and Practice. EC-TEL 2010. Lecture Notes in Computer Science, vol 6383. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-16020-2_16

  • DOI: https://doi.org/10.1007/978-3-642-16020-2_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-16019-6

  • Online ISBN: 978-3-642-16020-2

  • eBook Packages: Computer Science, Computer Science (R0)