A document comparison scheme for secure duplicate detection

  • Regular contribution
International Journal on Digital Libraries

Abstract

The ever-growing volume of textual information from various sources has fostered the development of digital libraries, making digital content readily accessible but also easy for malicious users to plagiarize, which raises security problems. In this paper, we introduce a duplicate detection scheme that determines, with high accuracy, the degree to which one document is similar to another. Our pairwise document comparison scheme detects resemblance between the content of documents by considering document chunks, which represent the contexts of words selected from the text. The resulting duplicate detection technique offers a good level of security in the protection of intellectual property while improving both the availability of the data stored in the digital library and the correctness of search results. Finally, the paper addresses efficiency and scalability by introducing new data reduction techniques.
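The abstract does not spell out the comparison algorithm, so the following is only an illustrative sketch of the general idea of chunk-based pairwise comparison, not the authors' method. It assumes that chunks are sentence-like units, that two chunks are compared with a word-level edit distance, and that a chunk counts as matched when its similarity reaches a threshold of 0.7. All function names, the chunking rule, and the threshold are hypothetical choices made for this example.

```python
import re


def chunks(text):
    """Split text into sentence-like chunks of lowercase word tokens.

    Assumption: a chunk is a sentence; the paper may define chunks differently.
    """
    parts = re.split(r"[.!?]+", text)
    tokenized = [re.findall(r"\w+", part.lower()) for part in parts]
    return [tokens for tokens in tokenized if tokens]


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def chunk_similarity(a, b):
    """Return 1.0 for identical chunks and 0.0 for entirely different ones."""
    return 1.0 - edit_distance(a, b) / max(len(a), len(b))


def resemblance(doc_a, doc_b, threshold=0.7):
    """Fraction of chunks in doc_a that have a close match in doc_b.

    Assumption: the 0.7 threshold is arbitrary and chosen for illustration.
    """
    chunks_a, chunks_b = chunks(doc_a), chunks(doc_b)
    if not chunks_a or not chunks_b:
        return 0.0
    matched = sum(
        1 for ca in chunks_a
        if max(chunk_similarity(ca, cb) for cb in chunks_b) >= threshold
    )
    return matched / len(chunks_a)
```

Note that `resemblance(doc_a, doc_b)` is asymmetric: it measures how much of `doc_a` is covered by `doc_b`, which is one way to express the degree to which one document is similar to another. Comparing every chunk pair is quadratic in the number of chunks, which hints at why data reduction matters for scalability.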



Author information

Corresponding author

Correspondence to Federica Mandreoli.

About this article

Cite this article

Mandreoli, F., Martoglia, R. & Tiberio, P. A document comparison scheme for secure duplicate detection. Int J Digit Libr 4, 223–244 (2004). https://doi.org/10.1007/s00799-004-0079-7

  • DOI: https://doi.org/10.1007/s00799-004-0079-7
