
Exploiting Sentence-Level Features for Near-Duplicate Document Detection

  • Conference paper
Information Retrieval Technology (AIRS 2009)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 5839)

Abstract

Digital documents are easy to copy, so effectively detecting possible near-duplicate copies is critical in Web search. Conventional copy detection approaches, such as document fingerprinting and bag-of-words similarity, target different levels of granularity in document features, from word n-grams to whole documents. In this paper, we focus on the mutual-inclusive type of near-duplicates, in which documents are similar only because they partially overlap. We propose using a simple and compact sentence-level feature, the sequence of sentence lengths, for near-duplicate copy detection. Various configurations of sentence-level and word-level algorithms are evaluated. The experimental results show that the sentence-level algorithms achieved higher efficiency with comparable precision and recall rates.
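To make the sentence-length feature concrete, the following is a minimal Python sketch, not the authors' algorithm. It assumes that sentences end at `.`, `!` or `?`, that length is counted in whitespace-separated words, and that partial overlap is measured by the longest contiguous run of matching lengths; the abstract specifies none of these choices, and the function names `sentence_lengths`, `longest_common_run` and `overlap_score` are hypothetical.

```python
import re
from typing import List


def sentence_lengths(text: str) -> List[int]:
    """Return the number of words in each sentence of `text`.

    Assumption: sentences end at '.', '!' or '?', and length is counted
    in whitespace-separated words. The paper may define both differently.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text)]
    return [len(s.split()) for s in sentences if s]


def longest_common_run(a: List[int], b: List[int]) -> int:
    """Length of the longest contiguous run shared by `a` and `b`.

    Standard dynamic programming for the longest common substring,
    keeping only the previous row to save memory.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def overlap_score(doc_a: str, doc_b: str) -> float:
    """Shared run length divided by the shorter document's sentence count."""
    la, lb = sentence_lengths(doc_a), sentence_lengths(doc_b)
    if not la or not lb:
        return 0.0
    return longest_common_run(la, lb) / min(len(la), len(lb))
```

Each document is reduced to one small integer per sentence, which is what makes the feature compact. Different sentences can share the same length, so a practical system would likely need a minimum run length or a second verification step before declaring two documents near-duplicates.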

Copyright information

© 2009 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Wang, JH., Chang, HC. (2009). Exploiting Sentence-Level Features for Near-Duplicate Document Detection. In: Lee, G.G., et al. Information Retrieval Technology. AIRS 2009. Lecture Notes in Computer Science, vol 5839. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04769-5_18

  • DOI: https://doi.org/10.1007/978-3-642-04769-5_18

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-04768-8

  • Online ISBN: 978-3-642-04769-5

  • eBook Packages: Computer Science, Computer Science (R0)
