Detecting Near-Duplicate Documents Using Sentence Level Features

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9262)

Abstract

In Web search engines, digital libraries, and other online information services, duplicate and near-duplicate documents can cause severe problems if left unaddressed. Typical problems include wasted storage space, longer indexing time, and redundant results presented to users. In this paper, we propose a method for detecting near-duplicate documents. The method uses two sentence-level features: the number of terms in a sentence and the terms at particular positions. A suffix tree is used to match sentence blocks efficiently. Experiments comparing our method with two other representative methods show that it is both effective and efficient, and that it has the potential to be used in practice.
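To make the two sentence-level features concrete, the following is a minimal sketch, not the authors' implementation. It assumes the "particular positions" are the first and last term of each sentence, and it substitutes a simple dynamic-programming scan for the suffix tree when looking for the longest block of consecutive matching sentences. All function names and the similarity ratio are hypothetical choices made for illustration.

```python
import re

def sentence_signature(sentence, positions=(0, -1)):
    """Return (term count, terms at the chosen positions) for one sentence.

    Assumption: the positions default to the first and last term.
    """
    terms = re.findall(r"\w+", sentence.lower())
    if not terms:
        return None
    picked = tuple(terms[p] for p in positions if -len(terms) <= p < len(terms))
    return (len(terms), picked)

def document_signatures(text):
    """Split text into sentences on ., ! and ? and keep non-empty signatures."""
    sentences = re.split(r"[.!?]+", text)
    sigs = [sentence_signature(s) for s in sentences]
    return [s for s in sigs if s is not None]

def longest_common_block(a, b):
    """Length of the longest run of consecutive equal signatures.

    Quadratic dynamic programming, used here in place of a suffix tree.
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0] * (len(b) + 1)
        for j, y in enumerate(b, start=1):
            if x == y:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best

def block_similarity(text_a, text_b):
    """Hypothetical score: longest shared block over the shorter document's length."""
    a, b = document_signatures(text_a), document_signatures(text_b)
    if not a or not b:
        return 0.0
    return longest_common_block(a, b) / min(len(a), len(b))
```

Comparing compact signatures rather than full sentences keeps each comparison cheap, at the cost of occasional false matches between different sentences that share a length and the sampled terms. A suffix tree, as used in the paper, would avoid the quadratic scan over sentence pairs.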

Author information

Corresponding author

Correspondence to Shengli Wu.

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Feng, J., Wu, S. (2015). Detecting Near-Duplicate Documents Using Sentence Level Features. In: Chen, Q., Hameurlain, A., Toumani, F., Wagner, R., Decker, H. (eds) Database and Expert Systems Applications. Globe DEXA 2015. Lecture Notes in Computer Science, vol 9262. Springer, Cham. https://doi.org/10.1007/978-3-319-22852-5_17

  • DOI: https://doi.org/10.1007/978-3-319-22852-5_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-22851-8

  • Online ISBN: 978-3-319-22852-5

  • eBook Packages: Computer Science, Computer Science (R0)
