Web Search Result De-duplication and Clustering

  • Reference work entry
Encyclopedia of Database Systems

Recommended Reading

  1. Broder AZ, Glassman SC, Manasse MS, Zweig G. Syntactic clustering of the web. Comput Netw. 1997;29(8–13):1157–66.

  2. Chowdhury A, Frieder O, Grossman DA, McCabe MC. Collection statistics for fast duplicate document detection. ACM Trans Inf Syst. 2002;20(2):171–91.

  3. Cutting DR, Pedersen JO, Karger D, Tukey JW. Scatter/gather: a cluster-based approach to browsing large document collections. In: Proceedings of the 15th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1992. p. 318–29.

  4. Dumais ST, Cutrell E, Chen H. Optimizing search by showing results in context. In: Proceedings of the SIGCHI Conference on Human Factors in Computing Systems; 2001. p. 277–84.

  5. Ferragina P, Gulli A. A personalized search engine based on Web-snippet hierarchical clustering. In: Proceedings of the 14th International World Wide Web Conference; 2005. p. 801–10.

  6. Hearst MA, Pedersen JO. Reexamining the cluster hypothesis: scatter/gather on retrieval results. In: Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 1996. p. 76–84.

  7. Hoad T, Zobel J. Methods for identifying versioned and plagiarised documents. J Am Soc Inf Sci Technol. 2003;54(3):203–15.

  8. Huffman S, Lehman A, Stolboushkin A, Wong-Toi H, Yang F, Roehrig H. Multiple-signal duplicate detection for search evaluation. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 223–30.

  9. Jardine N, van Rijsbergen C. The use of hierarchic clustering in information retrieval. Inf Storage Retr. 1971;7(5):217–40.

  10. Manber U. Finding similar files in a large file system. In: Proceedings of the USENIX Winter 1994 Technical Conference; 1994. p. 1–10.

  11. Mei Q, Shen X, Zhai C. Automatic labeling of multinomial topic models. In: Proceedings of the 13th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2007. p. 490–9.

  12. Shivakumar N, Garcia-Molina H. SCAM: a copy detection mechanism for digital documents. In: Proceedings of the 2nd International Conference in Theory and Practice of Digital Libraries; 1995.

  13. Wang X, Zhai C. Learn from web search logs to organize search results. In: Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval; 2007. p. 87–94.

  14. Willett P. Recent trends in hierarchic document clustering: a critical review. Inf Process Manag. 1988;24(5):577–97.

  15. Zamir O, Etzioni O. Grouper: a dynamic clustering interface to Web search results. In: Proceedings of the 8th International World Wide Web Conference; 1999.

Author information

Correspondence to Xuehua Shen.

Copyright information

© 2018 Springer Science+Business Media, LLC, part of Springer Nature

Cite this entry

Shen, X., Zhai, C. (2018). Web Search Result De-duplication and Clustering. In: Liu, L., Özsu, M.T. (eds) Encyclopedia of Database Systems. Springer, New York, NY. https://doi.org/10.1007/978-1-4614-8265-9_326
