Definition
Web search result de-duplication and clustering are both techniques for improving the organization and presentation of Web search results. De-duplication refers to the removal of duplicate or near-duplicate web pages from the search result page. Because users are unlikely to be interested in redundant information, de-duplication improves search results by decreasing redundancy and increasing diversity among the results.
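As an illustration of near-duplicate removal, the following is a minimal sketch, not the method of any particular search engine. It assumes that each result is represented by its text, that similarity is measured as the Jaccard overlap of word shingles (in the spirit of syntactic resemblance measures), and that a result is dropped when it is too similar to one already kept. The function names, the shingle size k, and the threshold of 0.8 are all hypothetical choices made for this example.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (word k-grams) in a text."""
    words = text.lower().split()
    if len(words) < k:
        # Short texts are represented by a single shingle (or none).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def deduplicate(results, threshold=0.8, k=3):
    """Keep each result only if it is not a near-duplicate of an earlier kept one.

    `results` is assumed to be in ranked order, so the highest-ranked
    member of each group of near-duplicates is the one that survives.
    """
    kept = []
    kept_shingles = []
    for text in results:
        s = shingles(text, k)
        if all(jaccard(s, t) < threshold for t in kept_shingles):
            kept.append(text)
            kept_shingles.append(s)
    return kept


if __name__ == "__main__":
    ranked = [
        "the quick brown fox jumps over the lazy dog",
        "The quick brown fox jumps over the lazy dog today",
        "web search result clustering groups similar pages together",
    ]
    for text in deduplicate(ranked):
        print(text)
```

This sketch compares every result against every kept result, which is adequate for a single result page but not for a whole collection; large-scale systems typically rely on compact fingerprints rather than full shingle sets.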
Web search result clustering means that, given a set of web search results, the search engine partitions them into subsets (clusters) according to the similarity between results and presents them in a structured way. Clustering improves the organization of search results because similar pages are grouped together in a cluster, so a user can navigate directly to the most relevant cluster to find relevant pages. Hierarchical clustering is often used to generate a hierarchical tree structure which facilitates...
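The partitioning step can be illustrated with a minimal sketch, given here under stated assumptions rather than as a definitive algorithm. It assumes each result is represented by a short snippet, uses cosine similarity over raw term counts, and merges clusters bottom-up (single-link agglomerative clustering) until no pair of clusters is similar enough. The function names and the threshold of 0.5 are hypothetical; recording the order of merges, which this sketch does not do, would yield a tree of the kind produced by hierarchical clustering.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_results(snippets, threshold=0.5):
    """Group snippets by single-link agglomerative clustering.

    Returns a list of clusters, each a list of indices into `snippets`.
    Merging stops when the most similar pair of clusters falls below
    `threshold`.
    """
    vectors = [Counter(s.lower().split()) for s in snippets]
    clusters = [[i] for i in range(len(snippets))]
    while len(clusters) > 1:
        best_sim, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Single link: similarity of the closest pair of members.
                sim = max(cosine(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best_sim:
                    best_sim, best_pair = sim, (x, y)
        if best_sim < threshold:
            break
        x, y = best_pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


if __name__ == "__main__":
    snippets = [
        "jaguar car dealer prices",
        "jaguar car review prices",
        "jaguar animal habitat rainforest",
        "jaguar animal diet habitat",
    ]
    for members in cluster_results(snippets):
        print([snippets[i] for i in members])
```

For the ambiguous query in the usage example, the two senses of the query term end up in separate clusters, which is the kind of grouping that lets a user navigate directly to the relevant subset of results.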
Copyright information
© 2009 Springer Science+Business Media, LLC
About this entry
Cite this entry
Shen, X., Zhai, C. (2009). Web Search Result De-duplication and Clustering. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_326
DOI: https://doi.org/10.1007/978-0-387-39940-9_326
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-35544-3
Online ISBN: 978-0-387-39940-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering