Evaluating strategies for similarity search on the web

DOI: 10.1145/511446.511502
Published: 07 May 2002

ABSTRACT

Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
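As a rough illustration of the idea of using a Web hierarchy in place of user feedback, the sketch below scores a similarity strategy by how often its ordering of page pairs agrees with the ordering implied by a directory. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy category paths and term lists, the use of shared path-prefix depth as the hierarchy-based notion of relatedness, Jaccard overlap as the strategy under test, and the simple pairwise-agreement score.

```python
from itertools import combinations


def common_prefix_depth(path_a, path_b):
    """Number of leading category components two directory paths share."""
    depth = 0
    for x, y in zip(path_a.split("/"), path_b.split("/")):
        if x != y:
            break
        depth += 1
    return depth


def jaccard(terms_a, terms_b):
    """Jaccard overlap of two term collections (the strategy being evaluated)."""
    a, b = set(terms_a), set(terms_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def ordering_agreement(docs, similarity):
    """Fraction of (query, a, b) triples where the strategy ranks a and b
    in the same order as the hierarchy does.

    docs maps a page name to (category_path, terms). Triples where the
    hierarchy has no preference are skipped; strategy ties count as misses.
    """
    agree = total = 0
    for query in docs:
        q_path, q_terms = docs[query]
        others = [name for name in docs if name != query]
        for a, b in combinations(others, 2):
            h_a = common_prefix_depth(q_path, docs[a][0])
            h_b = common_prefix_depth(q_path, docs[b][0])
            if h_a == h_b:
                continue  # hierarchy expresses no preference
            s_a = similarity(q_terms, docs[a][1])
            s_b = similarity(q_terms, docs[b][1])
            total += 1
            if (h_a > h_b and s_a > s_b) or (h_a < h_b and s_a < s_b):
                agree += 1
    return agree / total if total else 0.0


# Hypothetical toy data: category path plus a bag of terms per page.
docs = {
    "p1": ("Top/Computers/Search", ["search", "engine", "index"]),
    "p2": ("Top/Computers/Search", ["search", "engine", "query"]),
    "p3": ("Top/Computers/Hardware", ["cpu", "engine", "chip"]),
    "p4": ("Top/Arts/Music", ["guitar", "band", "album"]),
}

score = ordering_agreement(docs, jaccard)
print(f"agreement with hierarchy: {score:.2f}")
```

Because the score needs only the directory and the strategy's output, different document representations (for example text, anchor-text, or link-based term sets) or parameter settings could be compared by swapping the `similarity` argument or the term lists, without collecting user judgments.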


Published in

WWW '02: Proceedings of the 11th international conference on World Wide Web
May 2002
754 pages
ISBN: 1581134495
DOI: 10.1145/511446

Copyright © 2002 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery
New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
