Evaluating strategies for similarity search on the web

Authors:
Taher H. Haveliwala, Stanford University, Stanford, CA
Aristides Gionis, Stanford University, Stanford, CA
Dan Klein, Stanford University, Stanford, CA
Piotr Indyk, Laboratory of Computer Science, Cambridge, MA

WWW '02: Proceedings of the 11th international conference on World Wide Web, May 2002, Pages 432–442
https://doi.org/10.1145/511446.511502

Published: 07 May 2002

ABSTRACT

Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
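
The idea of scoring a strategy against a Web hierarchy instead of user feedback can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the abstract: the category paths and term sets are toy data, the hierarchy score is simply the number of leading directory components two pages share, the strategy is Jaccard overlap of term sets, and agreement between the two orderings is measured with the Goodman-Kruskal gamma statistic.

```python
from itertools import combinations


def common_prefix_depth(path_a, path_b):
    """Count the leading directory components shared by two category paths."""
    depth = 0
    for a, b in zip(path_a.split("/"), path_b.split("/")):
        if a != b:
            break
        depth += 1
    return depth


def jaccard(a, b):
    """Jaccard overlap of two sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def goodman_kruskal_gamma(xs, ys):
    """(concordant - discordant) / (concordant + discordant); tied pairs are ignored."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        product = (x1 - x2) * (y1 - y2)
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    total = concordant + discordant
    return (concordant - discordant) / total if total else 0.0


def evaluate_strategy(query, pages, categories, features):
    """Compare the hierarchy-induced ordering with the strategy-induced ordering."""
    others = [p for p in pages if p != query]
    hierarchy_scores = [common_prefix_depth(categories[query], categories[p]) for p in others]
    strategy_scores = [jaccard(features[query], features[p]) for p in others]
    return goodman_kruskal_gamma(hierarchy_scores, strategy_scores)


# Hypothetical toy data: directory categories and term sets per page.
categories = {
    "q": "Arts/Music/Jazz",
    "a": "Arts/Music/Jazz",
    "b": "Arts/Music/Rock",
    "c": "Science/Physics",
}
features = {
    "q": {"jazz", "sax", "music"},
    "a": {"jazz", "sax", "band"},
    "b": {"rock", "guitar", "music"},
    "c": {"quantum", "physics"},
}

score = evaluate_strategy("q", list(categories), categories, features)
print(score)
```

A gamma near 1 means the strategy ranks pages in the same order as the hierarchy does, and a value near -1 means it ranks them in the opposite order. Swapping `jaccard` for another scoring function would let different document representations be compared on the same directory data without user judgments.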

Index Terms

Evaluating strategies for similarity search on the web
1. Computing methodologies
  1. Artificial intelligence
    1. Natural language processing
2. Information systems
  1. Information retrieval
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
WWW '02: Proceedings of the 11th international conference on World Wide Web
May 2002
754 pages
ISBN:1581134495
DOI:10.1145/511446
Conference Chairs:
David Lassner, University of Hawaii
Dave De Roure, University of Southampton
Arun Iyengar, IBM T.J. Watson Research Center
Copyright © 2002 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
evaluation
open directory project
related pages
search
similarity search
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%