ABSTRACT
Finding pages on the Web that are similar to a query page (Related Pages) is an important component of modern search engines. A variety of strategies have been proposed for answering Related Pages queries, but comparative evaluation by user studies is expensive, especially when large strategy spaces must be searched (e.g., when tuning parameters). We present a technique for automatically evaluating strategies using Web hierarchies, such as Open Directory, in place of user feedback. We apply this evaluation methodology to a mix of document representation strategies, including the use of text, anchor-text, and links. We discuss the relative advantages and disadvantages of the various approaches examined. Finally, we describe how to efficiently construct a similarity index out of our chosen strategies, and provide sample results from our index.
Index Terms
- Evaluating strategies for similarity search on the web