ABSTRACT
As part of the process of delivering content, devices like proxies and gateways log valuable information about the activities and navigation patterns of users on the Web. In this study, we consider how this navigation data can be used to improve Web search. A query posted to a search engine together with the set of pages accessed during a search task is known as a search session. We develop a mixture model for the observed set of search sessions, and propose variants of the classical EM algorithm for training. The model itself yields a type of navigation-based query clustering. By implicitly borrowing strength between related queries, the mixture formulation allows us to identify the "highly relevant" URLs for each query cluster. Next, we explore methods for incorporating existing labeled data (the Yahoo! directory, for example) to speed convergence and help resolve low-traffic clusters. Finally, the mixture formulation also provides for a simple, hierarchical display of search results based on the query clusters. The effectiveness of our approach is evaluated using proxy access logs for the outgoing Lucent proxy.
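The mixture-model idea in the abstract can be illustrated with a minimal sketch. It is not the authors' model: it assumes each search session is reduced to a bag of visited URL indices (the query text is ignored), that each cluster is a multinomial distribution over URLs, and that plain batch EM is used rather than the variants the abstract mentions. The function name `em_mixture`, the smoothing constant `alpha`, and the toy sessions are all hypothetical.

```python
import math
import random


def em_mixture(sessions, n_urls, n_clusters, n_iter=50, alpha=0.01, seed=0):
    """Fit a mixture of multinomials over URL visits with plain batch EM.

    sessions: list of lists of URL indices, one inner list per search session.
    alpha:    small additive smoothing constant (an assumption of this sketch).
    Returns (pi, theta, resp): mixing weights, per-cluster URL probabilities,
    and per-session cluster responsibilities.
    """
    rng = random.Random(seed)
    n = len(sessions)

    # Random soft initialisation of the responsibilities.
    resp = []
    for _ in sessions:
        row = [rng.random() + 1e-3 for _ in range(n_clusters)]
        total = sum(row)
        resp.append([r / total for r in row])

    pi, theta = [], []
    for _ in range(n_iter):
        # M-step: re-estimate mixing weights and URL distributions.
        pi = [
            (sum(r[k] for r in resp) + alpha) / (n + n_clusters * alpha)
            for k in range(n_clusters)
        ]
        theta = []
        for k in range(n_clusters):
            counts = [alpha] * n_urls
            for r, urls in zip(resp, sessions):
                for u in urls:
                    counts[u] += r[k]
            total = sum(counts)
            theta.append([c / total for c in counts])

        # E-step: posterior cluster probabilities, computed in log space.
        for i, urls in enumerate(sessions):
            logp = [
                math.log(pi[k]) + sum(math.log(theta[k][u]) for u in urls)
                for k in range(n_clusters)
            ]
            top = max(logp)
            weights = [math.exp(lp - top) for lp in logp]
            total = sum(weights)
            resp[i] = [w / total for w in weights]

    return pi, theta, resp


# Toy data: URLs 0 and 1 are visited together, as are URLs 2 and 3.
sessions = [[0, 1, 0, 1]] * 3 + [[2, 3, 2, 3]] * 3
pi, theta, resp = em_mixture(sessions, n_urls=4, n_clusters=2)

# Hard cluster label per session: the most probable cluster.
labels = [max(range(2), key=lambda k: r[k]) for r in resp]
```

In this simplified setting, the URLs with the largest entries in `theta[k]` play the role of the "highly relevant" URLs for cluster `k`, and `resp` gives the soft assignment of each session to a cluster.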
Index Terms
- Using navigation data to improve IR functions in the context of web search