Article

Using navigation data to improve IR functions in the context of web search

Authors:
Mark H. Hansen, Bell Laboratories, Murray Hill, NJ
Elizabeth Shriver, Bell Laboratories, Murray Hill, NJ

CIKM '01: Proceedings of the tenth international conference on Information and knowledge management, October 2001, Pages 135–142
https://doi.org/10.1145/502585.502609

Published: 05 October 2001

ABSTRACT

As part of the process of delivering content, devices like proxies and gateways log valuable information about the activities and navigation patterns of users on the Web. In this study, we consider how this navigation data can be used to improve Web search. A query posted to a search engine together with the set of pages accessed during a search task is known as a search session. We develop a mixture model for the observed set of search sessions, and propose variants of the classical EM algorithm for training. The model itself yields a type of navigation-based query clustering. By implicitly borrowing strength between related queries, the mixture formulation allows us to identify the "highly relevant" URLs for each query cluster. Next, we explore methods for incorporating existing labeled data (the Yahoo! directory, for example) to speed convergence and help resolve low-traffic clusters. Finally, the mixture formulation also provides for a simple, hierarchical display of search results based on the query clusters. The effectiveness of our approach is evaluated using proxy access logs for the outgoing Lucent proxy.
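The mixture-model idea in the abstract can be illustrated with a small sketch. The abstract does not state the component distributions or the EM variants the authors use, so the code below is only an assumption-laden stand-in: it fits a plain mixture of multinomials over URL visits with the classical EM algorithm. The function name `em_mixture`, the integer URL indices, the smoothing constant `alpha`, and the toy sessions are all hypothetical and chosen for illustration.

```python
import math
import random


def em_mixture(sessions, n_urls, n_clusters, n_iter=50, seed=0, alpha=0.01):
    """Fit a mixture of multinomials to sessions (lists of URL indices) via EM.

    Illustrative assumption only; not the model from the paper.
    Returns (weights, url_probs, resp), where resp[i][k] is the posterior
    probability that session i belongs to cluster k.
    """
    rng = random.Random(seed)
    weights = [1.0 / n_clusters] * n_clusters
    # Random, normalized starting URL distribution for each cluster.
    url_probs = []
    for _ in range(n_clusters):
        row = [rng.random() + 0.1 for _ in range(n_urls)]
        total = sum(row)
        url_probs.append([p / total for p in row])

    resp = []
    for _ in range(n_iter):
        # E-step: posterior cluster membership per session, in the log domain.
        resp = []
        for urls in sessions:
            log_post = [
                math.log(weights[k])
                + sum(math.log(url_probs[k][u]) for u in urls)
                for k in range(n_clusters)
            ]
            m = max(log_post)
            unnorm = [math.exp(lp - m) for lp in log_post]
            z = sum(unnorm)
            resp.append([v / z for v in unnorm])

        # M-step: re-estimate mixing weights and per-cluster URL probabilities.
        # The alpha smoothing keeps every weight and probability above zero.
        for k in range(n_clusters):
            nk = sum(r[k] for r in resp)
            weights[k] = (nk + alpha) / (len(sessions) + n_clusters * alpha)
            counts = [alpha] * n_urls
            for r, urls in zip(resp, sessions):
                for u in urls:
                    counts[u] += r[k]
            total = sum(counts)
            url_probs[k] = [c / total for c in counts]
    return weights, url_probs, resp


# Toy sessions: each is the list of URL indices visited after one query.
toy_sessions = [[0, 1], [0, 2], [1, 2, 0], [3, 4], [4, 5], [3, 5, 4]]
weights, url_probs, resp = em_mixture(toy_sessions, n_urls=6, n_clusters=2)
```

Under these assumptions, the largest entries in each row of `url_probs` play the role of the "highly relevant" URLs for a cluster, and each row of `resp` gives a soft assignment of a session (and hence its query) to the clusters.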

Index Terms

1. Information systems

Published in
CIKM '01: Proceedings of the tenth international conference on Information and knowledge management
October 2001
616 pages
ISBN: 1581134363
DOI: 10.1145/502585
Editors: Henrique Paques (Georgia Institute of Technology), Ling Liu (Georgia Institute of Technology), David Grossman (Illinois Institute of Technology)
General Chair: Calton Pu (Georgia Institute of Technology)
Copyright © 2001 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
expectation-maximization algorithm
model-based clustering
proxy access logs
query clustering
web searching