research-article

The opposite of smoothing: a language model approach to ranking query-specific document clusters

Author:
Oren Kurland

Technion, Haifa, Israel

Technion, Haifa, Israel
View Profile

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrievalJuly 2008Pages 171–178https://doi.org/10.1145/1390334.1390366

Published:20 July 2008Publication History

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

Pages 171–178

ABSTRACT

Exploiting information induced from (query-specific) clustering of top-retrieved documents has long been proposed as means for improving precision at the very top ranks of the returned results. We present a novel language model approach to ranking query-specific clusters by the presumed percentage of relevant documents that they contain. While most previous cluster ranking approaches focus on the cluster as a whole, our model also exploits information induced from documents associated with the cluster. Our model substantially outperforms previous approaches for identifying clusters containing a high relevant-document percentage. Furthermore, using the model to produce document ranking yields precision-at-top-ranks performance that is consistently better than that of the initial ranking upon which clustering is performed; the performance also favorably compares with that of a state-of-the-art pseudo-feedback retrieval method.

References

N. Abdul-Jaleel, J. Allan, W. B. Croft, F. Diaz, L. Larkey, X. Li, M. D. Smucker, and C. Wade. UMASS at TREC 2004 - novelty and hard. In Proceedings of the Thirteenth Text Retrieval Conference (TREC-13), 2004.]]Google ScholarCross Ref
L. Azzopardi, M. Girolami, and K. van Rijsbergen. Topic based language models for ad hoc information retrieval. In Proceedings of International Conference on Neural Networks and IEEE International Conference on Fuzzy Systems, pages 3281--3286, 2004.]]Google ScholarCross Ref
S. Brin and L. Page. The anatomy of a large-scale hypertextual web search engine. In Proceedings of the 7th International World Wide Web Conference, pages 107--117, 1998.]] Google ScholarDigital Library
C. Buckley, G. Salton, J. Allan, and A. Singhal. Automatic query expansion using SMART: TREC3. In Proceedings of the Third Text Retrieval Conference (TREC-3), pages 69--80, 1994.]]Google Scholar
http://www.clusty.com.]]Google Scholar
M. Connell, A. Feng, G. Kumaran, H. Raghavan, C. Shah, and J. Allan. UMass at TDT 2004. TDT2004 System Description, 2004.]]Google Scholar
W. B. Croft. A model of cluster searching based on classification. Information Systems, 5:189--195, 1980.]]Google ScholarCross Ref
W. B. Croft and J. Lafferty, editors. Language Modeling for Information Retrieval. Number 13 in Information Retrieval Book Series. Kluwer, 2003.]] Google ScholarDigital Library
F. Diaz. Regularizing ad hoc retrieval scores. In Proceedings of the Fourteenth International Conference on Information and Knowledge Managment (CIKM), pages 672--679, 2005.]] Google ScholarDigital Library
F. Diaz and D. Metzler. Improving the estimation of relevance models using large external corpora. In Proceedings of SIGIR, pages 154--161, 2006.]] Google ScholarDigital Library
F. Geraci, M. Pellegrini, M. Maggini, and F. Sebastiani. Cluster generation and cluster labeling for Web snippets: A fast and accurate hierarchical solution. In Proceedings of the 13th international conference on string processing and information retrieval (SPIRE), pages 25--37, 2006.]] Google ScholarDigital Library
G. H. Golub and C. F. Van Loan. Matrix Computations. The Johns Hopkins University Press, third edition, 1996.]]Google Scholar
A. Griffiths, H. C. Luckhurst, and P. Willett. Using interdocument similarity information in document retrieval systems. Journal of the American Society for Information Science (JASIS), 37(1):3--11, 1986. Reprinted in Karen Sparck Jones and Peter Willett, eds., Readings in Information Retrieval, Morgan Kaufmann, pp. 365--373, 1997.]] Google ScholarDigital Library
M. A. Hearst and J. O. Pedersen. Reexamining the cluster hypothesis: Scatter/Gather on retrieval results. In Proceedings of SIGIR, 1996.]] Google ScholarDigital Library
N. Jardine and C. J. van Rijsbergen. The use of hierarchic clustering in information retrieval. Information Storage and Retrieval, 7(5):217--240, 1971.]]Google ScholarCross Ref
J. Kleinberg. Authoritative sources in a hyperlinked environment. Technical Report Research Report RJ 10076, IBM, May 1997.]]Google Scholar
O. Kurland. Inter-document similarities, language models, and ad hoc retrieval. PhD thesis, Cornell University, 2006.]] Google ScholarDigital Library
O. Kurland and C. Domshlak. A rank-aggregation approach to searching for optimal query-specific clusters. In Proceedings of SIGIR, 2008.]] Google ScholarDigital Library
O. Kurland and L. Lee. Corpus structure, language models, and ad hoc information retrieval. In Proceedings of SIGIR, pages 194--201, 2004.]] Google ScholarDigital Library
O. Kurland and L. Lee. PageRank without hyperlinks: Structural re-ranking using links induced by language models. In Proceedings of SIGIR, pages 306--313, 2005.]] Google ScholarDigital Library
O. Kurland and L. Lee. Respect my authority! HITS without hyperlinks utilizing cluster-based language models. In Proceedings of SIGIR, pages 83--90, 2006.]] Google ScholarDigital Library
J. D. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR, pages 111--119, 2001.]] Google ScholarDigital Library
J. D. Lafferty and C. Zhai. Document language models, query models, and risk minimization for information retrieval. In Proceedings of SIGIR, pages 111--119, 2001.]] Google ScholarDigital Library
V. Lavrenko, J. Allan, E. DeGuzman, D. LaFlamme, V. Pollard, and S. Thomas. Relevance models for topic detection and tracking. In Proceedings of the Human Language Technology Conference (HLT), pages 104--110, 2002.]] Google ScholarDigital Library
V. Lavrenko and W. B. Croft. Relevance-based language models. In Proceedings of SIGIR, pages 120--127, 2001.]] Google ScholarDigital Library
V. Lavrenko and W. B. Croft. Relevance models in information retrieval. In Croft and Lafferty {8}, pages 11--56.]]Google Scholar
A. Leuski. Evaluating document clustering for interactive information retrieval. In Proceedings of the Tenth International Conference on Information and Knowledge Managment (CIKM), pages 33--40, 2001.]] Google ScholarDigital Library
A. Leuski and J. Allan. Evaluating a visual navigation system for a digital library. In Proceedings of the Second European conference on research and advanced technology for digital libraries (ECDL), pages 535--554, 1998.]] Google ScholarDigital Library
X. Liu and W. B. Croft. Cluster-based retrieval using language models. In Proceedings of SIGIR, pages 186--193, 2004.]] Google ScholarDigital Library
X. Liu and W. B. Croft. Experiments on retrieval of optimal clusters. Technical Report IR-478, Center for Intelligent Information Retrieval (CIIR), University of Massachusetts, 2006.]]Google Scholar
X. Liu and W. B. Croft. Representing clusters for retrieval. In Proceedings of SIGIR, pages 671--672, 2006. Poster.]] Google ScholarDigital Library
Q. Mei, X. Shen, and C. Zhai. Automatic labeling of multinomial topic models. In Proceedings of the 13th ACM SIGKDD international conference, pages 490--499, 2007.]] Google ScholarDigital Library
C. R. Palmer, J. Pesenty, R. Veldes-Perez, M. Christel, A. G. Hauptmann, D. Ng, and H. D. Wactlar. Demonstration of hierarchical document clustering of digital library retrieval results. In Proceedings of the 1st ACM/IEEE-CS joint conference on digital libraries, page 451, 2001.]] Google ScholarDigital Library
J. M. Ponte and W. B. Croft. A language modeling approach to information retrieval. In Proceedings of SIGIR, pages 275--281, 1998.]] Google ScholarDigital Library
S. E. Preece. Clustering as an output option. In Proceedings of the American Society for Information Science, pages 189--190, 1973.]]Google Scholar
J. G. Shanahan, J. Bennett, D. A. Evans, D. A. Hull, and J. Montgomery. Clairvoyance Corporation experiments in the TREC 2003. High accuracy retrieval from documents (HARD) track. In Proceedings of the Twelfth Text Retrieval Conference (TREC-12), pages 152--160, 2003.]]Google Scholar
L. Si, R. Jin, J. Callan, and P. Ogilvie. A language modeling framework for resource selection and results merging. In Proceedings of the 11th International Conference on Information and Knowledge Managment (CIKM), pages 391--397, 2002.]] Google ScholarDigital Library
A. Tombros, R. Villa, and C. van Rijsbergen. The effectiveness of query-specific hierarchic clustering in information retrieval. Information Processing and Management, 38(4):559--582, 2002.]] Google ScholarDigital Library
P. Treeratpituk and J. Callan. Automatically labeling hierarchical clusters. In Proceedings of the sixth national conference on digital government research, pages 167--176, 2006.]] Google ScholarDigital Library
C. J. van Rijsbergen. Information Retrieval. Butterworths, second edition, 1979.]] Google ScholarDigital Library
E. M. Voorhees. The cluster hypothesis revisited. In Proceedings of SIGIR, pages 188--196, 1985.]] Google ScholarDigital Library
X. Wei and W. B. Croft. LDA-based document models for ad-hoc retrieval. In Proceedings of SIGIR, 2006.]] Google ScholarDigital Library
P. Willett. Query specific automatic document classification. International Forum on Information and Documentation, 10(2):28--32, 1985.]]Google Scholar
J. Xu and W. B. Croft. Query expansion using local and global document analysis. In Proceedings of SIGIR, pages 4--11, 1996.]] Google ScholarDigital Library
O. Zamir and O. Etzioni. Web document clustering: a feasibility demonstration. In Proceedings of SIGIR, pages 46--54, 1998.]] Google ScholarDigital Library
C. Zhai and J. D. Lafferty. A study of smoothing methods for language models applied to ad hoc information retrieval. In Proceedings of SIGIR, pages 334--342, 2001.]] Google ScholarDigital Library

Index Terms

The opposite of smoothing: a language model approach to ranking query-specific document clusters
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Recommendations

Ranking document clusters using markov random fields
SIGIR '13: Proceedings of the 36th international ACM SIGIR conference on Research and development in information retrieval

An important challenge in cluster-based document retrieval is ranking document clusters by their relevance to the query. We present a novel cluster ranking approach that utilizes Markov Random Fields (MRFs). MRFs enable the integration of various types ...
Read More
Re-ranking search results using language models of query-specific clusters
Abstract
To obtain high precision at top ranks by a search performed in response to a query, researchers have proposed a cluster-based re-ranking paradigm: clustering an initial list of documents that are the most highly ranked by some initial search, and ...
Read More
A study of the integration of passage-, document-, and cluster-based information for re-ranking search results
Abstract
Cluster-based and passage-based document retrieval paradigms were shown to be effective. While the former are based on utilizing query-related corpus context manifested in clusters of similar documents, the latter address the fact that a document ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Publication

Published in
SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN:9781605581644
DOI:10.1145/1390334
General Chairs:
Tat-Seng Chua
National University of Singapore
,
Mun-Kew Leong
National Library Board, Singapore
,
Program Chairs:
Syung Hyon Myaeng
Information and Communications University, Korea
,
Douglas W. Oard
University of Maryland, College Park, USA
,
Fabrizio Sebastiani
Consiglio Nazionale delle Ricerche, Italy
Copyright © 2008 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 20 July 2008
Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
ad hoc retrieval
cluster-based language models
cluster-ranking
language models
query-specific clusters
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate792of3,983submissions,20%
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 29
  Total Citations
  View Citations
- 680
  Total Downloads
- Downloads (Last 12 months)3
- Downloads (Last 6 weeks)0
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

The opposite of smoothing: a language model approach to ranking query-specific document clusters

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval

ABSTRACT

References

Cited By

Index Terms

Recommendations

Ranking document clusters using markov random fields

Re-ranking search results using language models of query-specific clusters

A study of the integration of passage-, document-, and cluster-based information for re-ranking search results