DOI: 10.1145/1390334.1390366
research-article

The opposite of smoothing: a language model approach to ranking query-specific document clusters

Published: 20 July 2008

ABSTRACT

Exploiting information induced from (query-specific) clustering of top-retrieved documents has long been proposed as a means of improving precision at the very top ranks of the returned results. We present a novel language model approach to ranking query-specific clusters by the presumed percentage of relevant documents that they contain. While most previous cluster ranking approaches focus on the cluster as a whole, our model also exploits information induced from documents associated with the cluster. Our model substantially outperforms previous approaches for identifying clusters containing a high relevant-document percentage. Furthermore, using the model to produce a document ranking yields precision-at-top-ranks performance that is consistently better than that of the initial ranking upon which clustering is performed; the performance also compares favorably with that of a state-of-the-art pseudo-feedback retrieval method.
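The abstract does not give the model's formulas, so the following is only a minimal illustrative sketch of the general idea it describes: scoring a query-specific cluster with evidence from the cluster as a whole and from its member documents. It is not the paper's model. The function names (`unigram_lm`, `rank_clusters`), the interpolation weight `lam`, the Dirichlet parameter `mu`, and the choice of averaging per-document query log-likelihoods are all assumptions made for illustration.

```python
import math
from collections import Counter


def unigram_lm(tokens, corpus_counts, corpus_len, mu):
    """Return a Dirichlet-smoothed unigram model as a function term -> probability."""
    counts = Counter(tokens)
    n = len(tokens)
    vocab = len(corpus_counts)

    def prob(term):
        # Add-one corpus estimate (an assumption) so unseen query terms never get zero mass.
        p_corpus = (corpus_counts.get(term, 0) + 1) / (corpus_len + vocab + 1)
        return (counts.get(term, 0) + mu * p_corpus) / (n + mu)

    return prob


def query_log_likelihood(query, lm):
    """Sum of log-probabilities that the model assigns to the query terms."""
    return sum(math.log(lm(t)) for t in query)


def rank_clusters(query, docs, clusters, lam=0.5, mu=10.0):
    """Rank cluster ids, best first.

    docs:     dict doc_id -> list of tokens (the top-retrieved documents)
    clusters: dict cluster_id -> list of doc_ids
    lam:      hypothetical weight on cluster-as-a-whole evidence
    """
    corpus_counts = Counter(t for toks in docs.values() for t in toks)
    corpus_len = sum(corpus_counts.values())
    scores = {}
    for cid, members in clusters.items():
        # Evidence 1: the cluster treated as one big document.
        merged = [t for d in members for t in docs[d]]
        whole = query_log_likelihood(
            query, unigram_lm(merged, corpus_counts, corpus_len, mu))
        # Evidence 2: the member documents, scored individually and averaged.
        per_doc = sum(
            query_log_likelihood(
                query, unigram_lm(docs[d], corpus_counts, corpus_len, mu))
            for d in members
        ) / len(members)
        scores[cid] = lam * whole + (1 - lam) * per_doc
    return sorted(scores, key=scores.get, reverse=True)
```

With `lam=1.0` the sketch reduces to ranking clusters purely as whole units, the focus the abstract attributes to most previous approaches; lowering `lam` lets the associated documents contribute. A document ranking could then be produced, for example, by listing the documents of the top-ranked clusters first, though how the paper does this is not stated in the abstract.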


Published in

SIGIR '08: Proceedings of the 31st annual international ACM SIGIR conference on Research and development in information retrieval
July 2008
934 pages
ISBN: 9781605581644
DOI: 10.1145/1390334

Copyright © 2008 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 792 of 3,983 submissions, 20%
