Optimizing top-k retrieval: submodularity analysis and search strategies

  • Research Article
Frontiers of Computer Science

Abstract

The key issue in top-k retrieval, finding a set of k documents (from a large document collection) that can best answer a user’s query, is to strike the optimal balance between relevance and diversity. In this paper, we study the top-k retrieval problem in the framework of facility location analysis and prove the submodularity of its objective function, which provides a theoretical approximation guarantee of factor \(1-\frac{1}{e}\) for the (best-first) greedy search algorithm. Furthermore, we propose a two-stage hybrid search strategy which first obtains a high-quality initial set of top-k documents via greedy search, and then refines that result set iteratively via local search. Experiments on two large TREC benchmark datasets show that our two-stage hybrid search strategy can outperform the existing approaches in both effectiveness and efficiency.
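
The two-stage strategy described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the classical facility-location coverage function f(S) = Σ_i max_{j∈S} sim(i, j) over a nonnegative similarity matrix, which is monotone submodular, whereas the objective used in the paper may combine relevance and diversity differently. The names `coverage`, `greedy_top_k`, `local_search`, `hybrid_top_k`, `brute_force_top_k`, and the toy matrix `SIM` are all hypothetical.

```python
import math
from itertools import combinations


def coverage(sim, selected):
    """Assumed facility-location objective: each document i is credited
    with its similarity to the closest selected document."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_top_k(sim, k):
    """Stage 1: best-first greedy search, adding the document with the
    largest marginal gain at each step."""
    selected = []
    candidates = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        current = coverage(sim, selected)
        best = max(sorted(candidates),
                   key=lambda j: coverage(sim, selected + [j]) - current)
        selected.append(best)
        candidates.remove(best)
    return selected


def local_search(sim, selected, tol=1e-12):
    """Stage 2: swap one selected document for one unselected document
    whenever the swap strictly improves the objective; stop when no
    single swap helps."""
    selected = list(selected)
    improved = True
    while improved:
        improved = False
        current = coverage(sim, selected)
        outside = [j for j in range(len(sim)) if j not in selected]
        for pos in range(len(selected)):
            for j in outside:
                trial = selected[:pos] + [j] + selected[pos + 1:]
                if coverage(sim, trial) > current + tol:
                    selected, improved = trial, True
                    break
            if improved:
                break
    return selected


def hybrid_top_k(sim, k):
    """Greedy initialisation followed by local-search refinement."""
    return local_search(sim, greedy_top_k(sim, k))


def brute_force_top_k(sim, k):
    """Exhaustive optimum, feasible only for tiny inputs; used to check
    the (1 - 1/e) bound on the toy example."""
    return max(combinations(range(len(sim)), k),
               key=lambda s: coverage(sim, s))


# Toy, made-up similarity matrix over five documents.
SIM = [
    [1.0, 0.9, 0.1, 0.0, 0.2],
    [0.9, 1.0, 0.2, 0.1, 0.1],
    [0.1, 0.2, 1.0, 0.8, 0.3],
    [0.0, 0.1, 0.8, 1.0, 0.4],
    [0.2, 0.1, 0.3, 0.4, 1.0],
]

greedy_set = greedy_top_k(SIM, 2)
hybrid_set = hybrid_top_k(SIM, 2)
optimal_set = brute_force_top_k(SIM, 2)
bound = (1 - 1 / math.e) * coverage(SIM, optimal_set)
print(greedy_set, coverage(SIM, greedy_set), ">= bound", bound)
print(hybrid_set, coverage(SIM, hybrid_set))
```

On this toy matrix the greedy set already satisfies the \(1-\frac{1}{e}\) bound, and the swap-based refinement finds a strictly better set. The sketch recomputes the objective from scratch at every step for readability; a practical version would cache each document's current best similarity so that marginal gains are cheap to evaluate.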



Author information

Corresponding author

Correspondence to Chaofeng Sha.

Additional information

Chaofeng Sha is an associate professor at Fudan University, China. He received the BS degree in applied mathematics in 1998 from Xidian University, China, and the MS degree in 2001 and the PhD degree in 2009 from Fudan University, China, both in computer science. Since 2001, he has been in the School of Computer Science at Fudan University. His work is in the area of data mining and data management.

Keqiang Wang received the bachelor's degree from East China Normal University (ECNU), China in 2012. He is currently a PhD student at ECNU. His research interests mainly focus on recommender systems and data mining.

Dell Zhang is a senior lecturer in the Department of Computer Science and Information Systems at Birkbeck, University of London (UOL), UK. He is also a senior member of ACM, a senior member of IEEE, and a Fellow of RSS. He joined Birkbeck in 2005. Before he moved to the UK, he was a research fellow at the Singapore-MIT Alliance. His research is on the theme of improving information retrieval and organisation through machine learning or data mining.

Xiaoling Wang received the bachelor's, master's, and doctoral degrees from Southeastern University, China in 1997, 2000, and 2003, respectively. She is currently a professor and vice dean at the Software Engineering Institute, East China Normal University (ECNU), China. She was an assistant professor and an associate professor at Fudan University from 2003 to 2008, and joined ECNU in 2008. She was selected for the New-Century Talent Program of the Ministry of Education of China. Her research interests mainly include Web data management, data mining and data service technology.

Aoying Zhou is a professor in computer science at East China Normal University (ECNU), China, where he heads the Institute of Massive Computing. Before joining ECNU in 2008, he worked in the Computer Science Department at Fudan University for 15 years. He is a recipient of the National Science Fund for Distinguished Young Scholars supported by the National Natural Science Foundation of China and of a professorship appointment under the Changjiang Scholars Program of the Ministry of Education. He is now acting as a vice director of ACM SIGMOD China and of the Database Technology Committee of the China Computer Federation. He is serving as a member of the editorial boards of the VLDB Journal, the WWW Journal, etc. His research interests include data management, memory cluster computing, big data benchmarking and performance optimization.

About this article

Cite this article

Sha, C., Wang, K., Zhang, D. et al. Optimizing top-k retrieval: submodularity analysis and search strategies. Front. Comput. Sci. 10, 477–487 (2016). https://doi.org/10.1007/s11704-015-5222-7

  • DOI: https://doi.org/10.1007/s11704-015-5222-7
