Abstract
The key issue in top-k retrieval, that is, finding a set of k documents from a large document collection that best answers a user's query, is to strike the right balance between relevance and diversity. In this paper, we study the top-k retrieval problem in the framework of facility location analysis and prove that the objective function is submodular, which gives the (best-first) greedy search algorithm a theoretical approximation guarantee of factor \(1-\frac{1}{e}\). Furthermore, we propose a two-stage hybrid search strategy that first obtains a high-quality initial set of top-k documents via greedy search and then refines that result set iteratively via local search. Experiments on two large TREC benchmark datasets show that our two-stage hybrid search strategy outperforms existing approaches in both effectiveness and efficiency.
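The two-stage strategy described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation: the objective used here is the textbook facility-location coverage function (each document is "served" by its most similar selected document), the similarity matrix `sim` is made-up toy data assumed to be non-negative, and the function names `coverage`, `greedy`, `local_search`, and `hybrid` are hypothetical. The paper's actual objective also accounts for relevance to the query, which this sketch omits.

```python
def coverage(selected, sim):
    """Facility-location-style objective (an assumption, not the paper's exact
    function): sum over all documents of the similarity to the closest
    selected document. Assumes non-negative similarities."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy(sim, k):
    """Stage 1: best-first greedy search. Repeatedly add the document that
    yields the highest objective value until k documents are selected."""
    n = len(sim)
    selected = []
    for _ in range(min(k, n)):
        best = max(
            (j for j in range(n) if j not in selected),
            key=lambda j: coverage(selected + [j], sim),
        )
        selected.append(best)
    return selected


def local_search(sim, selected, tol=1e-12):
    """Stage 2: swap-based local search. Replace one selected document with
    one unselected document whenever that strictly improves the objective,
    and stop when no single swap helps."""
    n = len(sim)
    current = list(selected)
    improved = True
    while improved:
        improved = False
        base = coverage(current, sim)
        for pos in range(len(current)):
            for j in range(n):
                if j in current:
                    continue
                candidate = current[:pos] + [j] + current[pos + 1:]
                if coverage(candidate, sim) > base + tol:
                    current = candidate
                    improved = True
                    break
            if improved:
                break
    return current


def hybrid(sim, k):
    """Greedy initialisation followed by local-search refinement."""
    return local_search(sim, greedy(sim, k))


# Toy data (hypothetical): documents 0 and 1 are near-duplicates,
# and so are documents 2 and 3.
sim = [
    [1.0, 0.9, 0.1, 0.0],
    [0.9, 1.0, 0.1, 0.0],
    [0.1, 0.1, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
]
result = hybrid(sim, 2)
```

On this toy matrix the selected pair contains one document from each near-duplicate group, which is the diversity effect the objective is meant to reward. Because local search only accepts strict improvements, the hybrid result is never worse than the greedy result under the same objective.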
Author information
Chaofeng Sha is an associate professor at Fudan University, China. He received the BS degree in applied mathematics in 1998 from Xidian University, China, and the MS degree in 2001 and the PhD degree in 2009 from Fudan University, China, both in computer science. Since 2001, he has been in the School of Computer Science at Fudan University. His work is in the area of data mining and data management.
Keqiang Wang received the bachelor's degree from East China Normal University (ECNU), China in 2012. He is currently a PhD student at ECNU. His research interests mainly focus on recommender systems and data mining.
Dell Zhang is a senior lecturer in the Department of Computer Science and Information Systems at Birkbeck, University of London (UOL), UK. He is also a senior member of ACM, a senior member of IEEE, and a Fellow of RSS. He joined Birkbeck in 2005. Before he moved to the UK, he was a research fellow at the Singapore-MIT Alliance. His research is on the theme of improving information retrieval and organisation through machine learning or data mining.
Xiaoling Wang received the bachelor's, master's, and doctoral degrees from Southeastern University, China in 1997, 2000, and 2003, respectively. She is currently a professor and vice dean in the Software Engineering Institute, East China Normal University (ECNU), China. She was an assistant professor and an associate professor at Fudan University from 2003 to 2008, and joined ECNU in 2008. She was selected for the New-Century Talent Program of the Ministry of Education of China. Her research interests mainly include Web data management, data mining, and data service technology.
Aoying Zhou is a professor in computer science at East China Normal University (ECNU), China, where he heads the Institute of Massive Computing. Before joining ECNU in 2008, he worked in the Computer Science Department at Fudan University for 15 years. He is a winner of the National Science Fund for Distinguished Young Scholars, supported by the National Natural Science Foundation of China, and holds a professorship appointment under the Changjiang Scholars Program of the Ministry of Education. He is now a vice director of ACM SIGMOD China and of the Database Technology Committee of the China Computer Federation. He serves on the editorial boards of VLDB Journal, WWW Journal, etc. His research interests include data management, memory cluster computing, big data benchmarking, and performance optimization.
Cite this article
Sha, C., Wang, K., Zhang, D. et al. Optimizing top-k retrieval: submodularity analysis and search strategies. Front. Comput. Sci. 10, 477–487 (2016). https://doi.org/10.1007/s11704-015-5222-7
DOI: https://doi.org/10.1007/s11704-015-5222-7