Diversification on big data in query processing

Zhang, Meifan; Wang, Hongzhi; Li, Jianzhong; Gao, Hong

doi:10.1007/s11704-019-8324-9

Research Article
Published: 03 January 2020

Volume 14, article number 144607 (2020)

Frontiers of Computer Science

Meifan Zhang¹,
Hongzhi Wang¹,
Jianzhong Li¹ &
Hong Gao¹

Abstract

Popular big data applications such as web search engines and recommendation systems must diversify their results during query processing, so methods that increase the diversity of a result set at big data scale are needed. In this paper, we first define the diversity of a set and the ability of an element to improve the overall diversity. Based on these definitions, we propose a diversification framework that performs well in terms of both effectiveness and efficiency and that carries a theoretical guarantee on its probability of success. Second, we design implementation algorithms based on this framework for both numerical and string data. Third, for numerical and string data respectively, we carry out extensive experiments on real data to verify the performance of the proposed framework, and we also perform scalability experiments on synthetic data.
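
The abstract names two notions, the diversity of a set and the ability of an element to improve it, without stating their formal definitions. The Python sketch below illustrates the general idea only and should not be read as the framework proposed in the paper. It assumes a max-min view of diversity (the smallest pairwise distance in the set) and uses the standard greedy farthest-point heuristic. The names `set_diversity`, `marginal_gain`, `greedy_diversify`, and the two distance functions are hypothetical and chosen for illustration.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
Distance = Callable[[T, T], float]


def set_diversity(items: Sequence[T], dist: Distance) -> float:
    """ASSUMED diversity measure: the smallest pairwise distance in the set."""
    if len(items) < 2:
        return 0.0
    return min(
        dist(items[i], items[j])
        for i in range(len(items))
        for j in range(i + 1, len(items))
    )


def marginal_gain(candidate: T, selected: Sequence[T], dist: Distance) -> float:
    """ASSUMED 'ability to improve diversity': distance to the nearest selected item."""
    return min(dist(candidate, s) for s in selected)


def greedy_diversify(items: Sequence[T], k: int, dist: Distance) -> List[T]:
    """Greedy farthest-point selection of up to k items (a generic heuristic)."""
    if k <= 0 or not items:
        return []
    selected = [items[0]]          # seed with the first item
    remaining = list(items[1:])
    while remaining and len(selected) < k:
        # Pick the candidate that is farthest from its nearest selected item.
        best = max(remaining, key=lambda c: marginal_gain(c, selected, dist))
        selected.append(best)
        remaining.remove(best)
    return selected


def abs_diff(a: float, b: float) -> float:
    """Distance for numerical data."""
    return abs(a - b)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance for string data (row-by-row dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    picked = greedy_diversify([1, 2, 3, 10, 20], 3, abs_diff)
    print(picked)                            # [1, 20, 10]
    print(set_diversity(picked, abs_diff))   # 9
    print(greedy_diversify(["apple", "apply", "banana"], 2, edit_distance))
```

Each greedy step scans every remaining candidate against every selected item, which is fine for a small illustration but would need sampling or indexing at the data sizes the abstract targets.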

Acknowledgements

This paper was partially supported by NSFC (Grant Nos. U1509216, U1866602, 61602129) and Microsoft Research Asia.

Author information

Authors and Affiliations

Department of Computer Science and Technology, Harbin Institute of Technology, Harbin, 150001, China
Meifan Zhang, Hongzhi Wang, Jianzhong Li & Hong Gao

Corresponding author

Correspondence to Hongzhi Wang.

Additional information

Meifan Zhang received the bachelor’s degree in computer science from Harbin Institute of Technology, China in 2014, where she is currently pursuing the PhD degree. Her research interests include big data analytics, data quality, and machine learning.

Hongzhi Wang received the PhD degree in computer science from Harbin Institute of Technology, China in 2008. From 2008 to 2010, he was an assistant professor at Harbin Institute of Technology, China. From 2010 to 2015, he was an associate professor. Since 2015, he has been a professor in the Department of Computer Science and Technology, Harbin Institute of Technology, China. His research interests include big data management, data quality, graph data management, and Web data management. Prof. Wang was a recipient of the Microsoft Fellowship, the Chinese Excellent Database Engineer award, and the IBM PhD Fellowship.

Jianzhong Li received the BS degree from Heilongjiang University, China in 1975. He worked at the University of California at Berkeley as a visiting scholar in 1985. He was also a visiting professor at the University of Minnesota at Minneapolis, USA, from 1991 to 1992 and from 1998 to 1999. Since 1998, he has been a professor in the Department of Computer Science and Technology, Harbin Institute of Technology, China. His current research interests include database management systems, data warehousing and data mining, sensor networks, and data-intensive supercomputing. Prof. Li's honors and positions include Chairman of ACM SIGMOD China and Director of the China Computer Federation.

Hong Gao is a professor and doctoral supervisor at Harbin Institute of Technology, China. She received her PhD degree from Harbin Institute of Technology, China. She has long been engaged in research on massive data computation and quality management, wireless sensor networks, and graphic data management and computation. Her honors and positions include Assistant Director of the China Computer Federation Technical Committee on Databases, member of the China Computer Federation Technical Committee on Sensor Network, and Deputy Director of the Massive Data Computing Lab.

Electronic supplementary material