Proactive and reactive multi-dimensional histogram maintenance for selectivity estimation☆
Introduction
Extensive literature exists on the selectivity estimation problem. A variety of solutions have been proposed, but most popular ones are histogram-based solutions. Many state-of-the-art histogram methods adopt a “self-tuning” strategy, and construct and maintain histograms based on the feedback information provided in the form of query results (Aboulnaga and Chaudhuri, 1999, Bruno et al., 2001, Stillger et al., 2001). This allows them to keep more histogram buckets in regions queried more frequently. Due to the common spatial and temporal locality of user queries, these methods usually make more efficient use of the limited memory than methods based solely on data distribution.
However, all existing self-tuning histograms are “reactive” in nature, that is, they are updated based on the feedback information of the past queries only. This reactive approach does not consider the possibility that certain future query occurrences can be predicted. In this paper, by building and using “proactive histograms” based on query prediction, we show that a proactive approach can make more efficient use of histogram memory, when combined with the reactive approach. We believe this paper is the first to introduce the idea of proactive histograms and to provide a method to take advantage of these histograms.
Query prediction may be obtained by inspecting the query log and identifying patterns in it, or may be derived directly from the business practice of the enterprise. For example, (1) in a university database, queries on students’ course grades occur more frequently at the end of every semester, and (2) in a financial company database, queries on financial records occur more frequently at the end of every financial quarter.
Queries are predictable to a varying degree, either because they follow patterns to a varying degree or the patterns are recognizable to a varying degree. It is thus important to handle queries with different degrees of predictability. To this end, we equip our method with the ability to adjust itself to be more proactive or reactive depending on query predictability. We call our method the proactive and reactive histogram (PRHist).
Within the PRHist method, we need to deal with two main issues: (1) predicting future queries, and (2) using the predicted queries to improve selectivity estimation. In order to address the first issue, we have PRHist find patterns in the query log.1 Looking for patterns of individual query occurrences is not feasible because those occurrences may rarely repeat themselves (the classical overfitting phenomenon). Therefore, PRHist groups queries into “clusters” and looks for patterns considering all queries in the same cluster as a whole. Section 4.1 provides a formal definition of a cluster of queries and cluster patterns.
In order to address the second issue, we have PRHist operate in two phases: off-line and on-line. In the off-line phase, we divide the future into a sequence of consecutive subintervals of time which we call f-subintervals. Then we predict a set of queries for each f-subinterval. Then, it builds a sequence of proactive histograms, one histogram for each of the f-subintervals. In the on-line phase, PRHist maintains an on-line reactive histogram using query feedback as done in the STHoles. At the beginning of each f-subinterval, it loads the proactive histogram built for the f-subinterval and merges it with the on-line reactive histogram. At the time of the merge, PRHist assigns a weight to each histogram based on its “confidence” in the proactive histogram, thereby becoming more reactive (when the confidence is low) or more proactive (otherwise). The confidence is measured as a combination of two factors: the predictability of queries and the extent of data updates.
Our method incurs very small additional run-time overhead over above existing reactive histogram techniques. This small overhead is for loading the off-line proactive histogram into memory and merging it with the on-line reactive histogram at the beginning of each f-subinterval. For example, if a proactive histogram is loaded every 30 min as is the case in our experiments with the real query set, this amounts to one or two extra page loadings2 plus one merge operation for every 30 min. The merge operation is fast since the histograms often contain only a small number of buckets.
We have conducted extensive performance comparisons between PRHist and the state-of-the-art reactive histogram method STHoles, using both real and synthetic data sets. The results show that PRHist outperforms STHoles for most test cases even when only a very small portion of the queries are predictable or even when there are a high percentage of data updates.
We make two key contributions through this paper. We (1) develop a novel histogram maintenance method, which uses proactive histograms complemented with reactive histograms to improve selectivity estimation accuracy, and (2) conduct an extensive performance study to show the performance advantage of our method in a variety of situations.
The rest of the paper is organized as follows. Following this introduction, we discuss related work in Section 2, and formally define the problem addressed by PRHist in Section 3. We describe the off-line proactive histogram construction in Section 4 and the on-line PRHist maintenance in Section 5. In Section 6, we present the experiments, and with Section 7, we conclude the paper.
Section snippets
Related work
In this section we discuss three areas of related work: histograms for selectivity estimation, forecasting methods for query prediction, and proactive optimization methods used in database systems.
Problem formulation
In this section, we formally define the selectivity estimation problem addressed by PRHist.
Off-line proactive histogram sequence construction
Fig. 3 shows the high-level algorithm used for off-line proactive histogram sequence construction. First, the query log is used to find a set of “useful” clusters (line 1). These clusters are then used for predicting queries, which in turn are used to construct a sequence of proactive histograms (line 2). Then, optimal weights are assigned to the buckets of the proactive and reactive histograms using an optimization algorithm (line 3). (The weights are used during the on-line phase for
On-line PRHist maintenance
Fig. 12 gives a high-level overview of the on-line maintenance of PRHist. A proactive histogram built off-line is loaded and merged with an on-line reactive histogram at the beginning of every f-subinterval. At this point, the bucket weights are adjusted depending on the percentage of data updates. Then, during the f-subinterval, the histogram is updated reactively with the feedback from incrementally occurring queries.
Experiments
We evaluate PRHist by comparing it with STHoles using both real and synthetic query/data sets. In this section, we first describe the experimental setup, and then present the results from five experiments.
Conclusions
In this paper we have developed a novel multi-dimensional histogram method, called PRHist. We believe it is the first method that incorporates proactive histograms for selectivity estimation. PRHist has an off-line and an on-line phase. In the off-line phase it proactively builds customized histograms for query sets predicted for future time intervals. To accommodate the fact that predictability of queries may vary, PRHist uses a weighted combination of reactive and proactive histograms during
Acknowledgments
We thank Dennis Fuchs for providing us with the implementation of STHoles. We also thank Nancy Snow Littlefield, Michael Grundhauser, and Keith Kennedy for providing us with the real data and query sets. This research has been supported by the US National Science Foundation through Grant No. IIS-0415023 and the US Department of Energy through Grant No. DE-FG02-ER45962.
Zhen He is a lecturer in the department of computer science at La Trobe University. His main research areas are database systems optimization, time series mining, wireless sensor networks and XML information retrieval. Prior to joining La Trobe University, he worked as a postdoctoral research associate in the University of Vermont. He holds a Bachelor of Computer Science, Honors and Ph.D. degrees from the Australian National University.
References (37)
- Aboulnaga, A., Chaudhuri, S., 1999. Self-tuning histograms: Building histograms without looking at data. In: ACM...
- Acharya, A., Gibbons, P.B., Poosala, V., 2000. Congressional samples for approximate answering of group-by queries. In:...
- Agrawal, D., Abbadi, A.E., Singh, A.K., Yurek, T., 1997. Efficient view maintenance at data warehouses. In: Proceedings...
- Babcock, B., Chaudhuri, S., Das, G., 2003. Dynamic sample selection for approximate query processing. In: Proceedings...
- Babu, S., Bizarro, P., DeWitt, D., 2005. Proactive re-optimization. In: Processings of SIGMOD, pp....
- et al.
Time Series Analysis, Forecasting and Control
(1994) Smoothing, Forecasting and Prediction
(1963)- Bruno, N., Chaudhuri, S., Gravano, L., 2001. STHoles: a multidimensional workload-aware histogram. In: ACM SIGMOD, pp....
Time-Series Forecasting
(2001)- et al.
Holt-Winters forecasting: some practical issues
The Statistician
(1988)
Quantitative Forecasting Methods
Applied Regression Analysis, Linear Models, and Related Methods
Forecasting trends in time series
Management Science
Cited by (4)
A query execution scheduling scheme for Impala system
2018, Concurrency and Computation: Practice and ExperienceQuery workload aware multi-histogram based on equi-width sub-histograms for selectivity estimations of range queries
2015, Communications in Computer and Information ScienceSynopses for massive data: Samples, histograms, wavelets, sketches
2011, Foundations and Trends in DatabasesVisual basic application for statistical process control: A case of metal frame for actuator production process
2010, Proceedings of the International MultiConference of Engineers and Computer Scientists 2010, IMECS 2010
Zhen He is a lecturer in the department of computer science at La Trobe University. His main research areas are database systems optimization, time series mining, wireless sensor networks and XML information retrieval. Prior to joining La Trobe University, he worked as a postdoctoral research associate in the University of Vermont. He holds a Bachelor of Computer Science, Honors and Ph.D. degrees from the Australian National University.
Byung Suk Lee is Associate Professor of Computer Science at the University of Vermont. His main research areas are database systems, data management, and query processing. He held positions in industry and academia: Gold Star Electric, Bell Communications Research, Datacom Global Communications, University of St. Thomas, and currently University of Vermont. He was also a visiting professor at Dartmouth College and a participating guest at Lawrence Livermore National Laboratory. He served on international conferences as a program committee member, a publicity chair, and a special session organizer, and also on US federal funding proposal review panel. He holds a B.S. degree from Seoul National University, M.S. from Korea Advanced Institute of Science and Technology, and Ph.D. from Stanford University.
X. Sean Wang is the Dorothean Professor of Computer Science at the University of Vermont (UVM), Burlington, Vermont. He obtained his PhD degree in 1992 in Computer Science from the University of Southern California, Los Angeles, California, and earned his MS and BS degrees in Computer Science earlier from Fudan University, Shanghai, China. In 1992, he joined the faculty of the Information and Software Engineering Department at George Mason University, Fairfax, Virginia, and in 2003, he moved to UVM. His research areas include database systems, system support for temporal data and time series data, data mining, temporal reasoning, and information security. He is the Principal Investigator or Co-Principal Investigator of a number of federally sponsored research projects, and was a recipient of both the National Science Foundation’s Career and Research Initiation Awards. He has published widely in the general area of database systems in journals and at conferences such as ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM SIGMOD International Conference on Management of Data (SIGMOD), and Very Large Data Bases Conference (VLDB). He has served on Program Committees (PC), as PC chair, or in other capacities for many conference organizations, and is on the editorial boards of a number of technical journals.
- ☆
This work was partially done while the author was at the Department of Computer Science, University of Vermont.