Proactive and reactive multi-dimensional histogram maintenance for selectivity estimation

https://doi.org/10.1016/j.jss.2007.03.088Get rights and content

Abstract

Many state-of-the-art selectivity estimation methods use query feedback to maintain histogram buckets, thereby using the limited memory efficiently. However, they are “reactive” in nature, that is, they update the histogram based on queries that have come to the system in the past for evaluation. In some applications, future occurrences of certain queries may be predicted and a “proactive” approach can bring much needed performance gain, especially when combined with the reactive approach. For these applications, this paper provides a method that builds customized proactive histograms based on query prediction and mergers them into reactive histograms when the predicted future arrives. Thus, the method is called the proactive and reactive histogram (PRHist). Two factors affect the usefulness of the proactive histograms and are dealt with during the merge process: the first is the predictability of queries and the second is the extent of data updates. PRHist adjusts itself to be more reactive or more proactive depending on these two factors. Through extensive experiments using both real and synthetic data and query sets, this paper shows that in most cases, PRHist outperforms STHoles, the state-of-the-art reactive method, even when only a small portion of the queries are predictable and a significant portion of data is updated.

Introduction

Extensive literature exists on the selectivity estimation problem. A variety of solutions have been proposed, but most popular ones are histogram-based solutions. Many state-of-the-art histogram methods adopt a “self-tuning” strategy, and construct and maintain histograms based on the feedback information provided in the form of query results (Aboulnaga and Chaudhuri, 1999, Bruno et al., 2001, Stillger et al., 2001). This allows them to keep more histogram buckets in regions queried more frequently. Due to the common spatial and temporal locality of user queries, these methods usually make more efficient use of the limited memory than methods based solely on data distribution.

However, all existing self-tuning histograms are “reactive” in nature, that is, they are updated based on the feedback information of the past queries only. This reactive approach does not consider the possibility that certain future query occurrences can be predicted. In this paper, by building and using “proactive histograms” based on query prediction, we show that a proactive approach can make more efficient use of histogram memory, when combined with the reactive approach. We believe this paper is the first to introduce the idea of proactive histograms and to provide a method to take advantage of these histograms.

Query prediction may be obtained by inspecting the query log and identifying patterns in it, or may be derived directly from the business practice of the enterprise. For example, (1) in a university database, queries on students’ course grades occur more frequently at the end of every semester, and (2) in a financial company database, queries on financial records occur more frequently at the end of every financial quarter.

Queries are predictable to a varying degree, either because they follow patterns to a varying degree or the patterns are recognizable to a varying degree. It is thus important to handle queries with different degrees of predictability. To this end, we equip our method with the ability to adjust itself to be more proactive or reactive depending on query predictability. We call our method the proactive and reactive histogram (PRHist).

Within the PRHist method, we need to deal with two main issues: (1) predicting future queries, and (2) using the predicted queries to improve selectivity estimation. In order to address the first issue, we have PRHist find patterns in the query log.1 Looking for patterns of individual query occurrences is not feasible because those occurrences may rarely repeat themselves (the classical overfitting phenomenon). Therefore, PRHist groups queries into “clusters” and looks for patterns considering all queries in the same cluster as a whole. Section 4.1 provides a formal definition of a cluster of queries and cluster patterns.

In order to address the second issue, we have PRHist operate in two phases: off-line and on-line. In the off-line phase, we divide the future into a sequence of consecutive subintervals of time which we call f-subintervals. Then we predict a set of queries for each f-subinterval. Then, it builds a sequence of proactive histograms, one histogram for each of the f-subintervals. In the on-line phase, PRHist maintains an on-line reactive histogram using query feedback as done in the STHoles. At the beginning of each f-subinterval, it loads the proactive histogram built for the f-subinterval and merges it with the on-line reactive histogram. At the time of the merge, PRHist assigns a weight to each histogram based on its “confidence” in the proactive histogram, thereby becoming more reactive (when the confidence is low) or more proactive (otherwise). The confidence is measured as a combination of two factors: the predictability of queries and the extent of data updates.

Our method incurs very small additional run-time overhead over above existing reactive histogram techniques. This small overhead is for loading the off-line proactive histogram into memory and merging it with the on-line reactive histogram at the beginning of each f-subinterval. For example, if a proactive histogram is loaded every 30 min as is the case in our experiments with the real query set, this amounts to one or two extra page loadings2 plus one merge operation for every 30 min. The merge operation is fast since the histograms often contain only a small number of buckets.

We have conducted extensive performance comparisons between PRHist and the state-of-the-art reactive histogram method STHoles, using both real and synthetic data sets. The results show that PRHist outperforms STHoles for most test cases even when only a very small portion of the queries are predictable or even when there are a high percentage of data updates.

We make two key contributions through this paper. We (1) develop a novel histogram maintenance method, which uses proactive histograms complemented with reactive histograms to improve selectivity estimation accuracy, and (2) conduct an extensive performance study to show the performance advantage of our method in a variety of situations.

The rest of the paper is organized as follows. Following this introduction, we discuss related work in Section 2, and formally define the problem addressed by PRHist in Section 3. We describe the off-line proactive histogram construction in Section 4 and the on-line PRHist maintenance in Section 5. In Section 6, we present the experiments, and with Section 7, we conclude the paper.

Section snippets

Related work

In this section we discuss three areas of related work: histograms for selectivity estimation, forecasting methods for query prediction, and proactive optimization methods used in database systems.

Problem formulation

In this section, we formally define the selectivity estimation problem addressed by PRHist.

Off-line proactive histogram sequence construction

Fig. 3 shows the high-level algorithm used for off-line proactive histogram sequence construction. First, the query log is used to find a set of “useful” clusters (line 1). These clusters are then used for predicting queries, which in turn are used to construct a sequence of proactive histograms (line 2). Then, optimal weights are assigned to the buckets of the proactive and reactive histograms using an optimization algorithm (line 3). (The weights are used during the on-line phase for

On-line PRHist maintenance

Fig. 12 gives a high-level overview of the on-line maintenance of PRHist. A proactive histogram built off-line is loaded and merged with an on-line reactive histogram at the beginning of every f-subinterval. At this point, the bucket weights are adjusted depending on the percentage of data updates. Then, during the f-subinterval, the histogram is updated reactively with the feedback from incrementally occurring queries.

Experiments

We evaluate PRHist by comparing it with STHoles using both real and synthetic query/data sets. In this section, we first describe the experimental setup, and then present the results from five experiments.

Conclusions

In this paper we have developed a novel multi-dimensional histogram method, called PRHist. We believe it is the first method that incorporates proactive histograms for selectivity estimation. PRHist has an off-line and an on-line phase. In the off-line phase it proactively builds customized histograms for query sets predicted for future time intervals. To accommodate the fact that predictability of queries may vary, PRHist uses a weighted combination of reactive and proactive histograms during

Acknowledgments

We thank Dennis Fuchs for providing us with the implementation of STHoles. We also thank Nancy Snow Littlefield, Michael Grundhauser, and Keith Kennedy for providing us with the real data and query sets. This research has been supported by the US National Science Foundation through Grant No. IIS-0415023 and the US Department of Energy through Grant No. DE-FG02-ER45962.

Zhen He is a lecturer in the department of computer science at La Trobe University. His main research areas are database systems optimization, time series mining, wireless sensor networks and XML information retrieval. Prior to joining La Trobe University, he worked as a postdoctoral research associate in the University of Vermont. He holds a Bachelor of Computer Science, Honors and Ph.D. degrees from the Australian National University.

References (37)

  • Aboulnaga, A., Chaudhuri, S., 1999. Self-tuning histograms: Building histograms without looking at data. In: ACM...
  • Acharya, A., Gibbons, P.B., Poosala, V., 2000. Congressional samples for approximate answering of group-by queries. In:...
  • Agrawal, D., Abbadi, A.E., Singh, A.K., Yurek, T., 1997. Efficient view maintenance at data warehouses. In: Proceedings...
  • Babcock, B., Chaudhuri, S., Das, G., 2003. Dynamic sample selection for approximate query processing. In: Proceedings...
  • Babu, S., Bizarro, P., DeWitt, D., 2005. Proactive re-optimization. In: Processings of SIGMOD, pp....
  • P. Box et al.

    Time Series Analysis, Forecasting and Control

    (1994)
  • G.R. Brown

    Smoothing, Forecasting and Prediction

    (1963)
  • Bruno, N., Chaudhuri, S., Gravano, L., 2001. STHoles: a multidimensional workload-aware histogram. In: ACM SIGMOD, pp....
  • C. Chatfield

    Time-Series Forecasting

    (2001)
  • C. Chatfield et al.

    Holt-Winters forecasting: some practical issues

    The Statistician

    (1988)
  • Chaudhuri, S., Das, G., Narasayya, V., 2001. A robust, optimization-based approach for approximate answering of...
  • Colby, L.S., Griffin, T., Libkin, L., Mumick, I.S., Trickey, H., 1996. Algorithms for deferred view maintenance. In:...
  • Curewitz, K.M., Krishnan, P., Vitter, J.S., 1993. Proceedings of practical prefetching via data compression. In: ACM...
  • Donjerkovic, D., Ioannidis, Y., Ramakrishnan, R., 2000. Dynamic histograms: capturing evolving data sets. In: ICDE, pp....
  • R.N. Farnum et al.

    Quantitative Forecasting Methods

    (1989)
  • J. Fox

    Applied Regression Analysis, Linear Models, and Related Methods

    (1997)
  • S.E. Gardner et al.

    Forecasting trends in time series

    Management Science

    (1985)
  • Gerlhof, C.A., Kemper, A., 1994. A multi-threaded architecture for prefetching in object bases. In: EDBT, pp....
  • Cited by (4)

    Zhen He is a lecturer in the department of computer science at La Trobe University. His main research areas are database systems optimization, time series mining, wireless sensor networks and XML information retrieval. Prior to joining La Trobe University, he worked as a postdoctoral research associate in the University of Vermont. He holds a Bachelor of Computer Science, Honors and Ph.D. degrees from the Australian National University.

    Byung Suk Lee is Associate Professor of Computer Science at the University of Vermont. His main research areas are database systems, data management, and query processing. He held positions in industry and academia: Gold Star Electric, Bell Communications Research, Datacom Global Communications, University of St. Thomas, and currently University of Vermont. He was also a visiting professor at Dartmouth College and a participating guest at Lawrence Livermore National Laboratory. He served on international conferences as a program committee member, a publicity chair, and a special session organizer, and also on US federal funding proposal review panel. He holds a B.S. degree from Seoul National University, M.S. from Korea Advanced Institute of Science and Technology, and Ph.D. from Stanford University.

    X. Sean Wang is the Dorothean Professor of Computer Science at the University of Vermont (UVM), Burlington, Vermont. He obtained his PhD degree in 1992 in Computer Science from the University of Southern California, Los Angeles, California, and earned his MS and BS degrees in Computer Science earlier from Fudan University, Shanghai, China. In 1992, he joined the faculty of the Information and Software Engineering Department at George Mason University, Fairfax, Virginia, and in 2003, he moved to UVM. His research areas include database systems, system support for temporal data and time series data, data mining, temporal reasoning, and information security. He is the Principal Investigator or Co-Principal Investigator of a number of federally sponsored research projects, and was a recipient of both the National Science Foundation’s Career and Research Initiation Awards. He has published widely in the general area of database systems in journals and at conferences such as ACM Transactions on Database Systems (TODS), IEEE Transactions on Knowledge and Data Engineering (TKDE), ACM SIGMOD International Conference on Management of Data (SIGMOD), and Very Large Data Bases Conference (VLDB). He has served on Program Committees (PC), as PC chair, or in other capacities for many conference organizations, and is on the editorial boards of a number of technical journals.

    This work was partially done while the author was at the Department of Computer Science, University of Vermont.

    View full text