Abstract
Most histograms maintained by current DBMSs make the uniform frequency assumption and most commonly approximate all frequencies in a bucket by their average. Thus, these histograms require storing the average frequency for each bucket. The accuracy of any estimate computed from the histogram therefore depends heavily on the technique used to approximate the attribute values within each bucket. Several approaches for approximating the set of attribute values within a bucket have been studied in the literature. Some histograms record every distinct value that appears in each bucket, while others make crude assumptions about the value set. The most significant are the continuous values assumption, the uniform spread assumption and the point value assumption. Other existing approaches rely on sampling techniques to approximate the values inside a histogram bucket. The problem is that all of the proposed techniques assume that attribute values have equal spreads. Motivated by the inaccuracy of previous approaches in approximating value sets with non-uniform spreads, and by the significant estimation error that the various assumptions can produce, we need to compute d distinct values v1, v2, ..., vd that lie between the lowest and highest values in the range of each bucket without making any assumption about the value spreads. To this end, we propose an efficient algorithm for calculating these d values dynamically as new values are inserted into the attribute. The problem reduces to calculating the values of (d-2) quantiles, namely the 1/d-, 2/d-, …, (d-2)/d-quantiles, along with the lowest and highest values in the bucket. For each quantile to be estimated, we maintain a set of five markers that are updated after every new value is inserted into the attribute.
Experiments comparing the accuracy of the proposed algorithm with the uniform spread assumption, using various sets of values and different types of histograms, show the effectiveness of our technique, especially when values have non-equal spreads.
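The five-marker update described in the abstract resembles the P² scheme of Jain and Chlamtac (1985), which estimates a quantile from a stream without storing the observations. The following is a minimal Python sketch under that assumption, not the authors' implementation. The names `P2Quantile`, `approximate_bucket_values` and `uniform_spread_values` are hypothetical, and running one independent estimator per quantile is a simplification chosen for clarity.

```python
import random


class P2Quantile:
    """Streaming estimate of one p-quantile using five markers (P2-style)."""

    def __init__(self, p):
        self.p = p
        self.q = []                                   # marker heights
        self.n = [0, 1, 2, 3, 4]                      # actual marker positions
        self.want = [0, 2 * p, 4 * p, 2 + 2 * p, 4]   # desired positions
        self.step = [0, p / 2, p, (1 + p) / 2, 1]     # desired-position increments

    def add(self, x):
        q, n = self.q, self.n
        if len(q) < 5:               # bootstrap: keep the first five values sorted
            q.append(x)
            q.sort()
            return
        # Find the cell k with q[k] <= x < q[k+1]; extend the extremes if needed.
        if x < q[0]:
            q[0], k = x, 0
        elif x >= q[4]:
            q[4], k = x, 3
        else:
            k = max(i for i in range(4) if q[i] <= x)
        for i in range(k + 1, 5):    # markers above the cell move one position right
            n[i] += 1
        for i in range(5):
            self.want[i] += self.step[i]
        # Nudge the three interior markers toward their desired positions.
        for i in (1, 2, 3):
            gap = self.want[i] - n[i]
            if (gap >= 1 and n[i + 1] - n[i] > 1) or (gap <= -1 and n[i - 1] - n[i] < -1):
                s = 1 if gap > 0 else -1
                cand = self._parabolic(i, s)
                if not q[i - 1] < cand < q[i + 1]:
                    # Fall back to linear interpolation toward the neighbour.
                    cand = q[i] + s * (q[i + s] - q[i]) / (n[i + s] - n[i])
                q[i] = cand
                n[i] += s

    def _parabolic(self, i, s):
        """Piecewise-parabolic prediction of marker i moved by s positions."""
        q, n = self.q, self.n
        return q[i] + s / (n[i + 1] - n[i - 1]) * (
            (n[i] - n[i - 1] + s) * (q[i + 1] - q[i]) / (n[i + 1] - n[i])
            + (n[i + 1] - n[i] - s) * (q[i] - q[i - 1]) / (n[i] - n[i - 1])
        )

    def value(self):
        if len(self.q) < 5:
            raise ValueError("need at least five observations")
        return self.q[2]             # the middle marker tracks the p-quantile


def approximate_bucket_values(stream, d):
    """Return d values: the minimum, the i/d-quantile estimates (i = 1..d-2), the maximum."""
    estimators = [P2Quantile(i / d) for i in range(1, d - 1)]
    lo = hi = None
    for x in stream:
        lo = x if lo is None else min(lo, x)
        hi = x if hi is None else max(hi, x)
        for e in estimators:
            e.add(x)
    return [lo] + [e.value() for e in estimators] + [hi]


def uniform_spread_values(lo, hi, d):
    """Baseline: d equally spaced values between lo and hi (uniform spread assumption)."""
    return [lo + i * (hi - lo) / (d - 1) for i in range(d)]
```

Calling `approximate_bucket_values(values, d)` on the values falling into one bucket, and `uniform_spread_values(min(values), max(values), d)` on the same range, gives two candidate value sets that can be compared against the bucket's true distinct values. Each estimator keeps only five marker heights and positions, so memory per bucket grows with d rather than with the number of inserted values.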
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Labbadi, W., Akaichi, J. (2014). Efficient Algorithm to Approximate Values with Non-uniform Spreads Inside a Histogram Bucket. In: Ait Ameur, Y., Bellatreche, L., Papadopoulos, G.A. (eds) Model and Data Engineering. MEDI 2014. Lecture Notes in Computer Science, vol 8748. Springer, Cham. https://doi.org/10.1007/978-3-319-11587-0_28
DOI: https://doi.org/10.1007/978-3-319-11587-0_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-11586-3
Online ISBN: 978-3-319-11587-0
eBook Packages: Computer Science, Computer Science (R0)