Abstract
A skyline query returns a set of candidate records that satisfy several preferences. It is an operation commonly performed to aid decision making. Since executing a skyline query is expensive and a query plan may combine skyline queries with other data operations such as joins, it is important that the query optimizer can quickly yield an accurate cardinality estimate for a skyline query. Log Sampling (LS) and Kernel-Based (KB) skyline cardinality estimation are the two state-of-the-art methods. LS is based on a hypothetical model A(log(n))^B. Since this model was originally derived under strong assumptions, such as independence between dimensions, it does not apply well to arbitrary data sets. Consequently, LS can yield large estimation errors. KB relies on integrating the estimated probability density function (PDF) to derive the scale factor Ψ_ds. As both the PDF estimation and the ensuing integration involve complex mathematical calculations, KB is time consuming. In view of these problems, we propose a purely sampling-based (PS) method for skyline cardinality estimation. PS is non-parametric: it does not assume any particular data distribution and is thus more robust than LS. PS also requires no complex mathematical calculations, so it is much simpler to implement than KB and yields estimates much faster. Extensive empirical studies show that, for a variety of real and synthetic data sets, PS outperforms LS in estimation speed, estimation accuracy, and estimation variability under the same space budget. PS outperforms KB in estimation speed and estimation variability under the same performance mark.
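To make the LS model concrete, the following is a minimal sketch of how a model of the form A(log(n))^B could be fitted from two random samples and then extrapolated to the full data set. It is an illustration under stated assumptions, not the authors' PS method and not necessarily the exact LS procedure: the function names (`dominates`, `skyline`, `fit_log_model`, `estimate_cardinality`) are hypothetical, smaller values are assumed to be preferred in every dimension, and the skyline is computed with a naive quadratic scan.

```python
import math
import random


def dominates(p, q):
    """Return True if p is no worse than q in every dimension and strictly
    better in at least one (assumption: smaller values are preferred)."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))


def skyline(points):
    """Naive O(n^2) skyline: keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def fit_log_model(n1, s1, n2, s2):
    """Solve s = A * (log n)^B for A and B from two (sample size, skyline size)
    pairs. Requires n1 != n2, both sizes > 1, and both skyline sizes > 0."""
    b = math.log(s1 / s2) / math.log(math.log(n1) / math.log(n2))
    a = s1 / math.log(n1) ** b
    return a, b


def estimate_cardinality(points, n1, n2, seed=0):
    """Estimate the skyline size of `points` by fitting the log model on two
    random samples of sizes n1 and n2, then extrapolating to len(points)."""
    rng = random.Random(seed)
    s1 = len(skyline(rng.sample(points, n1)))
    s2 = len(skyline(rng.sample(points, n2)))
    a, b = fit_log_model(n1, s1, n2, s2)
    return a * math.log(len(points)) ** b
```

Because A and B are fitted to a fixed functional form, the extrapolation is only as good as that form's fit to the data, which is the weakness of LS that the abstract points out for data sets with correlated dimensions.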
Cite this article
Luo, C., Jiang, Z., Hou, WC. et al. A sampling approach for skyline query cardinality estimation. Knowl Inf Syst 32, 281–301 (2012). https://doi.org/10.1007/s10115-011-0441-1
DOI: https://doi.org/10.1007/s10115-011-0441-1