A sampling approach for skyline query cardinality estimation

  • Regular Paper
  • Knowledge and Information Systems

Abstract

A skyline query returns a set of candidate records that satisfy several preferences, and it is commonly used to aid decision making. Because executing a skyline query is expensive, and a query plan may combine skyline queries with other data operations such as joins, the query optimizer must be able to produce an accurate cardinality estimate for a skyline query quickly. Log Sampling (LS) and Kernel-Based (KB) skyline cardinality estimation are the two state-of-the-art methods. LS is based on a hypothetical model $A(\log n)^B$. Because this model was originally derived under strong assumptions, such as independence between dimensions, it does not fit arbitrary data sets well, and LS can therefore yield large estimation errors. KB integrates an estimated probability density function (PDF) to derive the scale factor $\Psi_{ds}$. Both the PDF estimation and the ensuing integration involve complex mathematical calculations, so KB is time consuming. To address these problems, we propose a purely sampling-based (PS) method for skyline cardinality estimation. PS is non-parametric: it does not assume any particular data distribution and is thus more robust than LS. PS also requires no complex mathematical calculations, so it is much simpler to implement than KB and yields estimates much faster. Extensive empirical studies on a variety of real and synthetic data sets show that PS outperforms LS in estimation speed, accuracy, and variability under the same space budget, and outperforms KB in estimation speed and variability under the same performance mark.
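As a minimal sketch of the two notions the abstract relies on, the Python below computes a skyline by pairwise dominance checks and fits a model of the LS form (read here as A times log n raised to the power B) from two sample measurements. Everything in it is an illustrative assumption rather than the paper's method: the "smaller is better" convention, the natural logarithm, the quadratic-time algorithm, and the names `dominates`, `skyline`, `fit_log_model`, and `extrapolate` are all hypothetical. It does not implement PS, LS, or KB as published.

```python
import math


def dominates(p, q):
    """Return True if p dominates q, assuming smaller values are preferred.

    p dominates q when p is no worse in every dimension and strictly
    better in at least one.
    """
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points):
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


def fit_log_model(n1, s1, n2, s2):
    """Solve s = A * (ln n)^B for A and B from two (size, skyline size) pairs.

    Assumes n1 != n2, both sizes > 1, and both skyline sizes > 0.
    """
    B = math.log(s2 / s1) / math.log(math.log(n2) / math.log(n1))
    A = s1 / math.log(n1) ** B
    return A, B


def extrapolate(A, B, n):
    """Evaluate the fitted model at a (typically larger) data set size n."""
    return A * math.log(n) ** B


# Hypothetical hotel records: (price, distance to the beach), lower is better.
hotels = [(50, 3.0), (80, 1.0), (60, 2.0), (90, 2.5), (70, 2.0)]
result = skyline(hotels)
print(result)  # [(50, 3.0), (80, 1.0), (60, 2.0)]
```

The skyline's size is the quantity a cardinality estimator tries to predict without running the full query. A parametric fit such as `fit_log_model` is only as good as the assumed model form, which is the weakness the abstract attributes to LS on data that violates the independence assumption.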

Author information

Corresponding author

Correspondence to Cheng Luo.

About this article

Cite this article

Luo, C., Jiang, Z., Hou, W.-C. et al. A sampling approach for skyline query cardinality estimation. Knowl Inf Syst 32, 281–301 (2012). https://doi.org/10.1007/s10115-011-0441-1

  • DOI: https://doi.org/10.1007/s10115-011-0441-1
