A sampling approach for skyline query cardinality estimation

Luo, Cheng; Jiang, Zhewei; Hou, Wen-Chi; He, Shan; Zhu, Qiang

doi:10.1007/s10115-011-0441-1

Regular Paper
Published: 16 September 2011

Volume 32, pages 281–301, (2012)

Knowledge and Information Systems



Abstract

A skyline query returns a set of candidate records that satisfy several preferences. It is an operation commonly performed to aid decision making. Since executing a skyline query is expensive and a query plan may combine skyline queries with other data operations such as join, it is important that the query optimizer can quickly yield an accurate cardinality estimate for a skyline query. Log Sampling (LS) and Kernel-Based (KB) skyline cardinality estimation are the two state-of-the-art methods. LS is based on a hypothetical model A(log(n))^B. Since this model was originally derived under strong assumptions, such as independence between dimensions, it does not apply well to an arbitrary data set. Consequently, LS can yield large estimation errors. KB relies on integrating the estimated probability density function (PDF) to derive the scale factor Ψ_ds. As both the estimation of the PDF and the ensuing integration involve complex mathematical calculations, KB is time consuming. In view of these problems, we propose an innovative purely sampling-based (PS) method for skyline cardinality estimation. PS is non-parametric: it does not assume any particular data distribution and is thus more robust than LS. PS also does not require complex mathematical calculations, so it is much simpler to implement than KB and much faster to yield estimates. Extensive empirical studies show that, for a variety of real and synthetic data sets, PS outperforms LS in terms of estimation speed, estimation accuracy, and estimation variability under the same space budget. PS outperforms KB in terms of estimation speed and estimation variability under the same performance mark.
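
To make the terms in the abstract concrete, the following is a minimal sketch, not the authors' implementation and not the PS method, whose details the abstract does not give. It shows what a skyline is and how a model of the form A(log(n))^B can be extrapolated from samples. It assumes that smaller values are preferred in every dimension, and that A and B are fitted from the skyline sizes of two random samples, which is one simple way to fit such a model. All function names (`dominates`, `skyline`, `fit_log_model`, `estimate_cardinality`) are hypothetical.

```python
import math
import random

def dominates(p, q):
    """True if p is no worse than q in every dimension and strictly better in one.
    Assumption: smaller values are preferred in all dimensions."""
    return all(a <= b for a, b in zip(p, q)) and any(a < b for a, b in zip(p, q))

def skyline(points):
    """Naive O(n^2) skyline: keep every point that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]

def fit_log_model(n1, s1, n2, s2):
    """Solve s = A * (ln n)^B for A and B from two (sample size, skyline size) pairs.
    Requires n2 > n1 > 1 and s1, s2 >= 1."""
    b = math.log(s2 / s1) / math.log(math.log(n2) / math.log(n1))
    a = s1 / math.log(n1) ** b
    return a, b

def estimate_cardinality(data, n1, n2, rng):
    """Extrapolate the skyline size of `data` from two random samples."""
    s1 = len(skyline(rng.sample(data, n1)))
    s2 = len(skyline(rng.sample(data, n2)))
    a, b = fit_log_model(n1, s1, n2, s2)
    return a * math.log(len(data)) ** b

if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic 3-dimensional data with independent uniform attributes.
    data = [tuple(rng.random() for _ in range(3)) for _ in range(5000)]
    print("estimated skyline size:", estimate_cardinality(data, 200, 800, rng))
    print("exact skyline size:", len(skyline(data)))
```

Because the fitted curve inherits the assumptions behind the model, an estimate of this kind can drift on correlated or anti-correlated data, which is the weakness of LS that the abstract points to.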

Author information

Authors and Affiliations

Department of Mathematics and Computer Science, Coppin State University, 2500 West North Avenue, Baltimore, MD, 21216, USA
Cheng Luo
Frederick Community College, 7932 Opossumtown Pike, Frederick, MD, 21702, USA
Zhewei Jiang
Computer Science Department, Southern Illinois University Carbondale, Carbondale, IL, 62901, USA
Wen-Chi Hou
School of Economics and Management, Southwest Petroleum University, Chengdu, 610500, Sichuan, People’s Republic of China
Shan He
Department of Computer and Information Science, University of Michigan, Dearborn, MI, 48128, USA
Qiang Zhu

Corresponding author

Correspondence to Cheng Luo.

About this article

Cite this article

Luo, C., Jiang, Z., Hou, WC. et al. A sampling approach for skyline query cardinality estimation. Knowl Inf Syst 32, 281–301 (2012). https://doi.org/10.1007/s10115-011-0441-1

Received: 12 July 2010
Revised: 12 July 2011
Accepted: 27 August 2011
Published: 16 September 2011
Issue Date: August 2012
DOI: https://doi.org/10.1007/s10115-011-0441-1
