A fast and progressive algorithm for skyline queries with totally- and partially-ordered domains

https://doi.org/10.1016/j.jss.2009.09.032

Abstract

We devise a skyline algorithm that can efficiently mitigate the enormous overhead of processing millions of tuples on totally- and partially-ordered domains (henceforth, TODs and PODs). With massive datasets, existing techniques spend a significant amount of time on a dominance comparison because of both a large number of skyline points and the unprogressive method of skyline computing with PODs. (If data has high dimensionality, the situation is undoubtedly aggravated.) The progressiveness property turns out to be the key feature for solving all remaining problems. This article presents a FAST-SKY algorithm that deals successfully with these two obstacles and improves skyline query processing time strikingly, even with high-dimensional data. Progressive skyline evaluation with PODs is guaranteed by new index structures and topological sorting order. A stratification technique is adopted to index data on PODs, and we propose two new index structures: stratified R-trees (SR-trees) for low-dimensional data and stratified MinMax treaps (SM-treaps) for high-dimensional data. A fast dominance comparison is achieved by using a reporting query instead of a dominance query, and a dimensionality reduction technique. Experimental results suggest that in general cases (anti-correlated and uniform distributions) FAST-SKY is orders of magnitude faster than existing algorithms.

Introduction

The database research community has recently given considerable attention to skyline computation, particularly for progressive algorithms that can immediately return intermediate skyline results. Using the common definition in the literature, given a set of objects $p_1, p_2, \ldots, p_N$, the skyline operator returns all objects $p_i$ such that $p_i$ is not dominated by another object $p_j$. The underlying query type in skyline computation is known to be a dominance query, which requires a series of pairwise comparisons.
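The definition above can be illustrated with a minimal sketch. It assumes min semantics on every attribute (smaller is better) and uses a naive quadratic scan, which is not the FAST-SKY algorithm; the helper names `dominates` and `skyline` are hypothetical.

```python
from typing import List, Sequence, Tuple


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Return True if p dominates q under min semantics (assumed here):
    p is no worse than q on every attribute and strictly better on one."""
    no_worse = all(a <= b for a, b in zip(p, q))
    strictly_better = any(a < b for a, b in zip(p, q))
    return no_worse and strictly_better


def skyline(points: List[Tuple[float, ...]]) -> List[Tuple[float, ...]]:
    """Naive skyline: keep every point that no other point dominates.
    Each membership test is a dominance query, i.e. a series of
    pairwise comparisons."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


points = [(1, 9), (3, 3), (4, 4), (9, 1), (5, 8)]
result = skyline(points)
print(result)
```

Here `(4, 4)` and `(5, 8)` are both dominated by `(3, 3)`, so only the remaining three points are returned. The quadratic number of comparisons in this sketch is the cost that grows with the number of skyline points.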

Since the skyline operator was formally introduced in Börzsönyi et al. (2001), many researchers have devised novel algorithms for skyline queries in various environments (see Balke et al., 2004, Papadias et al., 2005, Chan et al., 2005, Sacharidis et al., 2009, Kossmann et al., 2002, Pei et al., 2006, Wang et al., 2007, Lee et al., 2007, Lian and Chen, 2008, Vlachou et al., 2008). Most skyline evaluation methods developed recently are aimed at efficient computation with TODs, and we have seen remarkable progress in that area. On the other hand, skyline computation with TODs and PODs remained immature from the time it was first tackled by Chan et al. (2005) until the recent work of Sacharidis et al. (2009), which advanced the performance of skyline processing with PODs using topological sort.

The performance of skyline computation is mainly affected by two activities. The first one, which has already been researched extensively, is the IO overhead of accessing data. Many researchers have argued that reducing the number of IO accesses to a disk would be of great benefit, and many algorithms have shown substantial improvements. However, the IO overhead becomes less significant as data grows to millions of records. Rather, the CPU cost of carrying out a series of dominance checks against existing skyline points, which is the second activity, becomes the major factor in performance overhead. Existing skyline techniques do not fully take this second type of overhead into consideration and, as a result, leave much room for further enhancement. In addition, if data has high dimensionality, the entire overhead grows nontrivially; in particular, skyline processing with high-dimensional data on PODs has not yet been researched thoroughly.

The main focus of this paper is on developing a fast and progressive algorithm for skyline queries with TODs and PODs. We propose the FAST-SKY algorithm, which has the progressiveness property with TODs and PODs; as a result, FAST-SKY can use a reporting query in dominance comparisons to tremendous effect. In order to achieve IO-optimality and progressiveness (Tan et al., 2001) in skyline computation with PODs, we adopt a stratification technique and design two new index structures, each of which has merits and demerits. The stratification, which is applied to PODs through domain classification and partitions data into disjoint sets according to a value from a POD, enables the FAST-SKY algorithm to perform IO-optimal and progressive skyline evaluation.
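The stratification idea can be sketched as follows. This is an illustrative reading of the paragraph, not the paper's implementation: it assumes a single POD attribute whose preference relation is supplied as hypothetical `(better, worse)` pairs, and it uses Python's standard `graphlib` (Python 3.9+) for the topological order. The names `stratify`, `stratum_order`, and `prefs` are made up for the example.

```python
from collections import defaultdict
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def stratify(points, pod_index):
    """Partition points into disjoint strata keyed by their POD value."""
    strata = defaultdict(list)
    for p in points:
        strata[p[pod_index]].append(p)
    return dict(strata)


def stratum_order(better_than):
    """Order POD values so that every value appears after all values
    preferred over it. `better_than` holds (better, worse) pairs; this
    input format is an assumption of the sketch."""
    preds = defaultdict(set)
    for better, worse in better_than:
        preds[worse].add(better)
        preds[better]  # register values that have no predecessor
    return list(TopologicalSorter(preds).static_order())


# Hypothetical partial order: a is preferred over b and c, which are
# mutually incomparable, and both are preferred over d.
prefs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
points = [(10, "a"), (5, "b"), (7, "b"), (3, "d")]

strata = stratify(points, pod_index=1)
order = stratum_order(prefs)
visit = [p for value in order for p in strata.get(value, [])]
print(order, visit)
```

Under the usual definition of dominance (at least as good on every attribute), a point whose POD value comes later in this order cannot dominate a point from an earlier stratum, because its POD value is neither equal to nor preferred over the earlier one. That is the property that lets strata be processed one after another without revisiting earlier results.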

The new index structures are stratified R-trees (SR-trees) and stratified MinMax treaps (SM-treaps). Due to the stratification, both index structures are not concerned with attributes in PODs, only attributes in TODs. An SR-tree considers full d-dimensionality of data, while an SM-treap sees only two dimensions, min and max coordinates, when indexing data. An SR-tree has an advantage in that it supports all types of progressive evaluations such as arbitrary dimensionality. The drawback, however, is the severe performance degradation of the R-tree that indexes high-dimensional data (d>10). To overcome the performance deterioration, an SM-treap enforces dimensionality reduction on multidimensional data in such a way that data is represented by two coordinates. It then inserts data into a treap with min (key) and max (priority) values. The merits of an SM-treap are twofold. An SM-treap can index high-dimensional data and support progressive computation, and unlike the R-tree-based algorithm (i.e., BBS (Papadias et al., 2005)), it does not require a heap space to buffer skyline candidates because the preorder traversal of an SM-treap guarantees progressive processing. Eliminating the heap overhead by guaranteeing progressive processing, which was discussed in ZBtree (Lee et al., 2007) using data on an integer domain (TODs only), contributes to a substantial performance gain in our problem domain as well (real TODs and PODs). To the best of our knowledge, this is the first work on space efficiency and the progressiveness property of an SM-treap for high-dimensional data in a real domain.
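As a rough sketch of the SM-treap idea, the code below reduces each point to its min and max coordinates and inserts it into a treap keyed on min, with max as the heap priority. Treating the priority as a min-heap (smallest max at the root) is an assumption of this sketch, as are the names `Node`, `insert`, and `preorder`; stratification and the paper's actual node layout are not modeled.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Node:
    key: float       # min coordinate of the point (search-tree key)
    priority: float  # max coordinate of the point (heap priority)
    point: Tuple[float, ...]
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def rotate_right(n: Node) -> Node:
    l = n.left
    n.left, l.right = l.right, n
    return l


def rotate_left(n: Node) -> Node:
    r = n.right
    n.right, r.left = r.left, n
    return r


def insert(root: Optional[Node], point: Sequence[float]) -> Node:
    """Insert a point: binary-search-tree order on min(point), and
    (assumed) min-heap order on max(point), restored by rotations."""
    key, priority = min(point), max(point)
    if root is None:
        return Node(key, priority, tuple(point))
    if key < root.key:
        root.left = insert(root.left, point)
        if root.left.priority < root.priority:
            root = rotate_right(root)
    else:
        root.right = insert(root.right, point)
        if root.right.priority < root.priority:
            root = rotate_left(root)
    return root


def preorder(root: Optional[Node]) -> List[Tuple[float, ...]]:
    """Visit a node before its subtrees, so smaller priorities come first
    along every root-to-leaf path."""
    if root is None:
        return []
    return [root.point] + preorder(root.left) + preorder(root.right)


root = None
for pt in [(3, 7), (1, 4), (5, 6), (2, 9)]:
    root = insert(root, pt)
order = preorder(root)
print(order)
```

With this choice of priority, the root always holds the point with the smallest max coordinate, and the traversal needs only the recursion stack rather than a separate candidate heap.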

For an efficient dominance comparison, we use the well-known property in orthogonal range searching that if an intermediate skyline set is an insert-only set, then a dominance check can be realized by using a reporting query, instead of a dominance query. In other words, if skyline computation is progressive, then a dominance check can be optimized directly by the above property. Because of the progressiveness property, FAST-SKY can also exploit a reporting query. To deal with high-dimensional skyline points as well, we reduce all intermediate skyline points to 2-dimensional space by using a dimensionality reduction technique; we adopt and modify the iMinMax technique (Yu et al., 2004). Performing a dominance check using a reporting query on the reduced dimensions improves the entire query processing time significantly.
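One way to picture a reporting query on reduced dimensions is sketched below. It relies on a simple necessary condition under min semantics: if s dominates p, then min(s) ≤ min(p) and max(s) ≤ max(p), so only skyline points inside that two-sided range need a full comparison. A sorted Python list stands in for the B+-tree, and the class `ReducedSkyline` with its methods is hypothetical; the paper's modified iMinMax mapping and pruning rules are not reproduced here.

```python
import bisect
from typing import List, Sequence, Tuple


def dominates(p: Sequence[float], q: Sequence[float]) -> bool:
    """Min-semantics dominance: p <= q everywhere and p < q somewhere."""
    return (all(a <= b for a, b in zip(p, q))
            and any(a < b for a, b in zip(p, q)))


class ReducedSkyline:
    """Insert-only set of skyline points indexed by (min, max)."""

    def __init__(self) -> None:
        self._entries: List[Tuple[float, float, Tuple[float, ...]]] = []

    def insert(self, point: Sequence[float]) -> None:
        # Points are only ever added, never removed (insert-only set).
        bisect.insort(self._entries, (min(point), max(point), tuple(point)))

    def report(self, point: Sequence[float]) -> List[Tuple[float, ...]]:
        """Report every stored point s with min(s) <= min(point) and
        max(s) <= max(point); only these can dominate `point`."""
        lo, hi = min(point), max(point)
        end = bisect.bisect_right(self._entries, (lo, float("inf")))
        return [s for _, mx, s in self._entries[:end] if mx <= hi]

    def is_dominated(self, point: Sequence[float]) -> bool:
        return any(dominates(s, point) for s in self.report(point))


sky = ReducedSkyline()
for s in [(1, 9, 5), (3, 3, 3), (9, 1, 4)]:
    sky.insert(s)

print(sky.report((4, 4, 4)), sky.is_dominated((4, 4, 4)))
print(sky.report((2, 2, 8)), sky.is_dominated((2, 2, 8)))
```

For the candidate `(4, 4, 4)` the range reports a single skyline point, so one full comparison settles the check; for `(2, 2, 8)` the range is empty and no comparison is needed at all.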

The main contributions of this paper follow:

  • We address the problem of skyline computation with TODs and PODs even in high-dimensional space. We also note that the CPU cost of a dominance test becomes the major factor in performance overhead as datasets grow over millions of records.

  • For progressive skyline evaluation with PODs, we design two new index structures: SR-trees and SM-treaps, each of which has strengths and weaknesses when used for different purposes. In particular, SM-treaps resolve progressive skyline processing with PODs in high dimensional space efficiently.

  • We achieve a fast dominance test by using a reporting query on reduced dimensions. An iMax-based B+-tree with novel pruning techniques decreases the cost of dominance tests.

The rest of the paper is organized as follows: Section 2 reviews the related work in further detail and discusses its advantages and limitations. Section 3 introduces fundamental definitions. Section 4 gives an in-depth explanation of the FAST-SKY algorithm, and Section 5 experimentally evaluates FAST-SKY and compares it to SDC+ in a variety of settings. Finally, Section 6 concludes the article and describes directions for future work.

Related work

Previous work on algorithms for computing skyline queries can be grouped into two categories, namely non-index-based (e.g., block nested loop (Börzsönyi et al., 2001) and divide & conquer (Börzsönyi et al., 2001)) and index-based (e.g., B-tree-based scheme (Börzsönyi et al., 2001), R-tree-based schemes (Börzsönyi et al., 2001, Papadias et al., 2005), bitmap (Tan et al., 2001), and index (Tan et al., 2001)). Typically, the index-based approaches outperform the non-index-based approaches. They

Preliminaries

This section introduces some notations and definitions we use in the rest of the paper. All data points are defined on the set of attributes $A = \{A_1, A_2, \ldots, A_n\}$, where $A = A_{total} \cup A_{partial}$ and where $A_{total}$ and $A_{partial}$ denote, respectively, the subsets of totally- and partially-ordered attributes.

Definition 1

Partially-ordered set = poset

For each attribute $A_i \in A$, $(D_i, \preceq_i)$ denotes the poset for its domain values $D_i$. Each $\preceq_i$ is a reflexive,

Fast and progressive skyline algorithm

Before we delve into the details, we would like to state that the skyline considered in this paper is the min operator for all attributes. Key motivations are discussed in Section 4.1; the remaining sections describe the algorithmic techniques of the FAST-SKY algorithm in detail.

Performance study

In this section, we give performance results detailing (i) the efficiency and progressiveness of FAST-SKY and (ii) the applicability of the fast dominance check.

Conclusion

In this article, we address the problem of skyline evaluation with TODs and PODs even with high-dimensional data. We also note that most database algorithms for skyline computation are based on a dominance query, which incurs severe computation overhead as data size increases to millions of records. We design two new index structures for indexing data on TODs and PODs via a stratification technique, and they are very efficient in progressive skyline processing. By utilizing the progressiveness

Acknowledgement

This study was supported by the Seoul R&BD Program (10561), Seoul, Korea.

Hyungsoo Jung received the BS degree in mechanical engineering from Korea University, Seoul, Korea, in 2002; and the MS and the PhD degrees in computer science from Seoul National University, Seoul, Korea in 2004 and 2009, respectively. He is currently a post-doctoral research associate at Seoul National University, Seoul, Korea. His research interests are in the areas of distributed systems, database systems, and transaction processing.

References (26)

  • Agrawal, R., Wimmers, E.L., 2000. A framework for expressing and combining preferences. In:...
  • Babcock, B., Babu, S., Datar, M., Motwani, R., Widom, J., 2002. Models and issues in data stream systems. In:...
  • Balke, W., Güntzer, U., Zheng, X., 2004. Efficient distributed skylining for web information systems. In:...
  • Börzsönyi, S., Kossmann, D., Stocker, K., 2001. The skyline operator. In:...
  • Chan, C.Y., Eng, P.K., Tan, K.L., 2005. Stratified computation of skylines with partially-ordered domains. In:...
  • Chazelle, B., 1990. Lower bounds for orthogonal range searching: I. The reporting case. JACM.
  • Chazelle, B., 1990. Lower bounds for orthogonal range searching: II. The arithmetic model. JACM.
  • Cohen, W.W., Schapire, R.E., Singer, Y., 1997. Learning to order things. In: Proceedings of the 1997 conference on...
  • Cormen, T.H., et al., 2001. Introduction to Algorithms.
  • Hadjieleftheriou, M., 2003. Spatial index library....
  • Kießling, W., 2002. Foundations of preferences in database systems. In:...
  • Kießling, W., Köstler, G., 2002. Preference sql – design, implementation, experiences. In:...
  • Kossmann, D., Ramsak, F., Rost, S., 2002. Shooting stars in the sky: an online algorithm for skyline queries. In:...

Cited by (7)

    • An efficient skyline framework for matchmaking applications

      2011, Journal of Network and Computer Applications
      Citation Excerpt :

      In our previous work (Han et al., 2009), we proposed an efficient sequential skyline algorithm for data with multiple low-cardinality and unrestricted attributes. In Jung et al. (2010), we extended our skyline algorithm to process skyline queries for partially-ordered attributes which generally have more complex structures than low-cardinality attributes. Additionally, the iMinMax-based (Yu et al., 2004) dominance-check algorithm is also proposed.

    • An Efficient Indexing Method for Skyline Computations with Partially Ordered Domains

      2017, IEEE Transactions on Knowledge and Data Engineering
    • Two-level filtering based skyline query algorithm in sensor networks

      2013, Xitong Fangzhen Xuebao / Journal of System Simulation
    • A data-related cluster architecture based pruning strategy

      2011, 2011 International Conference on Electronics, Communications and Control, ICECC 2011 - Proceedings
    Hyuck Han received his BS and MS degrees in Computer Science and Engineering from Seoul National University, Seoul, Korea, in 2003 and 2006, respectively. Currently, he is a PhD candidate at Seoul National University. His research interests are distributed computing systems, parallel computing, and database systems.

    Heon Y. Yeom is a Professor with the Department of Computer Science and Engineering, Seoul National Univ. He received his BS degree in computer science from Seoul National Univ. in 1984 and received the MS and PhD degree in computer science from Texas A&M Univ. in 1986 and 1992, respectively. From 1992 to 1993, he was with Samsung Data Systems as a Research Scientist. He joined the Department of Computer Science, Seoul National University in 1993, where he currently teaches and researches on distributed systems, multimedia systems and transaction processing, etc.

    Sooyong Kang received his BS degree in mathematics and the MS and PhD degrees in Computer Science, from Seoul National University, Seoul, Korea, in 1996, 1998, and 2002, respectively. He was then a Postdoctoral researcher in the School of Computer Science and Engineering, SNU. He is now with the Division of Computer Science and Engineering, Hanyang University, Seoul. His research interests include Operating System, Multimedia System, Storage System, Flash Memories and Next Generation Nonvolatile Memories.
