Elsevier

Information Sciences

Volume 179, Issue 19, 9 September 2009, Pages 3286-3308
The partitioned-layer index: Answering monotone top-k queries using the convex skyline and partitioning-merging technique

https://doi.org/10.1016/j.ins.2009.05.016

Abstract

A top-k query returns k tuples with the highest (or the lowest) scores from a relation. The score is computed by combining the values of one or more attributes. We focus on top-k queries having monotone linear score functions. Layer-based methods are well-known techniques for top-k query processing. These methods construct a database as a single list of layers. Here, the ith layer has the tuples that can be the top-i tuple. Thus, these methods answer top-k queries by reading at most k layers. Query performance, however, is poor when the number of tuples in each layer (simply, the layer size) is large. In this paper, we propose a new layer-ordering method, called the Partitioned-Layer Index (simply, the PL Index), that significantly improves query performance by reducing the layer size. The PL Index uses the notion of partitioning, which constructs a database as multiple sublayer lists instead of a single layer list, thereby reducing the layer size. The PL Index also uses the convex skyline, which is a subset of the skyline, to construct a sublayer to further reduce the layer size. The PL Index has the following desirable properties. The query performance of the PL Index is quite insensitive to the weights of attributes (called the preference vector) of the score function and is approximately linear in the value of k. The PL Index is capable of tuning query performance for the most frequently used value of k by controlling the number of sublayer lists. Experimental results using synthetic and real data sets show that the query performance of the PL Index significantly outperforms existing methods except for small values of k (say, k ≤ 9).

Introduction

A top-k (ranked) query returns k tuples with the highest (or the lowest) scores in a relation [18]. A score function is generally in the form of a linearly weighted sum as shown in Eq. (1) [10], [15], [22]. Here, p[i] and t[i] denote the weight and the value of the ith attribute of a tuple t, respectively. The d-dimensional vector that has p[i] as the ith element is called the preference vector [15], where d denotes the number of attributes of t. In this paper, we focus on monotone linear score functions where p[i] ≥ 0 (1 ≤ i ≤ d). In Section 3, we explain the score function in detail.

$$f(t) = \sum_{i=1}^{d} p[i] \cdot t[i] \qquad (1)$$
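As a minimal sketch of Eq. (1), the following Python function computes the score of one tuple for a given preference vector. The function name `linear_score` and the sample values are illustrative assumptions and do not come from the paper; the check on negative weights simply enforces the monotonicity condition p[i] ≥ 0 stated above.

```python
from typing import Sequence


def linear_score(p: Sequence[float], t: Sequence[float]) -> float:
    """Eq. (1): the weighted sum of the attribute values of tuple t.

    p is the preference vector and t is the tuple; both have d elements.
    (Hypothetical helper, written only to illustrate the formula.)
    """
    if len(p) != len(t):
        raise ValueError("preference vector and tuple must both have d elements")
    if any(w < 0 for w in p):
        raise ValueError("a monotone linear score function requires p[i] >= 0")
    return sum(w * v for w, v in zip(p, t))


# Illustrative values only: d = 3 attributes, each in [0.0, 1.0].
p = [0.5, 0.25, 0.25]
t = [0.8, 0.4, 0.2]
score = linear_score(p, t)  # 0.5*0.8 + 0.25*0.4 + 0.25*0.2, about 0.55
```

Because every weight is non-negative, increasing any single attribute value of t can never decrease the score, which is what makes the function monotone.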

For example, top-k queries are used for ranking colleges. Colleges are represented by a relation that has numerical attributes such as the research quality assessment, tuition, and graduate employment rate [8], [22]. Users search for the k highest-ranked colleges according to the score function with their own preference vector [8], [15], [22]. Thus, users with budget concerns may assign a high weight to the “tuition” attribute, while users expecting good employment will assign a high weight to the “graduate employment rate” attribute. As another example, top-k queries are used for ordering houses listed for sale. Houses are represented by a relation that has numerical attributes such as the price, number of bedrooms, age, and square footage [15], [22]. As in the previous example, users can search for the best houses with their own preference vector [8], [15], [22].

To process a top-k query, a naive method would calculate the score of each tuple according to the score function and then find the top k tuples by sorting the tuples based on their scores. This method, however, is not appropriate for a query with a relatively small value of k over a large database because it incurs a significant overhead by reading even those tuples that cannot possibly be in the result [22].
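The naive method can be sketched in a few lines of Python. The names `naive_top_k` and `heap_top_k` and the sample relation are assumptions made for illustration, with higher scores taken as better. The second variant uses a bounded heap to avoid the full sort, but it still reads every tuple, which is the overhead the paragraph above describes.

```python
import heapq
from typing import List, Sequence, Tuple

Row = Tuple[float, ...]


def naive_top_k(relation: Sequence[Row], p: Sequence[float], k: int) -> List[Tuple[float, Row]]:
    """Score every tuple, sort all of them by score, and keep the first k."""
    scored = [(sum(w * v for w, v in zip(p, t)), t) for t in relation]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


def heap_top_k(relation: Sequence[Row], p: Sequence[float], k: int) -> List[Tuple[float, Row]]:
    """Same result without a full sort; every tuple is still read and scored."""
    scored = ((sum(w * v for w, v in zip(p, t)), t) for t in relation)
    return heapq.nlargest(k, scored, key=lambda pair: pair[0])


# Illustrative relation with d = 3 attributes (values are made up).
colleges = [(0.9, 0.2, 0.7), (0.4, 0.9, 0.5), (0.6, 0.6, 0.6), (0.1, 0.3, 0.2)]
preference = (0.5, 0.3, 0.2)
best_two = naive_top_k(colleges, preference, 2)
```

Both functions touch all N tuples regardless of k, so their cost does not shrink when k is small.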

A number of methods have been proposed to answer top-k queries efficiently by accessing only a subset of the database instead of the entire one. These methods are classified into two categories depending on whether or not they exploit the relationship among the attributes (i.e., the attribute correlation [8]). The ones that do not exploit attribute correlation regard the attributes as independent of one another. That is, they consider a tuple as a potential top-k answer only if the tuple is ranked high in at least one of the attributes. We refer to these methods as list-based methods because they require maintaining one sorted list per attribute [2], [11]. While these methods show significant improvement compared to the naive method, they often consider an unnecessarily large number of tuples. For instance, when a tuple is ranked high in one attribute but low in all others, the tuple is likely to be ranked low in the final answer and can potentially be ignored, but these methods have to consider it because of its high rank in that one attribute.

The ones that exploit attribute correlation regard the attributes as dependent on one another. That is, in contrast to the list-based methods, they consider all attribute values when constructing an index. These methods are further classified into two categories: layer-based methods and view-based methods [22]. The view-based methods build multiple dedicated indices for multiple top-k queries with different preference vectors. That is, these methods create these top-k queries, execute each query, and store the result as a view. These methods answer top-k queries by reading tuples from the view(s) whose preference vector is the most similar to that of a given query. These methods have the disadvantage of being sensitive to the preference vector [22]. If the preference vector used in creating the view is similar to that of a given query, query performance is good; otherwise, it is very poor [22]. Query performance can be improved by increasing the number of views, but the space overhead increases in proportion to the number of views [21].

The layer-based methods construct the database as a single list of layers (simply, a layer list) where the ith layer contains the tuples that can potentially be the top-i answer. These methods answer top-k queries by reading at most k layers from the layer list. For constructing layers, convex hulls [4] or skylines [6] can be used [8], [22]. The convex hull is useful for supporting (monotone or non-monotone) linear functions [8], whereas the skyline is useful for supporting (linear or non-linear) monotone functions [6]. The layer-based methods have two advantages over the view-based methods: 1) storage overhead is negligible [8], and 2) query performance is not very sensitive to the preference vector of the score function given by the query [22]. Nevertheless, when the number of tuples in each layer (simply, the layer size) is large, these methods perform poorly because many unnecessary tuples have to be read to process the query [15].
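To make the idea of a layer list concrete, the sketch below builds layers by repeatedly peeling off the skyline of the remaining points and then answers a top-k query by reading only the first k layers. It is a simplified illustration under stated assumptions, not the construction used by any of the cited methods: higher attribute values and higher scores are taken as better, the quadratic skyline computation is chosen for brevity, and all function names and sample points are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: a is no worse in every attribute and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Points that no other point dominates (simple O(n^2) scan)."""
    return [q for q in points if not any(dominates(o, q) for o in points)]


def skyline_layers(points: Sequence[Point]) -> List[List[Point]]:
    """Layer i is the skyline of what remains after removing layers 1..i-1."""
    remaining = list(points)
    layers = []
    while remaining:
        layer = skyline(remaining)
        layers.append(layer)
        remaining = [q for q in remaining if q not in layer]
    return layers


def top_k_from_layers(layers: List[List[Point]], p: Sequence[float], k: int) -> List[Point]:
    """Answer a monotone top-k query by reading at most the first k layers."""
    candidates = [q for layer in layers[:k] for q in layer]
    candidates.sort(key=lambda q: sum(w * v for w, v in zip(p, q)), reverse=True)
    return candidates[:k]


# Illustrative 2-dimensional data (values are made up).
pts = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.6), (0.5, 0.5), (0.3, 0.2), (0.8, 0.05)]
layers = skyline_layers(pts)
answer = top_k_from_layers(layers, (0.7, 0.3), 2)
```

Reading k layers suffices because every point in a deeper layer is dominated by a chain of at least k points in the first k layers, each of which scores at least as high under a monotone function. The cost of the query is therefore driven by how many points those k layers contain, which is the layer-size problem described above.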

In this paper, we propose a new layer-based method, called the Partitioned-Layer Index (simply, the PL Index), that significantly improves top-k query processing by reducing the layer size. The PL Index thus overcomes the main drawback of the layer-based methods. Moreover, since the PL Index is itself a layer-based approach, it exploits attribute correlation, in contrast to the list-based methods, and retains the two advantages over the view-based methods. The contributions we make in this paper are as follows.

  • (1) We propose the notion of partitioning for constructing multiple sublayer lists instead of a single layer list. We partition the database into a number of distinct parts and then construct a sublayer list for each part. That is, the PL Index consists of a set of sublayer lists. The partitioning method allows us to reduce the sizes of the sublayers in inverse proportion to the number of sublayer lists. Accordingly, the PL Index overcomes the drawback of existing layer-based methods of large layer sizes while avoiding the space overhead of the view-based methods.

  • (2) By using the partitioning method, we propose the novel PL Index and its associated top-k query processing algorithm. Our algorithm dynamically constructs a virtual layer list by merging the sublayer lists of the PL Index to process a specific top-k query and returns the query results progressively. A novel feature of our algorithm is that it finds a top-i (2 ≤ i ≤ k) tuple by reading at most one sublayer from a specific sublayer list.

  • (3) We formally define the notion of the convex skyline, which is a subset of the skyline, to further reduce the sublayer size. The convex skyline is a notion more useful than either the skyline or the convex hull when supporting monotone linear score functions due to this characteristic of reducing the layer size.
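The effect of partitioning on the first step of query processing can be illustrated with a small sketch. It makes several simplifying assumptions that differ from the paper: the database is split round-robin rather than by the paper's partitioning algorithm, each first sublayer is a plain skyline rather than a convex skyline, higher scores are better, and only the top-1 tuple is retrieved. All names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(a: Point, b: Point) -> bool:
    """a dominates b: a is no worse in every attribute and better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))


def skyline(points: Sequence[Point]) -> List[Point]:
    """Points that no other point in the same part dominates."""
    return [q for q in points if not any(dominates(o, q) for o in points)]


def score(p: Sequence[float], t: Point) -> float:
    """Monotone linear score, assuming all weights are non-negative."""
    return sum(w * v for w, v in zip(p, t))


def first_sublayers(points: Sequence[Point], num_parts: int) -> List[List[Point]]:
    """Split the data into num_parts parts (round-robin stand-in for a real
    partitioning step) and return the first sublayer of each part."""
    parts = [list(points[i::num_parts]) for i in range(num_parts)]
    return [skyline(part) for part in parts if part]


def top1_from_first_sublayers(points: Sequence[Point], p: Sequence[float], num_parts: int) -> Point:
    """The top-1 tuple is found by reading only the first sublayer of every list."""
    candidates = [q for sub in first_sublayers(points, num_parts) for q in sub]
    return max(candidates, key=lambda q: score(p, q))


# Illustrative 2-dimensional data (values are made up).
data = [(0.9, 0.1), (0.1, 0.9), (0.6, 0.6), (0.5, 0.5), (0.3, 0.2), (0.8, 0.05)]
best = top1_from_first_sublayers(data, (0.7, 0.3), 2)
```

The sketch works because a tuple with the highest score overall also has the highest score within its own part, so it appears in (or ties with a tuple in) that part's first sublayer. How the later top-i tuples are located by reading one further sublayer at a time is specific to the paper's merging algorithm and is not reproduced here.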

The PL Index has four desirable properties. First, it significantly outperforms existing methods except for small values of k (say, k ≤ 9). Second, its query performance is quite insensitive to the preference vector of the score function. Third, it is capable of tuning the query performance for the most frequently used value of k by controlling the number of sublayer lists. Fourth, its query performance is approximately linear in the value of k. We investigate these properties in Section 6 and present the results of performance evaluation in Section 7.

The rest of this paper is organized as follows. Section 2 describes existing work related to this paper. Section 3 formally defines the problem. Section 4 proposes the method for building the PL Index. Section 5 presents the algorithm for processing top-k queries using the PL Index. Section 6 analyzes the performance of the PL Index in index building and querying. Section 7 presents the results of performance evaluation. Section 8 summarizes and concludes the paper.

Section snippets

Related work

There have been a number of methods proposed to answer top-k queries efficiently by accessing only a subset of the database. We classify the existing methods into three categories: the list-based method, layer-based method, and view-based method. We briefly review each of these methods in this section.

Problem definition

In this section, we formally define the problem of top-k queries. A target relation R has d attributes, A_1, A_2, …, A_d, of real values, and the cardinality of R is N [10], [15], [22]. Every tuple in the relation R can be considered as a point in the d-dimensional space [0.0, 1.0]^d. Hereafter, we call the space [0.0, 1.0]^d the universe, refer to a tuple t in R as an object t in the universe, and use the tuple and the object interchangeably as is appropriate. Table 1 summarizes the notation to be

Overview

We now explain how to construct the PL Index to process top-k queries. The goal of the PL Index is to reduce the sizes of the layers that are read for processing top-k queries because the layer size affects query performance in layer-based methods, as explained in the Introduction.

To achieve the goal, the PL Index is built through the following two steps as shown in Fig. 1: (1) Partitioning step: partitioning the universe into a number of subregions with long narrow hyperplanes of similar sizes that

The query processing algorithm using the PL Index

As shown in Fig. 10b, the PL Index can be considered as a set of ordered lists of sublayers. That is, as mentioned in Lemma 5, there is an order among sublayers within a sublayer list, but there is none among sublayers in different sublayer lists. Thus, to process queries using the PL Index, we merge the sublayer lists by sequentially reading sublayers from their heads. To find the top-1 object, we read the first sublayers of all the sublayer lists because we do not know which of them contains

Analysis of the PL Index

In this section, we analyze the index building time and the query processing time of the PL Index. For ease of analysis, we use uniform object distributions.

Experimental data and environment

We compare the index building time and the query performance of the PL Index with those of existing layer-based methods ONION [8] and AppRI [22], and existing view-based methods PREFER [15] and LPTA [10]. For the PL Index, we use the UniversePartitioning algorithm in Fig. 7. All these methods and the PL Index except for ONION support only monotone linear score functions. Thus, for a fair comparison with ONION, we also include Convex Skyline Index (CSI_ONION) in our comparison as a variant of

Conclusions

In this paper, we have proposed the PL Index that significantly improves the top-k query processing performance. We have also proposed the DynamicLayerOrdering algorithm that effectively evaluates top-k queries by using the PL Index. The PL Index partitions the universe (the database) into a number of subregions, and then, constructs a sublayer list for each subregion.

Building the PL Index consists of two steps: the partitioning step and the layering step. For the partitioning step, we have

Acknowledgements

This work was partially supported by the Korea Science and Engineering Foundation (KOSEF) grant funded by the Korean Government (MEST) (No. R0A-2007-000-20101-0). This work was also partially supported by the Internet Services Theme Program funded by Microsoft Research Asia.

References (24)

  • C. Buchta, On the average number of maxima in a set of vectors, Informat. Process. Lett. (1989)
  • L.A. Zadeh, Toward a generalized theory of uncertainty (GTU), Informat. Sci. (2005)
  • L.A. Zadeh, Is there a need for fuzzy logic?, Informat. Sci. (2008)
  • B. Barber et al., The quickhull algorithm for convex hulls, ACM Trans. Math. Software (1996)
  • M. Bast, D. Majumdar, R. Schenkel, M. Theobald, G. Weikum, IO-Top-k: index-access optimized top-k query processing, in:...
  • J.L. Bentley et al., On the average number of maxima in a set of vectors and applications, J. ACM (1978)
  • M. Berg et al., Computational Geometry: Algorithms and Applications (2000)
  • G. Beskales, M.A. Soliman, I.F. Ilyas, Efficient search for the top-k probable nearest neighbors in uncertain...
  • S. Borzsonyi, D. Kossmann, K. Stocker, The skyline operator, in: Proceedings of the 17th International Conference on...
  • Y.C. Chang, L. Bergman, V. Castelli, C.-S. Li, M.-L. Lo, J.R. Smith, The onion technique: indexing for linear...
  • C.-Y. Chan, P.-K. Eng, K.-L. Tan, Stratified computation of skylines with partially-ordered domains, in: SIGMOD,...
  • G. Das, D. Gunopulos, N. Koudas, D. Tsirogiannis, Answering top-k queries using views, in: Proceedings of the 32nd...