Elsevier

Information Systems

Volume 38, Issue 8, November 2013, Pages 1212-1233
Probabilistic skyline operator over sliding windows

https://doi.org/10.1016/j.is.2012.03.002

Abstract

Skyline computation has many applications, including multi-criteria decision making. In this paper, we study the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. First, we characterize the properties of the elements that must be kept in our computation. Then, we show the size of the dynamically maintained candidate set and the size of the skyline. Novel, efficient techniques are developed to process continuous probabilistic skyline queries over sliding windows. Finally, we extend our techniques to cover applications where multiple probability thresholds are given, “top-k” skyline data objects are retrieved, or elements have individual life-spans. Our extensive experiments demonstrate that the proposed techniques are very efficient and can handle a high-speed data stream in real time.

Introduction

Uncertain data analysis is key to many important applications, such as sensor networks, trend prediction, moving object management, data cleaning and integration, economic decision making, and market surveillance. In such applications, uncertain data is often collected in a streaming fashion. Uncertain streaming data computation has recently drawn considerable attention from the database research community (e.g. [9], [15], [30]).

Skyline analysis has been shown to be a very useful tool [4], [8], [23], [26] in multi-criteria decision making. Given a certain dataset D, an object s1 ∈ D dominates another object s2 ∈ D if s1 is better than s2 in at least one aspect and not worse than s2 in all other aspects according to the preferences specified by users. The skyline of D comprises the objects in D that are not dominated by any other object from D. Skyline computation against uncertain data has also been studied recently [2], [21], [24]. In this paper, we investigate the problem of efficient skyline computation over uncertain streaming data where each data element has a probability to occur.
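The dominance relation and the skyline of a certain (non-probabilistic) dataset can be sketched in a few lines. This is only an illustrative quadratic-time sketch, not one of the algorithms cited above; the function names `dominates` and `skyline` are hypothetical, and the code assumes that smaller values are preferred on every dimension.

```python
from typing import List, Sequence


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """Return True if u dominates v, assuming smaller values are preferred.

    u must be no worse than v on every dimension and strictly better
    on at least one dimension.
    """
    no_worse = all(a <= b for a, b in zip(u, v))
    strictly_better = any(a < b for a, b in zip(u, v))
    return no_worse and strictly_better


def skyline(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Return the points not dominated by any other point (naive O(n^2) scan)."""
    # A point never dominates itself, so no explicit self-exclusion is needed.
    return [p for p in points if not any(dominates(o, p) for o in points)]


if __name__ == "__main__":
    data = [(1, 4), (2, 2), (3, 3), (4, 1)]
    # (3, 3) is dominated by (2, 2); the other three points are incomparable.
    print(skyline(data))
```

With the four sample points, only (3, 3) is dropped, because (2, 2) is at least as good on both dimensions and strictly better on each.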

In many online monitoring problems, the appearance of a data element is often uncertain. Below are two examples. In an online shopping system, products are evaluated on various aspects such as price, condition (e.g., brand new, excellent, good, average), and brand. A customer may want to select a product, say a laptop, based on multiple criteria (preferences) such as low price, good condition, and good brand. It is well known [4] that the skyline provides a candidate set of best deals. In this application, each seller is also associated with a “trustability” value, which is derived from customers' feedback on the seller's product quality, delivery handling, etc.; the trustability value may be regarded as the “appearance” probability of the product, since it represents the probability that the product occurs exactly as described in the advertisement in terms of delivery and quality. For simplicity, we assume that a customer only wants a ThinkPad T61; thus we remove the brand dimension from ranking. Table 1 lists four qualified results. Both L1 and L4 are skyline points regarding (price, condition): L1 is better than (dominates) L2, and L4 is better than L3. Nevertheless, L1 was posted a long time ago, and the trustability of L4 is quite low. In such applications, customers may want to continuously monitor online advertisements by selecting the candidates for the best deal, that is, the skyline points. Clearly, we need to “discount” the dominating ability of offers with too low a trustability. Moreover, offers that are too old may not be relevant. We can model such an online selection problem as a probabilistic skyline against sliding windows by treating online advertisements as an uncertain data stream (see Section 2 for details) in which each data element (advertisement) has an occurrence probability.
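The “discounting” of dominating ability can be made concrete with a skyline probability. The formal definition belongs to Section 2 and is not reproduced in this excerpt, so the formula below is the common formulation under the assumption that elements occur independently; the symbols P_{sky}, DS_N (the N most recent elements) and ≺ (dominance) follow the notation used in the section snippets.

```latex
P_{sky}(a) \;=\; P(a) \prod_{a' \in DS_N,\; a' \prec a} \bigl(1 - P(a')\bigr)
```

Under this reading, a dominating offer with low trustability contributes a factor close to 1 and therefore barely reduces the skyline probability of the offers it dominates, while a highly trusted dominator pushes it towards 0.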

An uncertain data stream may arrive at a very high speed. Consider stock market applications where clients may be eager to buy a particular stock and want to monitor online the good offers (for sale) from other clients for this particular stock. An offer is recorded by two aspects (price, volume), where price is the price per share in the offer and volume is the number of shares offered for sale. In such applications, customers may want to know the top offers so far, as one of many kinds of statistical information, before making trade decisions. An offer a is better than another offer b if a involves a higher volume and is cheaper (per share) than b. Nevertheless, an offer from a client may be withdrawn from time to time; thus, it also has a probability to exist (i.e., it has an occurrence probability). Consequently, a stream of sale offers may be treated as a stream of uncertain elements such that each element has a probability to occur. Clearly, some clients may only want to know the “top” offers (skyline) among the most recent N offers (sliding windows) or the offers made in the most recent T period; this, together with the uncertainty of each offer, gives another example of probabilistic skyline against sliding windows.
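To illustrate the query itself, the following is a naive baseline over a count-based window of the most recent N offers. It is a sketch under stated assumptions, not the incremental technique developed in the paper: it rescans the whole window for every element, the class `NaiveWindowSkyline` and its methods are hypothetical names, and the skyline probability is assumed to be the element's own probability multiplied by (1 - P) for every dominating element in the window, with elements treated as independent. Offers are encoded as (price, -volume) so that smaller is better on both dimensions.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple

Point = Tuple[float, ...]
Element = Tuple[Point, float]  # (attribute vector, occurrence probability)


def dominates(u: Sequence[float], v: Sequence[float]) -> bool:
    """u dominates v when smaller values are preferred on every dimension."""
    return all(a <= b for a, b in zip(u, v)) and any(a < b for a, b in zip(u, v))


class NaiveWindowSkyline:
    """Brute-force probabilistic q-skyline over the N most recent elements."""

    def __init__(self, window_size: int, q: float) -> None:
        # A bounded deque evicts the oldest element automatically on append.
        self.window: Deque[Element] = deque(maxlen=window_size)
        self.q = q

    def insert(self, point: Sequence[float], prob: float) -> None:
        self.window.append((tuple(point), prob))

    def skyline_probability(self, point: Sequence[float], prob: float) -> float:
        # Assumed definition: own probability times the probability that
        # none of the dominating elements in the window occurs.
        result = prob
        for other, other_prob in self.window:
            if dominates(other, point):
                result *= 1.0 - other_prob
        return result

    def q_skyline(self) -> List[Point]:
        return [pt for pt, p in self.window
                if self.skyline_probability(pt, p) >= self.q]


if __name__ == "__main__":
    w = NaiveWindowSkyline(window_size=3, q=0.5)
    w.insert((10.0, -100), 0.9)  # price 10, volume 100
    w.insert((11.0, -80), 0.8)   # dominated by the first offer
    w.insert((9.0, -50), 0.6)    # cheaper but smaller volume
    print(w.q_skyline())
    w.insert((12.0, -200), 0.7)  # the first offer now leaves the window
    print(w.q_skyline())
```

In this run the second offer is initially excluded because its skyline probability is about 0.8 × 0.1 = 0.08. Once the first offer expires it is no longer dominated and re-enters the result, which shows why keeping only the current skyline elements is not sufficient for continuous computation.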

While the two examples above demonstrate the usefulness of online skyline monitoring over uncertain data, online skyline monitoring over uncertain streaming data with sliding windows has many other applications. In this paper we investigate the problem of efficiently monitoring probabilistic skyline against sliding windows. To the best of our knowledge, there is no similar existing work in the literature in the context of skyline computation over uncertain data streams. In light of the requirements of data stream computation, it is highly desirable to develop online, efficient, memory-based, incremental techniques that use little memory. Our contributions may be summarized as follows.

  • We characterize the minimum information needed in continuously computing probabilistic skyline against a sliding window.

  • We show that the volume of such minimum information is expected to be bounded by poly-logarithmic sizes regarding a given window size.

  • We develop novel, incremental techniques to continuously compute probabilistic skyline over sliding windows.

  • We extend our techniques to support multiple pre-given probability thresholds, “top-k” probabilistic skyline data elements, and data elements with individual life-spans.

Our extensive experiments demonstrate that the developed techniques can support online computation against very rapid data streams.

The rest of the paper is organized as follows. In Section 2, we formally define the problem of sliding-window skyline computation on uncertain data streams and present background information. Sections 3 and 4 present the framework, fundamental theories, and techniques for processing probability threshold based sliding window queries. Section 5 extends our techniques to multi-thresholds, top-k skyline, and time-based sliding windows where each data element has a life-span. Results of comprehensive performance studies are discussed in Section 6. Section 7 summarizes related work and Section 8 concludes the paper.

Section snippets

Background

We use DS to represent a sequence (stream) of data elements in a d-dimensional numeric space such that each element a has a probability P(a) (0 < P(a) ≤ 1) to occur, where a.i (for 1 ≤ i ≤ d) denotes the i-th dimension value. Without loss of generality, we assume that on each dimension, users prefer small values. For two elements u and v, u dominates v, denoted by u ≺ v, if u.i ≤ v.i for 1 ≤ i ≤ d, and there exists a dimension j with u.j < v.j. Given a set of elements, the skyline consists of all points which are

Framework

A critical requirement in data stream computation is to have small memory space and fast computation. Consequently, given a probability threshold q and a sliding window with length N, it would be ideal if the continuous computation could be conducted against the probabilistic q-skyline SKY_{N,q} of DS_N by excluding all the other elements. However, this is impossible. For instance, regarding the example in Fig. 1(a), assume that N = 5 and q = 0.5. It is immediate that P_sky(a4) = 0.0342 < q; that is, a4 ∉ SKY_{N,q}

Algorithms

A trivial execution of Algorithm 1 is to visit each element in S_{N,q} to update its skyline probability whenever an element is inserted or deleted, and then to choose the elements a from S_{N,q} with P_sky|S_{N,q}(a) ≥ q. Note that a new data element may cause several elements to be deleted from S_{N,q}; nevertheless, the time complexity is O(|S_{N,q}|) per element, which is expected to be poly-logarithmic in N with the order of d (Section 3.2). In this section, we present novel techniques to efficiently execute Algorithm 1 based on

Variants

The techniques developed in the paper can be immediately extended to cover the following variants.

Performance evaluation

In this section, we only evaluate our techniques since this is the first paper studying the problem of probabilistic skyline computation and its variations over sliding windows. Specifically, we implement and evaluate the following techniques.

  • SSKY

    Techniques in Section 4 to continuously compute q-skyline (i.e., skyline with the probability no less than a given q) against a sliding window.

  • MSKY

    Techniques in Section 5.1 to continuously compute multiple q-skylines concurrently regarding multiple given

Related work

We review related work in two aspects, skylines and uncertain data streams.

Skylines. Börzsönyi et al. [4] first study the skyline operator in the context of databases and propose an SQL syntax for the skyline query. They also develop two computation techniques based on block-nested-loop and divide-and-conquer paradigms, respectively. Another block-nested-loop based technique SFS (sort-filter-skyline) is proposed by Chomicki et al. [8], which takes advantage of a presorting step. SFS is then

Conclusion

In this paper, we investigate the problem of efficiently computing skyline against sliding windows over an uncertain data stream. We first model the probability threshold based skyline problem. Then, we present a framework which is based on efficiently maintaining a candidate set. We show that such a candidate set is the minimum information we need to keep. Efficient techniques have been presented to process continuous queries. We extend our techniques to concurrently support processing a set

References (32)

  • C.C. Aggarwal, P.S. Yu, A framework for clustering uncertain data streams, in: ICDE,...
  • M.J. Atallah, Y. Qi, Computing all skyline probabilities for uncertain data, in: PODS,...
  • W.-T. Balke, U. Guntzer, J.X. Zheng, Efficient distributed skylining for web information systems, in: EDBT...
  • S. Borzsonyi, D. Kossmann, K. Stocker, The skyline operator, in: ICDE,...
  • C.-Y. Chan, P.-K. Eng, K.-L. Tan, Stratified computation of skylines with partially ordered domains, in: SIGMOD,...
  • C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, On high dimensional skylines, in: EDBT,...
  • C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, Z. Zhang, Finding k-dominant skylines in high dimensional space, in:...
  • J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting, in: ICDE,...
  • G. Cormode, M. Garofalakis, Sketching probabilistic data streams, in: SIGMOD,...
  • E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: VLDB,...
  • P. Godfrey, R. Shipley, J. Gryz, Maximal vector computation in large datasets, in: VLDB,...
  • Y.-W. Huang, N. Jing, E.A. Rundensteiner, Spatial joins using R-trees: breadth-first traversal with global...
  • Z. Huang, C.S. Jensen, H. Lu, B.C. Ooi, Skyline queries against mobile lightweight devices in MANETs, in: ICDE,...
  • T. Jayram, S. Kale, E. Vee, Efficient aggregation algorithms for probabilistic data, in: SODA,...
  • T.S. Jayram, A. McGregor, S. Muthukrishnan, E. Vee, Estimating statistical aggregates on probabilistic data streams,...
  • C. Jin, K. Yi, L. Chen, J.X. Yu, X. Lin, Sliding-window top-k queries on uncertain streams, in: VLDB,...