Probabilistic skyline operator over sliding windows

doi:10.1016/j.is.2012.03.002

Information Systems

Volume 38, Issue 8, November 2013, Pages 1212-1233

https://doi.org/10.1016/j.is.2012.03.002

Abstract

Skyline computation has many applications, including multi-criteria decision making. In this paper, we study the problem of efficiently computing the skyline over sliding windows on uncertain data elements against probability thresholds. First, we characterize the properties of the elements that must be kept in our computation. Then, we show the size of the dynamically maintained candidate set and the size of the skyline. Novel, efficient techniques are developed to process continuous probabilistic skyline queries over sliding windows. Finally, we extend our techniques to cover applications where multiple probability thresholds are given, “top-k” skyline data objects are retrieved, or elements have individual life-spans. Our extensive experiments demonstrate that the proposed techniques are very efficient and can handle a high-speed data stream in real time.

Introduction

Uncertain data analysis is key to many important applications, such as sensor networks, trend prediction, moving object management, data cleaning and integration, economic decision making, and market surveillance. In such applications, uncertain data is often collected in a streaming fashion. Uncertain streaming data computation has recently drawn considerable attention from the database research community (e.g., [9], [15], [30]).

Skyline analysis has been shown to be a very useful tool [4], [8], [23], [26] in multi-criteria decision making. Given a certain (i.e., non-uncertain) dataset $D$, an object $s_{1} \in D$ dominates another object $s_{2} \in D$ if $s_{1}$ is better than $s_{2}$ in at least one aspect and not worse than $s_{2}$ in all other aspects according to the preferences specified by users. The skyline of $D$ comprises the objects in $D$ that are not dominated by any other object from $D$. Skyline computation against uncertain data has also been studied recently [2], [21], [24]. In this paper, we investigate the problem of efficient skyline computation over uncertain streaming data where each data element has a probability to occur.
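The dominance test and the skyline definition above can be sketched in a few lines of Python. This is only a minimal illustration for a certain dataset, not a technique from the paper: it assumes each object is a tuple of numbers and that smaller values are preferred on every dimension (the convention the Background section adopts), and the names `dominates` and `skyline` are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def dominates(s1: Sequence[float], s2: Sequence[float]) -> bool:
    """Return True if s1 dominates s2.

    Assumption: smaller values are preferred on every dimension.
    """
    no_worse = all(a <= b for a, b in zip(s1, s2))
    strictly_better = any(a < b for a, b in zip(s1, s2))
    return no_worse and strictly_better


def skyline(points: Sequence[Point]) -> List[Point]:
    """Quadratic-time skyline: keep every point that no other point dominates."""
    # A point never dominates itself, so comparing against the full list is safe.
    return [p for p in points if not any(dominates(o, p) for o in points)]
```

For the made-up input `[(1, 4), (2, 2), (3, 3), (4, 1)]`, the point `(3, 3)` is dominated by `(2, 2)`, so `skyline` keeps the other three points. The pairwise scan is chosen for readability only; it makes no attempt at efficiency.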

In many online monitoring problems, the appearance of a data element is often uncertain. Below are two examples. In an online shopping system, products are evaluated on various aspects such as price, condition (e.g., brand new, excellent, good, average), and brand. A customer may want to select a product, say a laptop, based on multiple criteria (preferences) such as low price, good condition, and good brand. It is well known [4] that the skyline provides a candidate set of best deals. In this application, each seller is also associated with a “trustability” value derived from customers' feedback on the seller's product quality, delivery handling, etc.; the trustability value may be regarded as the “appearance” probability of the product since it represents the probability that the product occurs exactly as described in the advertisement in terms of delivery and quality. For simplicity, we assume that a customer only prefers the ThinkPad T61; thus we remove the brand dimension from the ranking. Table 1 lists four qualified results. Both L₁ and L₄ are skyline points regarding (price, condition), L₁ is better than (dominates) L₂, and L₄ is better than L₃. Nevertheless, L₁ was posted a long time ago, and the trustability of L₄ is quite low. In such applications, customers may want to continuously monitor online advertisements by selecting the candidates for the best deal, that is, the skyline points. Clearly, we need to “discount” the dominating ability of offers whose trustability is too low. Moreover, offers that are too old may not be relevant. We can model such an online selection problem as a probabilistic skyline against sliding windows by treating online advertisements as an uncertain data stream (see Section 2 for details) such that each data element (advertisement) has an occurrence probability.

An uncertain data stream may arrive at a very high speed. Consider stock market applications where clients may be eager to buy a particular stock and want to monitor, online, good offers (for sale) from other clients for this stock. An offer is recorded by two aspects (price, volume), where price is the price per share in the offer and volume is the number of shares offered for sale. In such applications, customers may want to know the top offers so far, as one of many kinds of statistical information, before making trade decisions. An offer a is better than another offer b if a involves a higher volume and a lower price per share than b. Nevertheless, an offer from a client may be withdrawn from time to time; thus, it also has a probability to exist (i.e., an occurrence probability). Consequently, a stream of sale offers may be treated as a stream of uncertain elements such that each element has a probability to occur. Clearly, some clients may only want to know the “top” offers (skyline) among the most recent N offers (sliding windows) or the offers made in the most recent period T; this, together with the uncertainty of each offer, gives another example of probabilistic skyline against sliding windows.
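The stock example can be made concrete with a brute-force Python sketch over a count-based window of the most recent N offers. It is not the incremental technique developed in the paper; it simply rescans the window on every arrival. Two assumptions are made here because the formal definitions are not reproduced in this section: offers occur independently, and the skyline probability of an offer is its own occurrence probability multiplied by the probability that none of the offers dominating it occurs. Dominance uses the general "no worse on both aspects, strictly better on one" form. The names `Offer`, `WindowSkyline`, `push`, and `sky_prob` are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List


@dataclass(frozen=True)
class Offer:
    price: float  # price per share (lower is better)
    volume: int   # number of shares offered (higher is better)
    prob: float   # occurrence probability, 0 < prob <= 1


def dominates(a: Offer, b: Offer) -> bool:
    """a dominates b: no worse on both aspects, strictly better on at least one."""
    no_worse = a.price <= b.price and a.volume >= b.volume
    strictly_better = a.price < b.price or a.volume > b.volume
    return no_worse and strictly_better


class WindowSkyline:
    """Naive q-skyline over the most recent n offers (illustrative only)."""

    def __init__(self, n: int, q: float) -> None:
        # deque(maxlen=n) silently drops the oldest offer once n is exceeded.
        self.window: Deque[Offer] = deque(maxlen=n)
        self.q = q

    def sky_prob(self, o: Offer) -> float:
        # Assumed definition: P(o) times the product of (1 - P(o')) over
        # every offer o' in the window that dominates o.
        p = o.prob
        for other in self.window:
            if dominates(other, o):
                p *= 1.0 - other.prob
        return p

    def push(self, offer: Offer) -> List[Offer]:
        """Add an offer and return the offers whose skyline probability is >= q."""
        self.window.append(offer)
        return [o for o in self.window if self.sky_prob(o) >= self.q]
```

With made-up figures, take N = 3, q = 0.5, and push `Offer(10.0, 100, 0.9)`, `Offer(11.0, 80, 0.8)`, `Offer(9.5, 120, 0.3)`, and `Offer(10.5, 90, 0.9)` in that order. The first offer is the only one reported after each of the first three arrivals, since the third offer dominates it but has a low occurrence probability. Once the first offer expires, only the fourth is reported. Each arrival costs time quadratic in the window size, which is exactly the cost an incremental method aims to avoid.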

While the two examples above demonstrate the usefulness of online skyline monitoring over uncertain data, online skyline monitoring over uncertain streaming data regarding sliding windows has many other applications. In this paper we investigate the problem of efficiently monitoring probabilistic skyline against sliding windows. To the best of our knowledge, there is no similar work in the literature in the context of skyline computation over uncertain data streams. In light of data stream computation, it is highly desirable to develop online, efficient, incremental, memory-based techniques that use little memory. Our contributions may be summarized as follows.

• We characterize the minimum information needed in continuously computing probabilistic skyline against a sliding window.
• We show that the volume of such minimum information is expected to be bounded by poly-logarithmic sizes regarding a given window size.
• We develop novel, incremental techniques to continuously compute probabilistic skyline over sliding windows.
• We extend our techniques to support multiple pre-given probability thresholds, “top-k” probabilistic skyline data elements, and data elements with individual life-spans.

Our extensive experiments demonstrate that the developed techniques can support online computation against very rapid data streams.

The rest of the paper is organized as follows. In Section 2, we formally define the problem of sliding-window skyline computation on uncertain data streams and present background information. Sections 3 and 4 present the framework, fundamental theories, and techniques for processing probability-threshold-based sliding window queries. Section 5 extends our techniques to multi-thresholds, top-k skyline, and time-based sliding windows where each data element has a life-span. Results of comprehensive performance studies are discussed in Section 6. Section 7 summarizes related work and Section 8 concludes the paper.

Section snippets

Background

We use DS to represent a sequence (stream) of data elements in a d-dimensional numeric space such that each element a has a probability $P(a)$ ($0 < P(a) \leq 1$) to occur, where $a.i$ (for $1 \leq i \leq d$) denotes the i-th dimension value. Without loss of generality, we assume that on each dimension, users prefer small values. For two elements u and v, u dominates v, denoted by $u \prec v$, if $u.i \leq v.i$ for $1 \leq i \leq d$, and there exists a dimension j with $u.j < v.j$. Given a set of elements, the skyline consists of all points which are

Framework

A critical requirement in data stream computation is to have small memory space and fast computation. Consequently, given a probability threshold q and a sliding window with length N, it would be ideal if the continuous computation can be conducted against the probabilistic q-skyline ${SKY}_{N, q}$ of $DS_{N}$ by excluding all the other elements. However, this is impossible. For instance, regarding the example in Fig. 1(a), assume that $N = 5$ and $q = 0.5$. It is immediate that $P_{sky}(a_{4}) = 0.0342 < q$; that is, $a_{4} \notin {SKY}_{N, q}$

Algorithms

A trivial execution of Algorithm 1 is to visit each element in $S_{N, q}$ to update its skyline probability whenever an element is inserted or deleted, and then to choose the elements a from $S_{N, q}$ with $P_{sky}|_{S_{N, q}}(a) \geq q$. Note that a new data element may cause several elements to be deleted from $S_{N, q}$; nevertheless, the time complexity is $O(|S_{N, q}|)$ per element, which is expected to be poly-logarithmic regarding N with the order of d (Section 3.2). In this section, we present novel techniques to efficiently execute Algorithm 1 based on

Variants

The techniques developed in the paper can be immediately extended to cover the following variants.

Performance evaluation

In this section, we only evaluate our techniques since this is the first paper studying the problem of probabilistic skyline computation and its variations over sliding windows. Specifically, we implement and evaluate the following techniques.

SSKY
Techniques in Section 4 to continuously compute q-skyline (i.e., skyline with the probability no less than a given q) against a sliding window.
MSKY
Techniques in Section 5.1 to continuously compute multiple q-skylines concurrently regarding multiple given

Related work

We review related work in two aspects, skylines and uncertain data streams.

Skylines. Börzsönyi et al. [4] first study the skyline operator in the context of databases and propose an SQL syntax for the skyline query. They also develop two computation techniques based on block-nested-loop and divide-and-conquer paradigms, respectively. Another block-nested-loop based technique SFS (sort-filter-skyline) is proposed by Chomicki et al. [8], which takes advantage of a presorting step. SFS is then

Conclusion

In this paper, we investigate the problem of efficiently computing skyline against sliding windows over an uncertain data stream. We first model the probability threshold based skyline problem. Then, we present a framework which is based on efficiently maintaining a candidate set. We show that such a candidate set is the minimum information we need to keep. Efficient techniques have been presented to process continuous queries. We extend our techniques to concurrently support processing a set

References (32)

C.C. Aggarwal, P.S. Yu, A framework for clustering uncertain data streams, in: ICDE,...
M.J. Atallah, Y. Qi, Computing all skyline probabilities for uncertain data, in: PODS,...
W.-T. Balke, U. Güntzer, J.X. Zheng, Efficient distributed skylining for web information systems, in: EDBT...
S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: ICDE,...
C.-Y. Chan, P.-K. Eng, K.-L. Tan, Stratified computation of skylines with partially ordered domains, in: SIGMOD,...
C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, On high dimensional skylines, in: EDBT,...
C.-Y. Chan, H.V. Jagadish, K.-L. Tan, A.K.H. Tung, Z. Zhang, Finding k-dominant skylines in high dimensional space, in:...
J. Chomicki, P. Godfrey, J. Gryz, D. Liang, Skyline with presorting, in: ICDE,...
G. Cormode, M. Garofalakis, Sketching probabilistic data streams, in: SIGMOD,...
E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: VLDB,...
P. Godfrey, R. Shipley, J. Gryz, Maximal vector computation in large datasets, in: VLDB,...
Y.-W. Huang, N. Jing, E.A. Rundensteiner, Spatial joins using R-trees: breadth-first traversal with global...
Z. Huang, C.S. Jensen, H. Lu, B.C. Ooi, Skyline queries against mobile lightweight devices in MANETs, in: ICDE,...
T. Jayram, S. Kale, E. Vee, Efficient aggregation algorithms for probabilistic data, in: SODA,...
T.S. Jayram, A. McGregor, S. Muthukrishnan, E. Vee, Estimating statistical aggregates on probabilistic data streams,...
C. Jin, K. Yi, L. Chen, J.X. Yu, X. Lin, Sliding-window top-k queries on uncertain streams, in: VLDB,...

