Information Sciences

Volume 226, 20 March 2013, Pages 23-46
Probabilistic top-k dominating queries in uncertain databases

https://doi.org/10.1016/j.ins.2012.10.020

Abstract

Due to the existence of uncertain data in a wide spectrum of real applications, uncertain query processing, which differs dramatically from handling certain data in a traditional database, has become increasingly important. In this paper, we formulate and tackle an important query, namely the probabilistic top-k dominating (PTD) query, in the uncertain database. In particular, a PTD query retrieves k uncertain objects that are expected to dynamically dominate the largest number of uncertain objects. We propose an effective pruning approach to reduce the PTD search space, and present an efficient query procedure to answer PTD queries. Moreover, approximate PTD query processing and the case where the PTD query is issued from an uncertain query object are also discussed. Furthermore, we propose an important query type, that is, the PTD query in arbitrary subspaces (namely SUB-PTD), which is more challenging, and provide an effective pruning method to facilitate the SUB-PTD query processing. Extensive experiments have demonstrated the efficiency and effectiveness of our proposed PTD query processing approaches.

Introduction

Uncertain data exist in many real-world applications such as sensor networks [14], [23], object identification [2], location-based services (LBS) [30], and moving object tracking [8], [6], [28]. Uncertain data objects are usually modeled as uncertainty regions [8], [39], in which objects can reside with any data distribution. In some applications, the probability density function (pdf) of each object is known [9], [2]. In other cases, the pdf is not explicitly available in practice [32], so a number of instances are collected to approximate it. For example, in sensor networks, sensory data collected at a specific timestamp often contain noise due to environmental factors or device failure. In this case, samples obtained within a short period around that timestamp can be used to represent the distribution of its possible values.

Table 1 depicts an example of an uncertain database, which consists of three uncertain objects u, v, and w. Each uncertain object t may have one or more instances, ti, with appearance probability ti · p ∈ [0, 1] (i.e. the probability that the instance appears at its position). For instance, object u has one instance u1 residing at location 〈6, 2〉 with probability u1 · p = 1. Similarly, object v has two instances v1 and v2, with probabilities v1 · p = 0.3 and v2 · p = 0.7, respectively. Furthermore, object w contains two instances w1 and w2 with appearance probabilities w1 · p = 0.5 and w2 · p = 0.5, respectively.
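The instance-based model above can be sketched as a small data structure. This is only an illustration, not the paper's implementation: the class names Instance and UncertainObject are hypothetical, and because Table 1 is not reproduced here, every coordinate other than the location 〈6, 2〉 of u1 is a made-up placeholder. Only the appearance probabilities come from the text.

```python
from dataclasses import dataclass
from math import isclose
from typing import Tuple


@dataclass(frozen=True)
class Instance:
    location: Tuple[float, ...]  # position of the instance in attribute space
    p: float                     # appearance probability, in [0, 1]


@dataclass(frozen=True)
class UncertainObject:
    name: str
    instances: Tuple[Instance, ...]

    def total_probability(self) -> float:
        """Sum of the appearance probabilities of all instances."""
        return sum(inst.p for inst in self.instances)


# Probabilities follow the text; locations marked "placeholder" are assumptions.
u = UncertainObject("u", (Instance((6, 2), 1.0),))
v = UncertainObject("v", (Instance((1, 1), 0.3),   # placeholder location
                          Instance((2, 3), 0.7)))  # placeholder location
w = UncertainObject("w", (Instance((4, 5), 0.5),   # placeholder location
                          Instance((5, 4), 0.5)))  # placeholder location
database = [u, v, w]

for obj in database:
    # In this example, each object's instance probabilities sum to 1.
    assert isclose(obj.total_probability(), 1.0)
```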

Query processing over uncertain databases has played an increasingly important role in applications like multi-criteria decision making [32], data cleansing [29], and so on. One important query type in the uncertain database is the probabilistic ranked (PRank) query [25], which retrieves uncertain objects that are expected to have the ith rank with the highest probability, for 1 ≤ i ≤ k, where k is a user-specified integer and the rank of each object is determined by a score computed with a linear function. A clear advantage of the PRank query is that users can control the size of the PRank answer set through parameter k. On the other hand, it might not be convenient for users to specify an appropriate ranking function, since ranking scores are sensitive to different scales in different dimensions and there are no explicit guidelines for selecting ranking functions.

In the literature, Pei et al. [32] proposed the probabilistic skyline query, which retrieves all the uncertain objects whose expected probability of being in the skyline is greater than a threshold. The skyline query has the nice property that its query processing does not require users to specify a ranking function. Furthermore, its query result is invariant to scales in different dimensions. However, one problem with the skyline definition is that the size of the skyline answer set cannot be flexibly controlled by users. That is, users might be overwhelmed by too many returned skyline objects.

Motivated by the shortcomings of both queries above, in this paper, we formulate an important query, namely the probabilistic top-k dominating (PTD) query, in the context of uncertain databases. Specifically, a PTD query obtains k uncertain objects in an uncertain database that are expected to be better than (that is, to dynamically dominate) the largest number of objects with respect to a query point. Note that the PTD query has the advantages of both probabilistic ranked and skyline queries: it is invariant to scales in different dimensions, it does not require users to specify ranking functions, and it lets users control the size of the answer set.

The PTD query has many practical applications. For example, in a coal mine surveillance application [45], [23], sensors are deployed in the mine to collect sensory data (i.e. samples) such as density of oxygen, gas, and dust, as well as temperature and humidity. Dangerous events like fires or gas leakage in the coal mine usually follow some patterns (a.k.a. contour maps [45]) in the data. Thus, once such a dangerous event is detected, workers should evacuate from the mine. On account of environmental factors, device failure, or transmission delay, the collected samples from sensors inherently contain noise. It is therefore important for a coal mine manager to accurately detect dangerous patterns on such uncertain data (note: false alarms may lead to a loss of millions of dollars for each evacuation [45]). Here, samples collected from each sensor within a period can be considered as instances of an uncertain object (with attributes like temperature and oxygen density) in the uncertain database. Given a query pattern (e.g. a fire pattern with temperature and oxygen density thresholds), a coal mine manager can conduct a PTD query to obtain the k sensors that are most likely to encounter fire events among all the monitored places. Intuitively, a place has a high chance of being on fire if sensor data reported in many other places are farther away from a fire pattern than data collected in this place on all attributes.

As another example, assume we want to deploy a piece of equipment in the open air for environmental surveillance. However, the equipment has its own required working conditions (e.g. temperature, humidity and light). If we have some historical sensory data collected from different places in this area (data from each place correspond to samples of an uncertain object), then we can issue a PTD query over these (noisy) sensor data and find k potential locations (where data are collected) to place this equipment. Intuitively, the PTD answers are those places that suit the requirements of the equipment better than the largest number of other places.

To the best of our knowledge, no previous work has studied the PTD problem in uncertain databases. Yiu and Mamoulis [47] explored the top-k dominating query in the “certain” database with a static setting, which is however not directly applicable to uncertain data (otherwise, inaccuracy or even errors may be introduced due to the data uncertainty). Basically, the complexity and challenges of our PTD query processing are twofold. First, rather than static skyline where attribute values of each object are fixed (in the definition of [47], [32]), in this paper, we consider dynamic skyline [31], [12] such that each attribute of objects is dynamically computed with respect to an ad hoc query point. Second, given an uncertain database, our PTD query is equivalent to processing a top-k dominating query over each combination of instances from uncertain objects and then aggregating (condensing) the query results in all combinations to obtain answers with the highest ranks. However, it is inefficient and even infeasible to materialize every possible combination (due to the exponential number of combinations) to conduct queries, which raises efficiency concerns for PTD query processing. Thus, specific techniques should be designed for efficiently answering PTD queries without materializing all the instance combinations.
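To see why materializing every instance combination is costly, the sketch below enumerates the combinations for the three objects of Table 1 using only their appearance probabilities. It assumes that objects are independent, so that the probability of a combination is the product of the probabilities of the chosen instances. That independence assumption, and the names objects and possible_worlds, are illustrative and not taken from the paper.

```python
from itertools import product
from math import prod

# Appearance probabilities of each object's instances, as given for Table 1.
objects = {
    "u": {"u1": 1.0},
    "v": {"v1": 0.3, "v2": 0.7},
    "w": {"w1": 0.5, "w2": 0.5},
}


def possible_worlds(objs):
    """Yield (chosen instance names, probability) for every combination.

    Assumes independence between objects (an assumption of this sketch).
    """
    per_object = [list(instances.items()) for instances in objs.values()]
    for combo in product(*per_object):
        names = tuple(name for name, _ in combo)
        yield names, prod(p for _, p in combo)


worlds = list(possible_worlds(objects))
for names, prob in worlds:
    print(names, round(prob, 2))

# The number of combinations is the product of the instance counts,
# so it grows exponentially with the number of objects.
print(len(worlds))
```

Here there are 1 × 2 × 2 = 4 combinations whose probabilities sum to 1. With n objects of m instances each, the count would be m^n, which is the blow-up that motivates avoiding explicit enumeration.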

Therefore, in this paper, we first formalize the PTD query in the uncertain database. Then, to efficiently answer PTD queries, we propose effective pruning methods to reduce the PTD search space and seamlessly integrate them into an efficient query procedure. Moreover, by trading accuracy for efficiency, we propose an approximate approach which utilizes a probabilistic FM-sketch to facilitate the PTD query processing. In addition, PTD query processing with an uncertain query object is also discussed.

Furthermore, we also propose a novel query that considers PTD queries in arbitrary subspaces, namely the probabilistic top-k dominating query in the subspace (SUB-PTD), which obtains PTD answers in arbitrary user-specified subspaces. Note that this type of query is practical and important in applications where users have different attribute preferences on the PTD queries. For example, the fire event may correspond to a subspace with the two dimensions of temperature and density of gas. Thus, PTD should be performed in this 2D subspace rather than the full space with all attributes. One major challenge of SUB-PTD is how to handle the exponential number of possible queried subspaces, which involves non-trivial issues of indexing and query efficiency. In this work, we will give efficient and effective solutions to the SUB-PTD problem.

In particular, we make the following contributions.

  1. We formulate and tackle the problem of probabilistic top-k dominating (PTD) query in the context of uncertain databases in Section 3.

  2. We present heuristics of our pruning methods and propose an efficient approach to retrieve the exact answer to PTD queries in Section 4.

  3. We propose an efficient approach to obtain approximate PTD answers in Section 5, trading accuracy for efficiency. The PTD variant where the query point is uncertain is also discussed in Section 6.

  4. We propose an efficient solution to a novel problem, namely the SUB-PTD query, in Section 7, which retrieves the PTD answers in arbitrary subspaces.

In addition, Section 2 briefly overviews the top-k dominating query processing over precise data and previous works on query processing in uncertain databases. Section 8 demonstrates the performance of PTD query processing through extensive experiments. Section 9 concludes this paper.

Section snippets

Related work

In this section, we overview previous works on the top-k dominating query in the “certain” database and query processing in the context of uncertain databases.

Dynamic dominance relationship

First, we give the definition of dynamic dominance between two points u and v with respect to a query point q.

Definition 3.1

Dynamic Dominance [12]

Given a query point q and two points u and v, point u dynamically dominates point v with respect to q (denoted as $u \prec_q v$), iff (1) $|u.A_i - q.A_i| \le |v.A_i - q.A_i|$ for all dimensions $1 \le i \le d$, and (2) $|u.A_j - q.A_j| < |v.A_j - q.A_j|$ for at least one dimension $1 \le j \le d$, where $u.A_i$ is the coordinate of point u on the ith dimension.
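As a minimal sketch of this definition, assuming it is the usual dynamic dominance condition (u is at least as close to q as v on every dimension, and strictly closer on at least one), the check can be written as follows. The function name dynamically_dominates and the sample points are hypothetical; the points are not those of Fig. 1.

```python
from typing import Sequence


def dynamically_dominates(u: Sequence[float], v: Sequence[float],
                          q: Sequence[float]) -> bool:
    """Return True if u dynamically dominates v with respect to q."""
    if not (len(u) == len(v) == len(q)):
        raise ValueError("u, v and q must have the same dimensionality")
    du = [abs(ui - qi) for ui, qi in zip(u, q)]  # per-dimension distance of u to q
    dv = [abs(vi - qi) for vi, qi in zip(v, q)]  # per-dimension distance of v to q
    no_worse_everywhere = all(a <= b for a, b in zip(du, dv))
    strictly_better_somewhere = any(a < b for a, b in zip(du, dv))
    return no_worse_everywhere and strictly_better_somewhere


q = (5, 5)
print(dynamically_dominates((6, 4), (2, 8), q))  # distances (1, 1) vs (3, 3): True
print(dynamically_dominates((6, 4), (4, 6), q))  # distances (1, 1) vs (1, 1): False
print(dynamically_dominates((6, 9), (8, 6), q))  # distances (1, 4) vs (3, 1): False
```

The second call shows that two points at equal per-dimension distances from q do not dominate each other, and the third shows an incomparable pair where each point is closer on one dimension.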

Example 1

Fig. 1 illustrates a 2D example of dynamic dominance relationship between points. In

Probabilistic top-k dominating search

After formalizing the PTD problem, in this section, we study the efficiency issues of answering the PTD query in an uncertain database D. Naturally, one straightforward way to obtain the PTD answers is as follows. For each uncertain object t ∈ D, we sequentially scan instances sj in the entire database and check the dominance relationship between instances ti and sj for each instance combination in the database, while evaluating the score, score(t), defined in Definition 3.4. Clearly, this

Approximate probabilistic top-k dominating search

In this section, instead of exactly answering PTD queries, we propose an approximate approach that can achieve higher efficiency. Section 5.1 presents the basic idea of our approximate solutions. Section 5.2 discusses details of the approximate version of PTD.

PTD queries with uncertain query object

Up to now, we always assume that the query object for PTD is a precise point. In this section, we consider the problem where the PTD query object is also an uncertain object q containing instances $q_1, q_2, \ldots, q_{|q|}$. In this case, we define the score of an uncertain object as the expected score with different query instances $q_i$. Formally, $score(t) = \sum_{i=1}^{|q|} q_i.p \cdot score(q_i, t)$, where $q_i.p$ is the appearance probability of query instance $q_i$, and $score(q_i, t)$ is the score given by Eq. (2) considering $q_i$ as

Probabilistic top-k dominating search in arbitrary subspaces

In this section, we discuss the top-k dominating problem with a novel setting, that is, probabilistic top-k dominating query in the subspace (SUB-PTD) as given in Definition 3.5. In real-world applications like coal mine surveillance [45], [23], each uncertain object (e.g. sensory data collected from sensors) may contain many different attributes like the density of gas, oxygen, and dust, the temperature, humidity, light, and so on. Each event, however, may correspond to only a few, instead of

Experimental evaluation

In this section, we demonstrate the efficiency and effectiveness of our proposed approaches to answer probabilistic top-k dominating (PTD) queries.

Data Sets. Specifically, we test the performance of PTD queries over both real and synthetic data sets with different parameter settings. For synthetic data sets, in order to generate an uncertain object t for a d-dimensional uncertain database, we first determine a region centered at location Ct and with proximity radius rt ∈ [rmin, rmax] in a data

Conclusions

Query processing in the uncertain database has attracted considerable attention recently due to the wide existence of uncertain data in many real applications. In this paper, we formulate and tackle the probabilistic top-k dominating (PTD) query in the uncertain database. In particular, we formalize the PTD query, which retrieves k uncertain objects in the database that dynamically dominate the highest number of instances for all possible instance combinations. Then, we propose an effective method to reduce the

Acknowledgments

This work is supported in part by National Grand Fundamental Research 973 Program of China under Grant 2012-CB316200, NSFC Key Project 61232018, the Hong Kong RGC GRF Project No. 611411, HP IRP Project 2011, Microsoft Research Asia Grant, MRA11EG05 and HKUST RPC Grant RPC10EG13.

References (50)

  • O. Benjelloun, A. Das Sarma, A.Y. Halevy, J. Widom, ULDBs: databases with uncertainty and lineage, in: Proceedings of...
  • C. Böhm, A. Pryakhin, M. Schubert, The Gauss-tree: efficient object identification in databases of probabilistic...
  • S. Börzsönyi, D. Kossmann, K. Stocker, The skyline operator, in: Proceedings of 17th International Conference on Data...
  • T. Calders, N. Dexters, B. Goethals, Mining frequent itemsets in a stream, in: Proceedings of 2007 IEEE International...
  • J. Chen, R. Cheng, Efficient evaluation of imprecise location-dependent queries, in: Proceedings of 23th International...
  • L. Chen, M.T. Ozsu, V. Oria, Robust and fast similarity search for moving object trajectories, in: Proceedings of ACM...
  • R. Cheng, L. Chen, J. Chen, X. Xie, Evaluating probability threshold k-nearest-neighbor queries over uncertain...
  • R. Cheng, D. Kalashnikov, S. Prabhakar, Querying imprecise data in moving object environments, in: IEEE Transactions on...
  • R. Cheng, D.V. Kalashnikov, S. Prabhakar, Evaluating probabilistic queries over imprecise data, in: Proceedings of ACM...
  • R. Cheng, S. Singh, S. Prabhakar, U-DBMS: a database system for managing constantly-evolving data, in: Proceedings of...
  • G. Cormode, M. Garofalakis, Sketching probabilistic data streams, in: Proceedings of ACM SIGMOD International...
  • E. Dellis, B. Seeger, Efficient computation of reverse skyline queries, in: Proceedings of 33rd International...
  • E. Dellis, A. Vlachou, I. Vladimirskiy, B. Seeger, Y. Theodoridis, Constrained subspace skyline computation, in: CIKM,...
  • A. Faradjian, J. Gehrke, P. Bonnet, Gadt: a probability space ADT for representing and querying the physical world, in:...
  • M. Hua, J. Pei, W. Zhang, X. Lin, Ranking queries on uncertain data: a probabilistic threshold approach, in:...
  • W. Jin et al., On efficient processing of subspace skyline queries on high dimensional data, SSDBM (2007)
  • G. Kollios, K. Yi, F. Li, D. Srivastava, Efficient processing of top-k queries in uncertain databases, in: Proceedings...
  • H.-P. Kriegel, P. Kroger, M. Schubert, Z. Zhu, Efficient query processing in arbitrary subspaces using vector...
  • H.-P. Kriegel, P. Kunath, M. Pfeifle, M. Renz, Probabilistic similarity join on uncertain data, in: Proceedings of...
  • H.-P. Kriegel, P. Kunath, M. Renz, Probabilistic nearest-neighbor query on uncertain objects, in: Proceedings of...
  • I. Lazaridis, S. Mehrotra, Progressive approximate aggregate queries with a multi-resolution tree structure, in:...
  • J. Li et al., A unified approach to ranking in probabilistic databases, PVLDB (2009)
  • M. Li et al., Underground coal mine monitoring with wireless sensor networks, ACM Transactions on Sensor Networks (2009)
  • X. Lian, L. Chen, Monochromatic and bichromatic reverse skyline search over uncertain databases, in: Proceedings of ACM...
  • X. Lian, L. Chen, Probabilistic ranked queries in uncertain databases, in: Proceedings of 11th International Conference...