
Information Sciences

Volume 456, August 2018, Pages 174-190

Continuously maintaining approximate quantile summaries over large uncertain datasets

https://doi.org/10.1016/j.ins.2018.04.070

Highlights

  • We define quantiles over uncertain datasets in terms of probabilistic cardinality.

  • We develop a novel algorithm, named uGK, to compute approximate quantile summaries over uncertain datasets incrementally.

  • We theoretically analyze the complexity of the uGK algorithm and experimentally verify its efficiency.

  • Using only a small amount of space, our uGK algorithm obtains summaries that can support any quantile query within a given error.

Abstract

Quantile summarization is a useful tool for managing massive datasets in a rapidly growing number of applications, and its importance is further enhanced when the data being explored is uncertain. In this paper, we focus on the problem of computing approximate quantile summaries over large uncertain datasets. On the basis of the GK algorithm [14], we propose a novel online algorithm, named uGK. Using only a small amount of space, the proposed uGK algorithm maintains a small set of tuples, each of which contains a point value and the “count” of uncertain elements that are not larger than this value, and supports any quantile query within a given error. Experimental evaluation on both synthetic and real-life datasets illustrates the effectiveness of our uGK algorithm.

Introduction

In recent years, massive datasets generated in the presence of uncertainty have become increasingly common in numerous applications, e.g., sensor networks, environmental monitoring, moving object management, data cleaning, and data integration. The uncertainty in these applications results from unreliable data transfer, imprecise measurement, repeated sampling, privacy protection, and so forth [18], [28]. These applications have created a demand for efficiently processing and managing massive uncertain datasets, which has gradually become a first-class issue in modern database systems [6], [8]. To this end, the first and fundamental operation is to use effective data reduction techniques to compress large amounts of uncertain data into summaries that capture important characteristics of the original data [8], [50]. Similar to their deterministic counterparts, such summaries provide the foundation for query processing, query planning and optimization, and statistical data analysis over uncertain data [6], [8], [50].

Among various data reduction techniques [1], quantile summarization is one that can efficiently characterize the distribution of a real-world dataset. Informally, a quantile is the element at a specified position of a sorted data sequence. Recently, most studies have focused on estimating approximate quantiles, in which an error bound is used to improve space and time efficiency. Beyond the fields above, approximate quantiles are also very useful in data mining, database parallelization, and management of data streams [14]. Hence, it is interesting and important to compute approximate quantile summaries on massive uncertain datasets. An estimate of such summaries over massive datasets with uncertainty can approximate, with reasonable precision, the distribution function induced by those datasets, supporting the aforementioned uses. Some more concrete examples are as follows.
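To make the exact notion concrete before the approximate one is introduced (the function name and the 1-indexed rank convention below are ours, not the paper's), the ϕ-quantile of a deterministic dataset is simply the element of rank ⌈ϕn⌉ in sorted order:

```python
import math

def exact_quantile(data, phi):
    """Return the phi-quantile (0 < phi <= 1): the element of rank
    ceil(phi * n) in the sorted sequence, using 1-indexed ranks."""
    s = sorted(data)
    rank = max(1, math.ceil(phi * len(s)))
    return s[rank - 1]

print(exact_quantile([3, 1, 4, 1, 5, 9, 2, 6], 0.5))  # 3 (rank 4 of 8)
```

Approximate ϵ-quantile algorithms such as GK relax this to any element whose rank lies within ⌈ϕn⌉ ± ϵn, which is what makes small-space, single-pass summaries possible.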

  • In sensor networks, sensors are commonly deployed in an ad-hoc fashion to monitor various physical quantities, such as temperature, sound, light intensity, and so forth [19], [40]; for energy conservation, individual sensor nodes are usually designed to transmit the distribution of the uncertain sensor values to base stations, so that users can pose sophisticated queries or analyses over the sensor data [15], [19].

  • In data stream classification applications, such as credit card fraud detection, private customer information such as age, address, and occupation may be masked with imprecise values when published for data mining purposes; the distribution of an uncertain data stream can then serve as sufficient information for constructing classification models, such as very fast decision trees [11].

  • In the management of uncertain datasets, an approximation of the dataset's distribution can be used for query optimization or aggregate query processing, such as estimating cardinality over two uncertain attributes [6], [41].

Despite the practical importance above, very few works have been proposed to compute summaries of uncertain data (we review related work in the next section). Existing important works include computing histogram-based, wavelet-based, and aggregate-based summaries on probabilistic data [6], [8], [20], [47]. These works consider only categorical or static data; however, uncertain numerical data, which is frequently generated in an incremental or streaming fashion [11], [32], is ubiquitous in many application domains. Taking the applications above as examples, the attributes of interest, such as temperature, sound, age, and income, are generated continuously with uncertainty represented by continuous probability distributions, e.g., uniform or Gaussian distributions [10], [28]. In our previous work [32], to classify uncertain data streams, we gave a somewhat simplistic solution that incrementally approximates the Gaussian distribution of these streams. That solution, however, does not bound the error between the obtained distribution and the real one.

Motivated by the above, in this paper we consider the problem of incrementally computing approximate quantile summaries over a large dataset with numerical uncertainty. Generally, the definition of a quantile relies on an ordered sequence; however, data elements with uncertainty cannot be sorted. Thus, instead of defining quantiles according to sorted data, we define quantiles over uncertain data based on probabilistic cardinality. On the basis of the GK algorithm [14], we propose an algorithm, named uGK, for computing quantile summaries over uncertain data online. The proposed uGK algorithm uses a data structure similar to that of GK to store the summary and has a similar space complexity. During computation, it makes only a single scan over the uncertain data and requires very little space, yet obtains a summary that can approximate the real observed distribution within a given error. Our main contributions can be summarized as follows:

  • We give the definition of quantile over uncertain datasets based on probabilistic cardinality, and state the problem of computing quantiles over such datasets.

  • On the basis of GK algorithm, we develop uGK algorithm to compute quantile summaries over uncertain datasets incrementally.

  • We theoretically analyze space and time complexity of our uGK algorithm.

  • We conduct a comprehensive experimental study on both real and synthetic datasets, illustrating the effectiveness of our proposed uGK algorithm.

In the rest of this paper, we first discuss related work in Section 2. The formal problem definition is given in Section 3. We describe the details of our uGK algorithm in Section 4, and analyze its time and space complexity in Section 5. Finally, the experimental study is presented in Section 6, and we conclude our work in Section 7.

Section snippets

Uncertain data management

The study of uncertain data management began in the late 1980s. Since then, there has been growing interest in this domain, and numerous works have been reported, spanning a wide range of issues from data models [13] and pre-processing [5], [12] to storage and indexing [46] and query processing [29]. In addition, several institutes have developed prototype systems for managing such data, e.g., Trio [51], MystiQ [3], Orion [44], BayesStore [49], and Avatar [22]. In these works, data

Problem definition

In this section, we begin with introducing the data model. Then we discuss some notions regarding quantile on exact datasets, and extend to uncertain data streams (we summarize large uncertain datasets in a stream fashion, hence, we may use large uncertain datasets or uncertain data streams alternatively hereafter). Lastly, we formally define the problem of computing quantile over uncertain data streams.
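The formal definition appears in the full text; as a hedged illustration of the underlying idea, for independent uncertain elements the probabilistic cardinality of the set of elements not larger than a value v can be read as the expected count, i.e., the sum of each element's probability of being at most v. The Gaussian item model and the function names below are our own assumptions, not the paper's notation:

```python
import math

def gaussian_cdf(x, mu, sigma):
    """P(X <= x) for X ~ N(mu, sigma^2), computed via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def probabilistic_cardinality(items, v):
    """Expected number of independent uncertain elements not larger than v;
    each item is a Gaussian element given as (mu, sigma)."""
    return sum(gaussian_cdf(v, mu, sigma) for mu, sigma in items)

items = [(10.0, 1.0), (12.0, 2.0), (15.0, 0.5)]
# The first element is almost surely <= 12, the second sits at its mean
# (probability 0.5), and the third is almost surely > 12.
print(probabilistic_cardinality(items, 12.0))  # ≈ 1.477
```

Unlike a deterministic rank, this "count" is fractional, which is why the summary tuples described next store probability-weighted counts rather than integers.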

Quantile summarization algorithm for uncertain data streams

For each element arriving from an uncertain data stream, the proposed uGK algorithm performs inserting and merging operations over a summary stored in a specially designed data structure.
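The detailed insert and merge rules appear in the full text. As a rough sketch of the deterministic GK skeleton that uGK builds on (our simplified rendering, not the paper's algorithm; uGK replaces the integer counts g with probabilistic cardinalities), the summary keeps tuples (v, g, Δ) sorted by value and merges adjacent tuples whenever the merged tuple's rank uncertainty stays within 2ϵn:

```python
import math

class GKSummary:
    """Simplified Greenwald-Khanna summary. Each tuple [v, g, d] stores a
    value v, the gap g between its minimum rank and that of the previous
    tuple, and d, the spread between its maximum and minimum rank."""

    def __init__(self, eps):
        self.eps = eps
        self.n = 0
        self.S = []  # tuples [v, g, d], kept sorted by v

    def insert(self, v):
        i = 0
        while i < len(self.S) and self.S[i][0] < v:
            i += 1
        # New extreme values have exact rank (d = 0); interior values get
        # the largest d that keeps the invariant g + d <= 2*eps*n.
        d = 0 if i in (0, len(self.S)) else max(0, math.floor(2 * self.eps * self.n) - 1)
        self.S.insert(i, [v, 1, d])
        self.n += 1
        if self.n % max(1, int(1.0 / (2 * self.eps))) == 0:
            self._compress()

    def _compress(self):
        # Merge a tuple into its right neighbour when the merged tuple
        # still satisfies g + d <= 2*eps*n (the extremes are never merged).
        bound = 2 * self.eps * self.n
        i = len(self.S) - 2
        while i >= 1:
            g = self.S[i][1]
            g2, d2 = self.S[i + 1][1], self.S[i + 1][2]
            if g + g2 + d2 <= bound:
                self.S[i + 1][1] = g + g2
                del self.S[i]
            i -= 1

    def query(self, phi):
        # Return a value whose rank is within eps*n of ceil(phi*n).
        target = max(1, math.ceil(phi * self.n))
        allowed = self.eps * self.n
        rmin = 0
        for i, (v, g, d) in enumerate(self.S):
            rmin += g
            if rmin + d > target + allowed:
                return self.S[max(0, i - 1)][0]
        return self.S[-1][0]
```

The invariant g + Δ ≤ 2ϵn on every tuple is what guarantees that query answers stay within ϵn of the true rank; uGK maintains the analogous invariant with fractional, probability-weighted counts.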

Complexity analysis

In this section, we analyze the space and time complexity of our proposed uGK algorithm.

Experimental study

In this section, we present results from an extensive empirical study over a series of datasets to illustrate the effectiveness of the proposed uGK algorithm. We compare uGK with two naive methods based on the GK algorithm [14], namely SPL-GK and AVG-GK, and with our previous Gaussian approximation (GA) method [32], in terms of the number of tuples and quantile query errors.

  • SPL-GK. This method samples a point from each item of the input uncertain data stream and computes quantile summaries on this deterministic
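Although the description of SPL-GK is truncated in this excerpt, its first step, turning each uncertain item into a single point so that the deterministic GK algorithm applies, can be sketched as follows (the Gaussian item model and the function name are our assumptions):

```python
import random

def sample_stream(uncertain_items, seed=0):
    """Draw one sample per uncertain item, modeled here as a Gaussian
    (mu, sigma), yielding a deterministic stream that an ordinary GK
    summary can consume."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for mu, sigma in uncertain_items]

points = sample_stream([(10.0, 1.0), (12.0, 2.0), (15.0, 0.5)])
print(len(points))  # 3
```

Because a single sample per item discards most of each element's distribution, such a baseline can be expected to show larger quantile errors than a method that tracks probabilistic cardinality directly.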

Conclusion

We studied the problem of computing ϵ-approximate ϕ-quantiles in the presence of numerical uncertainty. We defined quantiles for uncertain datasets based on probabilistic cardinality rather than on an ordered sequence. On the basis of the GK algorithm, we presented a novel algorithm, uGK, for efficiently maintaining such quantile summaries over large uncertain datasets online. The proposed uGK algorithm used a data structure similar to that of GK and had a similar space complexity. By making only a single

Acknowledgments

This research is substantially supported by the National Natural Science Foundation of China (61402375) and the National High-tech R&D Program of China (2013AA10230402). The authors would like to thank the anonymous reviewers of this paper. Their valuable and constructive suggestions have played a significant role in improving the quality of this work.

References (53)

  • P.K. Agarwal et al.

    Mergeable Summaries

    Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’12)

    (2012)
  • B.R. Agrawal et al.

    A one-pass space-efficient algorithm for finding quantiles

    Proceedings of the International Conference on Management of Data (COMAD’99)

    (1999)
  • L. Antova et al.

    MayBMS: managing incomplete information with probabilistic world-set decompositions

    Proceedings of the IEEE International Conference on Data Engineering (ICDE’07)

    (2007)
  • S. Chaudhuri et al.

    Random sampling for histogram construction: how much is enough?

    Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD’98)

    (1998)
  • R. Cheng et al.

    Cleaning uncertain data with quality guarantees

    Proceedings of the VLDB Endowment (VLDB’08)

    (2008)
  • G. Cormode et al.

    Probabilistic histograms for probabilistic data

    Proceedings of the VLDB Endowment (VLDB’09)

    (2009)
  • G. Cormode et al.

    Sketching probabilistic data streams

    Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD’07)

    (2007)
  • G. Cormode et al.

    Histograms and wavelets on probabilistic data

    Proceedings of IEEE International Conference on Data Engineering (ICDE’09)

    (2009)
  • G. Cormode et al.

    Effective computation of biased quantiles over data streams

    Proceedings of the IEEE International Conference on Data Engineering (ICDE’05)

    (2005)
  • P. Domingos et al.

    Mining high-speed data streams

    Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD’00)

    (2000)
  • X.L. Dong et al.

    Data integration with uncertainty

    VLDB J.

    (2009)
  • T.J. Green et al.

    Models for Incomplete and Probabilistic Information

    (2006)
  • M. Greenwald et al.

    Space-efficient online computation of quantile summaries

    Proceedings of ACM SIGMOD International Conference on Management of Data (SIGMOD’01)

    (2001)
  • M.B. Greenwald et al.

    Power-conserving computation of order-statistics over sensor networks

    Proceedings of ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS’04)

    (2004)
  • S. Guha et al.

    Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams

    (2009)
  • R. Haider et al.

    Fast on-line summarization of RFID probabilistic data streams

    Commun. Comput. Inf. Sci.

    (2012)