Continuously maintaining approximate quantile summaries over large uncertain datasets
Introduction
Over recent years, massive datasets generated in the presence of uncertainty have become increasingly common in numerous applications, e.g., sensor networks, environmental monitoring, moving object management, data cleaning, and data integration. The uncertainty in these applications results from unreliable data transfer, imprecise measurement, repeated sampling, privacy protection, and so forth [18], [28]. These applications have created the demand for efficiently processing and managing massive uncertain datasets, which has gradually become a first-class issue in modern database systems [6], [8]. To this end, the first and most fundamental operation is to use effective data reduction techniques to compress large amounts of uncertain data into summaries that capture important characteristics of the original data [8], [50]. Similar to their deterministic counterparts, such summaries provide the foundation for query processing, query planning and optimization, and statistical data analysis over uncertain data [6], [8], [50].
Among the various data reduction techniques [1], quantile summarization is one that can efficiently characterize the distribution of a real-world dataset. Informally, a quantile is the element at a specified position of a sorted data sequence. Recently, most studies have focused on estimating approximate quantiles, in which an error bound is traded for improved space and time efficiency. Beyond the fields above, approximate quantiles are also very useful for data mining, database parallelization, and the management of data streams [14]. Hence, it is interesting and important to compute approximate quantile summaries on massive uncertain datasets. Such summaries can approximate the distribution function induced by these datasets with reasonable precision, supporting the aforementioned uses. Some more concrete examples are as follows.
- In sensor networks, sensors are commonly deployed in an ad-hoc fashion to monitor various physical phenomena, such as temperature, sound, and light intensity [19], [40]; to conserve energy, individual sensor nodes are usually designed to transmit the distribution of the uncertain sensor values to base stations, so that users can pose sophisticated queries or analyses over the sensor data [15], [19].
- In data stream classification applications, taking credit card fraud detection as an example, private customer information such as age, address, and occupation may be masked by imprecise values when published for data mining; the distribution of an uncertain data stream can then serve as sufficient information for constructing classification models, such as very fast decision trees [11].
- In the management of uncertain datasets, an approximation of the datasets' distribution can be used to support query optimization or aggregate query processing, such as estimating the cardinality over two uncertain attributes [6], [41].
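Before proceeding, the quantile notions above can be made concrete. The sketch below (illustrative only; `exact_quantile` and `is_eps_approximate` are names of our own, not from the paper) contrasts an exact φ-quantile with the ε-approximate guarantee commonly used in this line of work: an ε-approximate φ-quantile may be any element whose rank lies within εn of φn.

```python
import math

def exact_quantile(data, phi):
    """Return the element at rank ceil(phi * n) of the sorted data."""
    s = sorted(data)
    rank = max(1, math.ceil(phi * len(s)))
    return s[rank - 1]

def is_eps_approximate(data, phi, candidate, eps):
    """Check whether `candidate` is a valid eps-approximate phi-quantile:
    some rank it occupies must fall within [(phi - eps) * n, (phi + eps) * n]."""
    s = sorted(data)
    n = len(s)
    lo = s.index(candidate) + 1        # smallest rank occupied by candidate
    hi = n - s[::-1].index(candidate)  # largest rank occupied by candidate
    return lo <= (phi + eps) * n and hi >= (phi - eps) * n
```

The point of the approximate guarantee is that many elements qualify, which is what makes small-space summaries possible.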
Despite this practical importance, very few works have been proposed for computing summaries of uncertain data (we review related work in the next section). Important existing works include computing histogram-based, wavelet-based, and aggregate-based summaries over probabilistic data [6], [8], [20], [47]. These works only consider categorical or static data; however, uncertain numerical data, which is frequently generated in an incremental or streaming fashion [11], [32], is ubiquitous in many application domains. In the applications above, for example, the attributes of interest, such as temperature, sound, age, and income, are generated continuously, with uncertainty represented by continuous probability distributions, e.g., uniform or Gaussian distributions [10], [28]. In our previous work [32], to classify uncertain data streams, we gave a somewhat simplistic solution for incrementally approximating the Gaussian distribution of these streams. That solution, however, does not bound the error between the obtained distribution and the real one.
Motivated by the above, in this paper we consider the problem of incrementally computing approximate quantile summaries over a large dataset with numerical uncertainty. The definition of a quantile normally presupposes an ordered sequence; data elements with uncertainty, however, cannot be sorted. Thus, instead of defining quantiles over sorted data, we define quantiles over uncertain data based on probabilistic cardinality. Building on the GK algorithm [14], we propose an algorithm, named uGK, for computing quantile summaries over uncertain data online. The proposed uGK algorithm uses a similar data structure to store the summary and has space complexity similar to that of GK. It makes only a single scan over the uncertain data and requires very little space, yet obtains a summary that approximates the real observed distribution within a given error. Our main contributions can be summarized as follows:
- We define quantiles over uncertain datasets based on probabilistic cardinality, and state the problem of computing quantiles over such datasets.
- Building on the GK algorithm, we develop the uGK algorithm to compute quantile summaries over uncertain datasets incrementally.
- We theoretically analyze the space and time complexity of our uGK algorithm.
- We conduct a comprehensive experimental study on both real and synthetic datasets, illustrating the effectiveness of the proposed uGK algorithm.
In the rest of this paper, we first discuss related work in Section 2. The formal problem definition is given in Section 3. We describe the details of our uGK algorithm in Section 4. In Section 5 we analyze the time and space complexity of uGK. Finally, the experimental study is presented in Section 6, and we conclude in Section 7.
Uncertain data management
The study of uncertain data management began in the late 1980s. Since then, there has been growing interest in this domain, and numerous works have been reported, spanning a wide range of issues from data models [13] and pre-processing [5], [12] to storage and indexing [46] and query processing [29]. In addition, several institutes have developed prototype systems for managing such data, e.g., Trio [51], MystiQ [3], Orion [44], BayesStore [49] and Avatar [22]. In these works, data
Problem definition
In this section, we begin by introducing the data model. We then discuss some notions regarding quantiles on exact datasets and extend them to uncertain data streams (we summarize large uncertain datasets in a streaming fashion; hence, we use the terms large uncertain dataset and uncertain data stream interchangeably hereafter). Lastly, we formally define the problem of computing quantiles over uncertain data streams.
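As a preview of the probabilistic-cardinality view, the sketch below gives one plausible reading of it under the assumption of uniform-interval uncertainty (the paper's formal definition is stated in this section; the function names and the bisection search are ours, for illustration only): the probabilistic cardinality of a value v is the expected number of uncertain elements not exceeding v, and the φ-quantile is the smallest v whose probabilistic cardinality reaches φ·n.

```python
def prob_cardinality(intervals, v):
    """Expected number of uncertain elements <= v, where each element is
    uniform on [a, b] with a < b: P(X <= v) = clamp((v - a) / (b - a), 0, 1)."""
    total = 0.0
    for a, b in intervals:
        if v >= b:
            total += 1.0
        elif v > a:
            total += (v - a) / (b - a)
    return total

def uncertain_quantile(intervals, phi, tol=1e-9):
    """Smallest v whose probabilistic cardinality reaches phi * n, found by
    bisection over the value domain (valid because the cardinality is
    continuous and non-decreasing in v)."""
    n = len(intervals)
    lo = min(a for a, _ in intervals)
    hi = max(b for _, b in intervals)
    target = phi * n
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if prob_cardinality(intervals, mid) >= target:
            hi = mid
        else:
            lo = mid
    return hi
```

Note how this recovers the usual rank-based quantile when every interval degenerates toward a point: each element then contributes a full unit of cardinality exactly when v passes it.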
Quantile summarization algorithm for uncertain data streams
For each element arriving from an uncertain data stream, the proposed uGK algorithm performs insert and merge operations over a summary stored in a specially designed data structure.
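uGK inherits its tuple structure and its insert/merge cycle from GK [14]. The following is a minimal, self-contained sketch of the classical GK summary on deterministic values, which is the machinery uGK extends; it is not the uGK algorithm itself, and all names are illustrative. Each tuple `[v, g, delta]` stores a value, the rank gap from the previous tuple, and the rank uncertainty, maintaining the invariant `g + delta <= 2*eps*n`.

```python
import math

class GKSummary:
    """Simplified Greenwald-Khanna quantile summary (illustrative sketch)."""

    def __init__(self, eps):
        self.eps = eps
        self.tuples = []  # [v, g, delta], kept sorted by value
        self.n = 0
        self.period = max(1, int(1 / (2 * eps)))  # compress every `period` inserts

    def insert(self, v):
        i = 0
        while i < len(self.tuples) and self.tuples[i][0] < v:
            i += 1
        if i == 0 or i == len(self.tuples):
            # new minimum or maximum: its rank is known exactly
            self.tuples.insert(i, [v, 1, 0])
        else:
            self.tuples.insert(i, [v, 1, math.floor(2 * self.eps * self.n)])
        self.n += 1
        if self.n % self.period == 0:
            self._compress()

    def _compress(self):
        # merge a tuple into its successor whenever the combined rank
        # uncertainty stays within the 2*eps*n invariant
        cap = 2 * self.eps * self.n
        i = 0
        while i < len(self.tuples) - 1:
            t, s = self.tuples[i], self.tuples[i + 1]
            if t[1] + s[1] + s[2] <= cap:
                s[1] += t[1]
                del self.tuples[i]
            else:
                i += 1

    def query(self, phi):
        # return a stored value whose rank is close to phi * n
        target = phi * self.n
        rmin = 0
        for v, g, delta in self.tuples:
            rmin += g
            if rmin + delta >= target:
                return v
        return self.tuples[-1][0]
```

The merging step is what keeps the summary small: tuples whose combined rank span still fits under the error budget are collapsed, so the number of stored tuples grows only logarithmically with n in the full GK analysis.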
Complexity analysis
In this section, we analyze the space and time complexity of our proposed uGK algorithm.
Experimental study
In this section we present results from an extensive empirical study over a series of datasets to illustrate the effectiveness of the proposed uGK algorithm. We compare uGK with two naive methods based on the GK algorithm [14], namely SPL-GK and AVG-GK, and with our previous Gaussian approximation (GA) method [32], in terms of the number of tuples and quantile query error.
- SPL-GK. This method samples a point from each item of the input uncertain data stream and computes quantile summaries on this deterministic
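The SPL-GK baseline's reduction step can be sketched as follows (assuming, purely for illustration, that each uncertain item is given as the (mean, std) of a Gaussian; the function name and the exact quantile taken on the sampled stream, as a stand-in for a GK summary's answer, are ours):

```python
import random

def spl_reduce(uncertain_stream, rng):
    """Replace each uncertain item (mean, std) by one Gaussian sample,
    producing a deterministic stream that a GK summary can consume."""
    return [rng.gauss(mean, std) for mean, std in uncertain_stream]

# Usage: reduce a small uncertain stream, then take an exact quantile of
# the sampled stream in place of the GK summary's approximate answer.
rng = random.Random(7)
stream = [(float(i), 0.1) for i in range(1, 101)]  # means 1..100, small noise
sampled = spl_reduce(stream, rng)
median = sorted(sampled)[len(sampled) // 2 - 1]
```

Because each item contributes only one sample, SPL-GK is cheap but discards most of the distributional information, which is exactly the weakness the experiments probe.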
Conclusion
We studied the problem of computing approximate quantiles in the presence of numerical uncertainty. We defined quantiles for uncertain datasets based on probabilistic cardinality rather than on an ordered sequence. Building on the GK algorithm, we presented a novel algorithm, uGK, for efficiently maintaining such quantile summaries over large uncertain datasets online. The proposed uGK algorithm uses a similar data structure and has space complexity similar to that of GK. By making only a single
Acknowledgments
This research is substantially supported by the National Natural Science Foundation of China (61402375) and the National High-tech R&D Program of China (2013AA10230402). The authors would like to thank the anonymous reviewers of this paper. Their valuable and constructive suggestions have played a significant role in improving the quality of this work.
References (53)
- et al., "Model-driven Data Acquisition in Sensor Networks," Proceedings of VLDB Endowment (VLDB'04), 2004.
- et al., "Probabilistic skyline queries on uncertain time series," Neurocomputing, 2016.
- et al., "A histogram method for summarizing multi-dimensional probabilistic data," Procedia Comput. Sci., 2013.
- et al., "Moving range K nearest neighbor queries with quality guarantee over uncertain moving objects," Inf. Sci., 2015.
- et al., "Network Voronoi diagram on uncertain objects for nearest neighbor queries," Inf. Sci., 2015.
- et al., "Learning very fast decision tree from uncertain data streams with positive and unlabeled samples," Inf. Sci., 2012.
- et al., "Selection and sorting with limited storage," Theor. Comput. Sci., 1980.
- et al., "Trustworthy answers for top-k queries on uncertain Big Data in decision making," Inf. Sci., 2015.
- et al., "Efficient computation for probabilistic skyline over uncertain preferences," Inf. Sci., 2015.
- et al., "Efficient monochromatic and bichromatic probabilistic reverse top-k query processing for uncertain big data," J. Comput. Syst. Sci., 2017.
- "Mergeable Summaries," Proceedings of the ACM SIGMOD-SIGACT-SIGART Symposium on Principles of Database Systems (PODS'12).
- "A one-pass space-efficient algorithm for finding quantiles," Proceedings of the International Conference on Management of Data (COMAD'99).
- "MayBMS: managing incomplete information with probabilistic world-set decompositions," Proceedings of the IEEE International Conference on Data Engineering (ICDE'07).
- "Random sampling for histogram construction: how much is enough?," Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'98).
- "Cleaning uncertain data with quality guarantees," Proceedings of the VLDB Endowment (VLDB'08).
- "Probabilistic histograms for probabilistic data," Proceedings of the VLDB Endowment (VLDB'09).
- "Sketching probabilistic data streams," Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'07).
- "Histograms and wavelets on probabilistic data," Proceedings of the IEEE International Conference on Data Engineering (ICDE'09).
- "Effective computation of biased quantiles over data streams," Proceedings of the International Conference on Data Engineering (ICDE'05).
- "Mining high-speed data streams," Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (SIGKDD'00).
- "Data integration with uncertainty," VLDB J.
- "Models for Incomplete and Probabilistic Information."
- "Space-efficient online computation of quantile summaries," Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'01).
- "Power-conserving computation of order-statistics over sensor networks," Proceedings of the ACM SIGMOD International Conference on Management of Data (SIGMOD'11).
- "Stream Order and Order Statistics: Quantile Estimation in Random-Order Streams."
- "Fast on-line summarization of RFID probabilistic data streams," Commun. Comput. Inf. Sci.