Processing top-k queries from samples
Introduction
Top-k computation is an important data processing tool and constitutes a basic aggregation query. In many applications it is not feasible to examine the whole dataset, and therefore approximate query processing is performed using a random sample of the records [4], [8], [13], [20], [14], [1]. These applications arise when the dataset is massive or highly distributed [12], as is the case with IP packet traffic that is both distributed and sampled, and with NetFlow records that are aggregated over sampled packet traces and collected distributively. Other applications arise when the value of the attribute we aggregate over is not readily available and determining it for a given record has an associated (computational or other) cost. For example, when we aggregate over the domain name that corresponds to a source or destination IP address, the domain name is obtained via a reverse DNS lookup, which we may want to perform only on a sample of the records.
A top-k query over some attribute asks for the k most common values of this attribute and their frequencies (numbers of occurrences) over a set of records. Examples of such queries are determining the top-100 Autonomous System (AS) destinations, the top-100 applications (web, p2p, other protocols), the 10 most popular Web sites, or the 20 most common domain names. These queries can be posed in terms of the number of IP packets (each packet is considered a record), the number of distinct IP flows (each distinct flow is considered a record), or another unit of interest. We are interested in processing top-k queries from a sample of the records, for example, from a sampled packet stream or from a sample of the set of distinct flows. We seek probabilistic or approximate answers that are provided with confidence intervals.
It is interesting to contrast top-k queries with proportion queries. A proportion query is to determine the frequency of a specified attribute value in a dataset. Examples of proportion queries are to estimate the fraction of IP packets or IP flows that belong to p2p applications, originate from a specific AS, or from a specific Web site.
Processing an approximate proportion query from a random sample is a basic and well understood statistical problem. The fraction of sampled records with the given attribute value is an unbiased estimator, and confidence intervals are obtained using standard methods.
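As a concrete illustration of this standard setting, the following sketch estimates a proportion from a sample and attaches a distribution-free (Hoeffding-based) confidence interval; the function name and the example data are illustrative, not from the paper:

```python
import math

def proportion_ci(sample, value, delta=0.05):
    """Unbiased proportion estimate with a distribution-free
    (Hoeffding-based) two-sided confidence interval at level 1 - delta."""
    s = len(sample)
    p_hat = sum(1 for r in sample if r == value) / s
    # Hoeffding margin: Pr[|p_hat - p| >= eps] <= 2 exp(-2 s eps^2) = delta
    eps = math.sqrt(math.log(2.0 / delta) / (2.0 * s))
    return p_hat, max(0.0, p_hat - eps), min(1.0, p_hat + eps)

# Example: estimate the fraction of 'p2p' records from a sample of 1000
sample = ['p2p'] * 300 + ['web'] * 700
est, lo, hi = proportion_ci(sample, 'p2p', delta=0.05)
```

Tighter intervals (e.g., normal-approximation or exact binomial) follow the same pattern; the Hoeffding margin is used here only because it needs no distributional assumptions.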
Processing top-k queries from samples is more challenging. When the complete dataset is observed, we can compute the frequency of each value and take the top-k most frequent values. When we have a random sample of the records, the natural estimator is the result of performing the same action on the sample. That is, obtaining the k most frequent values in the sample and proportionally scaling them to estimate the frequencies of the top-k values in the real dataset. This estimator, however, is biased upwards: The expectation of the combined frequency of the top-k items in the sample is generally larger than the value of this frequency over the unsampled records. This is a consequence of the basic statistical property that the expectation of the maximum of a set of random variables is generally larger than the maximum of their expectations. While this bias must be accounted for when deriving confidence intervals and when evaluating the relation between the sampled and the actual top-k sets, it is not easy to capture as it depends on the fine structure of the full distribution of frequencies in the unsampled dataset, which is not available to us.
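The upward bias described above is easy to reproduce numerically. The following sketch (names and the flat test distribution are illustrative) compares the expected scaled top-1 frequency in a sample against the true top-1 frequency for a dataset of many equally frequent values, where the bias is most pronounced:

```python
import random
from collections import Counter

def sampled_topk_weight(freqs, s, k, trials=2000, seed=1):
    """Average scaled top-k frequency over many samples of size s,
    together with the true top-k frequency. On flat distributions the
    sampled estimate is biased upwards: the expectation of the maximum
    of random variables exceeds the maximum of their expectations."""
    items = [v for v, f in enumerate(freqs) for _ in range(f)]
    true_topk = sum(sorted(freqs, reverse=True)[:k]) / len(items)
    rng = random.Random(seed)
    est = 0.0
    for _ in range(trials):
        counts = Counter(rng.choice(items) for _ in range(s))
        est += sum(c for _, c in counts.most_common(k)) / s
    return est / trials, true_topk

# 50 values of identical frequency: every value has true frequency 0.02,
# but the largest sampled frequency looks considerably heavier.
est, true_w = sampled_topk_weight([20] * 50, s=100, k=1)
```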
In Section 3 (Basic bounds for top-k sampling), Section 4 (The Naive confidence interval), Section 5 (Using a cumulative upper bound), Section 6 (Cross-validation methods), and Section 7 (Evaluation results), we devise and evaluate three basic methods to derive confidence intervals for top-k estimates. The main problem which we consider is to estimate the sum of the frequencies of the top-k values.
- “Naive” bound: Let f̂ be the sum of the frequencies of the top-k elements in the sample. We consider distributions (datasets) for which the probability that, in a sample, the sum of the frequencies of the top-k elements is at least f̂ is at least δ. Among these distributions we look for one whose sum of top-k frequencies is smallest, say this sum is x. We use x as the lower end of our confidence interval. By constructing the confidence interval this way we capture both the bias of the sampled top-k frequency and standard proportion error bounds. The definition of the Naive bound requires considering all distributions, which is not computationally feasible. To compute this interval, we identify a restricted set of distributions such that it is sufficient to consider only these distributions. We then construct a precomputed table providing the bound for the desired confidence level and sampled top-k frequency f̂.
- CUB bounds: We use the sample distribution to construct a cumulative upper bound (CUB) on the top-i weight for all i. We then use the CUB to restrict the set of distributions that must be taken into account in the lower bound construction. Therefore, we can potentially obtain tighter bounds than with the Naive approach. The CUB method, however, is computationally intensive, since we cannot use precomputed values.
- Cross-validation bounds: We borrow terminology from hypothesis testing. The sample is split into two parts, a “learning” part and a “testing” part. Let S be the sampled top-k set of the learning part. We use the sampled weight of S in the testing part to obtain the lower end of our confidence interval. We also consider variations of this method in which the sample is split into more parts.
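The two-part split described in the last bullet can be sketched as follows; this is a minimal illustration of the idea, not the paper's algorithm, and all names are hypothetical:

```python
import random
from collections import Counter

def cross_validation_estimate(sample, k, seed=None):
    """Split the sample into a learning half and a testing half, take the
    top-k set of the learning half, and measure its weight (fraction of
    records) in the testing half. Because the halves are disjoint, the
    testing-half weight is an unbiased estimate of the set's true weight."""
    rng = random.Random(seed)
    shuffled = list(sample)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    learn, test = shuffled[:half], shuffled[half:]
    topk = {v for v, _ in Counter(learn).most_common(k)}
    return sum(1 for r in test if r in topk) / len(test), topk

# Example: a skewed dataset; the learning half selects a top-2 set and
# the testing half scores it.
est, topk = cross_validation_estimate(['a'] * 50 + ['b'] * 30 + ['c'] * 20,
                                      k=2, seed=0)
```

Splitting into more than two parts, as the variations mentioned above do, amounts to repeating this learn/test step over several folds.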
We evaluate these methods on a collection of datasets that include IP traffic flow records collected from a large ISP and Web request data. We show (a precise characterization is provided in the sequel) that, in a sense, the hardest distributions, those with the worst confidence bounds for a given sampled top-k weight, are those with many large items that are close in size. Real-life distributions, however, are more Zipf-like, and therefore the cross-validation and CUB approaches can significantly outperform the Naive bounds. The Naive bounds, however, require the least amount of computation.
Most previous work addressed applications where the complete dataset can be observed [19], [6], [3], [17], [15] but resources are not sufficient to compute the exact frequency of each item. The challenge in this case is to find approximate most frequent items using limited storage or limited communication. Examples of such settings are a data stream, data that is distributed on multiple servers, distributed data streams [2], or data that resides on external memory. We address applications where we observe random samples rather than the complete dataset. The challenge is to estimate actual top frequencies from the available sample frequencies. These two settings are orthogonal. Our techniques and insights can be extended to a combined setting where the application observes a sample of the actual data and the available storage and communication do not allow us to obtain exact sample frequencies. We therefore need to first estimate sample frequencies from the observed sample, and then use these estimates to obtain estimates of the actual frequencies in the original dataset.
A problem related to the computation of top-k and heavy hitters is estimating the entire size distribution [16], [17] (estimating the number of items of a certain size, for all sizes). This is a more general problem than top-k and heavy hitters queries, and sampling can be quite inaccurate for estimating the complete size distribution [8] or even just the number of distinct items [4]. Clearly, sampling is too lossy for estimating the number of items with frequencies well under the sampling rate. The problem of finding top flows from sampled packet traffic was considered in [1], where empirical data was used to evaluate the number of samples required until the top-k set in the sample closely matches the top-k set in the actual distribution. Their work did not include methods to obtain confidence intervals. The performance metrics used in [1] are rank-based rather than weight-based. That is, the approximation quality is measured by the difference between the actual rank of a flow (e.g., 3rd largest in size) and its rank in the sampled trace (e.g., 10th largest in size), whereas our metrics are based on weight (the size of each flow). That is, if two flows are of very similar size, our metric does not penalize for not ranking them properly with respect to each other as it would for two flows of clearly different weights. As a result, the conclusion in [1], that a fairly high sampling rate is required, may not be applicable under weight-based metrics.
We are not aware of other work that focused on deriving confidence intervals for estimating the top-k frequencies and the heavy hitters from samples. Related work applied maximum likelihood (through the Expectation Maximization (EM) algorithm) to estimate the size distribution from samples [8], [17]. Unlike our schemes, these approaches do not provide rigorous confidence intervals.
Some work on distributed top-k was motivated by information retrieval applications and assumed sorted accesses to distributed index lists: each remote server maintains its own top-k list, and these lists can only be accessed in this order. Algorithms developed in this model include the well-known Threshold Algorithm (TA) [10], [11], TPUT [7], and algorithms with probabilistic guarantees [21]. In this model, the cost is measured by the number of sorted accesses. These algorithms are suited for applications where sorted accesses rather than random samples are readily available, as may be the case when the data is a list of results from a search engine.
An extended abstract of this paper has appeared in [5].
Section snippets
Preliminaries
Let I be a set of items j with weights w(j) for j ∈ I. For J ⊆ I, denote w(J) = Σ_{j∈J} w(j). We denote by T_i(J) (top-i set) a set of the i heaviest items in J, and by B_i(J) (bottom-i set) a set of the i lightest items in J. We also denote by w(T_i(J)) the weight of the top-i elements in J and by w(B_i(J)) the weight of the bottom-i elements in J.
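The top-i and bottom-i weights can be mirrored in a tiny sketch (the function names are illustrative, not the paper's notation):

```python
def top_weight(weights, i):
    """Weight of a top-i set: the sum of the i heaviest item weights."""
    return sum(sorted(weights, reverse=True)[:i])

def bottom_weight(weights, i):
    """Weight of a bottom-i set: the sum of the i lightest item weights."""
    return sum(sorted(weights)[:i])
```

For example, for weights [5, 1, 3, 2], the top-2 weight is 5 + 3 = 8 and the bottom-2 weight is 1 + 2 = 3.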
We have access to weighted samples, where in each sample, the probability that an item is drawn is proportional to its weight. In the analysis and
Basic bounds for top-k sampling
When estimating a proportion, we use the fraction of positive examples in the sample as our estimator. Using the notation we introduced earlier, we can use an interval of the form [p̂ − ε, p̂ + ε] around the sampled fraction p̂ as a confidence interval. It is also well understood how to obtain the number of samples needed for proportion estimation within some confidence and error bounds when the proportion is at least p.
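For the sample-size question, a simple distribution-free answer follows from the Hoeffding bound; the sketch below (function name illustrative, and not the paper's bound, which can be tighter when the proportion is at least p) computes the smallest s guaranteeing additive error ε with confidence 1 − δ:

```python
import math

def samples_for_proportion(eps, delta):
    """Smallest s such that a proportion estimate from s samples is within
    +/- eps of the truth with probability at least 1 - delta, using the
    Hoeffding bound 2 exp(-2 s eps^2) <= delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps * eps))

# e.g., within 0.05 of the true proportion with 95% confidence
s = samples_for_proportion(0.05, 0.05)
```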
When estimating the top-k weight from samples, we would like to derive confidence intervals and also to
The Naive confidence interval
Let f⁻ be the smallest f such that there exists a distribution F with top-k weight f such that, when sampling S of size s from F, the probability that the sampled top-k weight is at least the observed value f̂ is at least δ. Similarly, let f⁺ be the largest f such that there exists a distribution F with top-k weight f such that, when sampling S of size s, the probability that the sampled top-k weight is at most f̂ is at least δ.
Let S be a sample of s items from a distribution I and assume that the sampled top-k weight is f̂. By a proof similar to that of Lemma 1 one can show that the bound of Lemma 6 is a (1 − δ)-confidence upper
Using a cumulative upper bound
The derivation of the CUB resembles that of the Naive bound. We use the same upper bound and the difference is in the computation of the lower bound. As with the Naive bound, we look for the distribution with the smallest top-k weight that is at least likely to produce the sampled top-k weight that we observe. The difference is that we do not use the sampled top-k weight only, but extract more information from the sample to further restrict the set of distributions we have to consider,
Cross-validation methods
In this section, we borrow some concepts from machine learning and use the method of cross-validation to obtain confidence bounds. Intuitively, the methods we present in this section are based on the following lemma. Lemma 12: Let S1 and S2 be two disjoint subsets of the sample S. Let T be the top-k subset in S1. Then the weight of T in S2 is an unbiased estimate of the real weight, w(T), of T. Proof: In fact, the weight of T in S2 is a random variable counting the fraction of successes when we toss a coin of bias w(T),
Evaluation results
The algorithms were evaluated on all datasets, for top-100 and top-1, and confidence levels and . In the evaluation we consider the tightness of the estimates and confidence intervals. For the heuristic r-fold with s we also consider correctness of the lower bounds.
Conclusion and future directions
We developed several rigorous methods to derive confidence intervals and estimators for approximate top-k weight and top-k set queries over a sample of the dataset. Our work provides basic statistical tools for applications that provide only sampled data. The methods we developed vary in the amount of computation required and in the tightness of the bounds. Generally, methods that are able to uncover and exploit more of the structure of the distribution which we sample provide tighter bounds,
References (21)
- et al., Finding frequent items in data streams, Theor. Comput. Sci. (2004)
- et al., Improved bounds on the sample complexity of learning, J. Comput. Syst. Sci. (2001)
- et al., Ranking flows from sampled traffic
- et al., Distributed top-k monitoring
- et al., Towards estimation error guarantees for distinct values
- et al., Processing top-k queries from samples
- et al., What’s hot and what’s not: tracking most frequent items dynamically
- et al., Efficient top-k query calculation in distributed networks
- et al., Estimating flow distributions from sampled flow statistics
- et al., An Introduction to the Bootstrap (1993)
Edith Cohen received her B.Sc. and M.Sc. degrees from Tel Aviv University and her Ph.D. degree in Computer Science from Stanford University in 1991. In 1991 she joined AT&T Labs-Research (then AT&T Bell Laboratories) and has remained there ever since. During 1997, she spent time at UC Berkeley as a visiting professor. She serves on the editorial boards of JCSS and Algorithmica, and her research areas are the design and analysis of algorithms, combinatorial optimization, Web performance, networking, and data mining. She received the 2007 IEEE Communications Society William R. Bennett Prize.
Nadav Grossaug received his B.Sc. in Computer Science from the Hebrew University of Jerusalem and his M.Sc. in Computer Science from Tel Aviv University. This article presents part of Nadav's work towards his M.Sc. degree. Nadav works as a senior researcher and architect at Zoom Information Inc.
Haim Kaplan received his Ph.D. degree from Princeton University in 1997. He was a member of technical staff at AT&T Research from 1996 to 1999. Since 1999 he has been a Professor in the School of Computer Science at Tel Aviv University. His research interests are the design and analysis of algorithms and data structures.