Elsevier

Computer Networks

Volume 52, Issue 14, 9 October 2008, Pages 2605-2622

Processing top-k queries from samples

https://doi.org/10.1016/j.comnet.2008.04.021

Abstract

Top-k queries are an important class of aggregation operations on datasets. Examples of queries on network data include finding the top 100 source Autonomous Systems (AS), top 100 ports, or top domain names over IP packets or over IP flow records. Since the complete dataset is often not available or not feasible to examine, we are interested in processing top-k queries from samples.

If all records can be processed, the top-k items can be obtained by counting the frequency of each item. Even when the full dataset is observed, however, resources are often insufficient for such counting, so techniques have been developed to overcome this issue. When we can observe only a random sample of the records, an orthogonal complication arises: the top frequencies in the sample are biased estimates of the actual top-k frequencies. This bias depends on the distribution and must be accounted for when seeking the actual values.

We address this by designing and evaluating several schemes that derive rigorous confidence bounds for top-k estimates. Simulations on various datasets, including IP flow data, show that schemes exploiting more of the structure of the sample distribution produce much tighter confidence intervals, with an order of magnitude fewer samples, than simpler schemes that utilize only the sampled top-k frequencies. The simpler schemes, however, are more efficient in terms of computation.

Introduction

Top-k computation is an important data processing tool and constitutes a basic aggregation query. In many applications, it is not feasible to examine the whole dataset and therefore approximate query processing is performed using a random sample of the records [4], [8], [13], [20], [14], [1]. These applications arise when the dataset is massive or highly distributed [12], as is the case with IP packet traffic that is both distributed and sampled, and with Netflow records that are aggregated over sampled packet traces and collected distributively. Other applications arise when the value of the attribute we aggregate over is not readily available and determining it for a given record has an associated (computational or other) cost. For example, when we aggregate over the domain name that corresponds to a source or destination IP address, the domain name is obtained via a reverse DNS lookup, which we may want to perform only on a sample of the records.

A top-k query over some attribute is to determine the k most common values for this attribute and their frequencies (number of occurrences) over a set of records. Examples of such queries are to determine the top-100 Autonomous Systems (AS) destinations, the top-100 applications (web, p2p, other protocols), the 10 most popular Web sites, or the 20 most common domain names. These queries can be posed in terms of the number of IP packets (each packet is considered a record), the number of distinct IP flows (each distinct flow is considered a record), or another unit of interest. We are interested in processing top-k queries from a sample of the records, for example, from a sampled packet stream or from a sample of the set of distinct flows. We seek probabilistic or approximate answers that are provided with confidence intervals.

It is interesting to contrast top-k queries with proportion queries. A proportion query is to determine the frequency of a specified attribute value in a dataset. Examples of proportion queries are to estimate the fraction of IP packets or IP flows that belong to p2p applications, originate from a specific AS, or from a specific Web site.

Processing an approximate proportion query from a random sample is a basic and well understood statistical problem. The fraction of sampled records with the given attribute value is an unbiased estimator, and confidence intervals are obtained using standard methods.

Processing top-k queries from samples is more challenging. When the complete dataset is observed, we can compute the frequency of each value and take the top-k most frequent values. When we have a random sample of the records, the natural estimator is the result of performing the same action on the sample. That is, obtaining the k most frequent values in the sample and proportionally scaling them to estimate the frequencies of the top-k values in the real dataset. This estimator, however, is biased upwards: The expectation of the combined frequency of the top-k items in the sample is generally larger than the value of this frequency over the unsampled records. This is a consequence of the basic statistical property that the expectation of the maximum of a set of random variables is generally larger than the maximum of their expectations. While this bias must be accounted for when deriving confidence intervals and when evaluating the relation between the sampled and the actual top-k sets, it is not easy to capture as it depends on the fine structure of the full distribution of frequencies in the unsampled dataset, which is not available to us.
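To make this bias concrete, the following Python sketch (our own illustration, not part of the paper; the Zipf-like weights, sample size, and trial count are arbitrary assumptions) simulates weighted sampling from a toy distribution and compares the mean sampled top-k weight with the actual top-k weight.

```python
# Illustrative sketch (not from the paper): empirically demonstrates that the
# sampled top-k frequency overestimates the actual top-k frequency.
import random
from collections import Counter

def mean_sampled_topk_weight(weights, k, sample_size, trials=1000):
    """Average fraction of a sample captured by the sample's own top-k items."""
    items = list(weights)
    probs = [weights[i] for i in items]
    total = 0.0
    for _ in range(trials):
        sample = random.choices(items, weights=probs, k=sample_size)
        counts = Counter(sample)
        total += sum(c for _, c in counts.most_common(k)) / sample_size
    return total / trials

# Zipf-like toy distribution over 1000 items, normalized to sum to 1 (assumed).
raw = {i: 1.0 / (i + 1) for i in range(1000)}
norm = sum(raw.values())
weights = {i: w / norm for i, w in raw.items()}

k = 10
actual = sum(sorted(weights.values(), reverse=True)[:k])
estimate = mean_sampled_topk_weight(weights, k, sample_size=500)
print(f"actual top-{k} weight:        {actual:.3f}")
print(f"mean sampled top-{k} weight:  {estimate:.3f}  (biased upward)")
```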

In Sections 3 (Basic bounds for top-k sampling), 4 (The Naive confidence interval), 5 (Using a cumulative upper bound), 6 (Cross-validation methods), and 7 (Evaluation results), we devise and evaluate three basic methods to derive confidence intervals for top-k estimates. The main problem which we consider is to estimate the sum of the frequencies of the top-k values.

  • “Naive” bound: Let $\hat{f}$ be the sum of the frequencies of the top-k elements in the sample. We consider distributions (datasets) for which the probability that the sum of the frequencies of the top-k elements in a sample is at least $\hat{f}$ is at least δ. Among these distributions we look for those with the smallest sum of top-k frequencies, say this sum is x. We use x as the lower end of our confidence interval. By constructing the confidence interval this way we capture both the bias of the sampled top-k frequency and standard proportion error bounds. The definition of the Naive bound requires considering all distributions, which is not computationally feasible. To compute this interval, we identify a restricted set of distributions such that it suffices to consider only these distributions. We then construct a precomputed table providing the bound for the desired confidence level and sampled top-k frequency $\hat{f}$.

  • CUB bounds: We use the sample distribution to construct a cumulative upper bound (CUB) on the top-$i$ weight for all $i \ge 1$. We then use the CUB to restrict the set of distributions that must be taken into account in the lower bound construction. Therefore, we can potentially obtain tighter bounds than with the Naive approach. The CUB method, however, is computationally intensive, since we cannot use precomputed values.

  • Cross-validation bounds: We borrow terminology from hypothesis testing. The sample is split into two parts, one is the “learning” part and the other is the “testing” part. Let S be the sampled top-k set of the learning part. We use the sampled weight of S in the testing part to obtain the “lower end” of our confidence interval; a minimal sketch of this two-way split appears right after this list. We also consider variations of this method in which the sample is split into more parts.
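As a rough illustration of the two-way split described in the last bullet, the sketch below is an assumed outline, not the authors' exact procedure: it takes the top-k set of a "learning" half and lower-bounds its weight using its frequency in the "testing" half. The Clopper-Pearson binomial bound used for the proportion lower bound is one standard choice and is our assumption here.

```python
# Illustrative sketch of a two-way cross-validation lower bound; not the
# paper's exact algorithm. The sample is assumed to be a list of item
# identifiers drawn with probability proportional to item weight.
import random
from collections import Counter
from scipy.stats import beta  # assumes SciPy is available

def binomial_lower_bound(successes, n, delta):
    """One-sided (1 - delta)-confidence lower bound on a proportion
    (Clopper-Pearson style), standing in for a standard proportion bound."""
    if successes == 0:
        return 0.0
    return float(beta.ppf(delta, successes, n - successes + 1))

def cross_validation_lower_end(sample, k, delta):
    """Lower end of a confidence interval for the actual top-k weight."""
    sample = list(sample)
    random.shuffle(sample)
    half = len(sample) // 2
    learn, test = sample[:half], sample[half:]
    # Top-k set of the learning part ...
    top_set = {item for item, _ in Counter(learn).most_common(k)}
    # ... whose sampled weight in the testing part is an unbiased estimate
    # of its real weight, so a standard proportion bound applies.
    hits = sum(1 for x in test if x in top_set)
    return binomial_lower_bound(hits, len(test), delta)
```

Under these assumptions, a call such as `cross_validation_lower_end(sample, k=100, delta=0.05)` returns a value that lower-bounds the actual top-100 weight with probability roughly at least 0.95, since the weight of the learning top-k set is itself at most the actual top-k weight.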

We evaluate these methods on a collection of datasets that include IP traffic flow records collected from a large ISP and Web request data. We show (a precise characterization is provided in the sequel) that, in a sense, the hardest distributions, those with the worst confidence bounds for a given sampled top-k weight, are those with many large items that are close in size. Real-life distributions, however, are more Zipf-like, and therefore the cross-validation and CUB approaches can significantly outperform the Naive bounds. The Naive bounds, however, require the least amount of computation.

Most previous work addressed applications where the complete dataset can be observed [19], [6], [3], [17], [15] but resources are not sufficient to compute the exact frequency of each item. The challenge in this case is to find approximate most frequent items using limited storage or limited communication. Examples of such settings are a data stream, data that is distributed on multiple servers, distributed data streams [2], or data that resides on external memory. We address applications where we observe random samples rather than the complete dataset. The challenge is to estimate actual top frequencies from the available sample frequencies. These two settings are orthogonal. Our techniques and insights can be extended to a combined setting where the application observes a sample of the actual data and the available storage and communication do not allow us to obtain exact sample frequencies. We therefore need to first estimate sample frequencies from the observed sample, and then use these estimates to obtain estimates of the actual frequencies in the original dataset.

A problem related to the computation of top-k and heavy hitters is estimating the entire size distribution [16], [17] (estimating the number of items of a certain size, for all sizes). This is a more general problem than top-k and heavy hitters queries, and sampling can be quite inaccurate for estimating the complete size distribution [8] or even just the number of distinct items [4]. Clearly, sampling is too lossy for estimating the number of items with frequencies that are well under the sampling rate. The problem of finding top flows from sampled packet traffic was considered in [1], where empirical data was used to evaluate the number of samples required until the top-k set in the sample closely matches the top-k set in the actual distribution. Their work did not include methods to obtain confidence intervals. The performance metrics used in [1] are rank-based rather than weight-based. That is, the approximation quality is measured by the difference between the actual rank of a flow (e.g., 3rd largest in size) and its rank in the sampled trace (e.g., 10th largest in size), whereas our metrics are based on the weight (size) of each flow. That is, if two flows are of very similar size, our metric does not penalize for not ranking them properly with respect to each other as much as it would for two flows of different weights. As a result, the conclusion in [1], that a fairly high sampling rate is required, may not be applicable under weight-based metrics.

We are not aware of other work that focused on deriving confidence intervals for estimating the top-k frequencies and the heavy hitters from samples. Related work applied maximum likelihood (through the Expectation Maximization (EM) algorithm) to estimate the size distribution from samples [8], [17]. Unlike our schemes, these approaches do not provide rigorous confidence intervals.

Some work on distributed top-k was motivated by information retrieval applications and assumed sorted accesses to distributed index lists: each remote server maintains its own top-k list and these lists can only be accessed in this order. Algorithms developed in this model include the well known Threshold Algorithm (TA) [10], [11], TPUT [7], and algorithms with probabilistic guarantees [21]. In this model, the cost is measured by the number of sorted accesses. These algorithms are suited for applications where sorted accesses rather than random samples are readily available, as may be the case when the data is a list of results from a search engine.

An extended abstract of this paper has appeared in [5].

Section snippets

Preliminaries

Let $I$ be a set of items with weights $w(i) \ge 0$ for $i \in I$. For $J \subseteq I$, denote $w(J)=\sum_{i\in J}w(i)$. We denote by $T_i(J)$ (top-$i$ set) a set of the $i$ heaviest items in $J$, and by $B_i(J)$ (bottom-$i$ set) a set of the $i$ lightest items in $J$. We also denote by $\overline{W}_i(J)=w(T_i(J))$ the weight of the top-$i$ elements in $J$ and by $\underline{W}_i(J)=w(B_i(J))$ the weight of the bottom-$i$ elements in $J$.
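The following small helpers are our own transcription of the notation above (the function names are hypothetical, not from the paper); they compute the top-$i$ and bottom-$i$ weights of a weighted item set.

```python
# Illustrative helpers mirroring the notation above (names are hypothetical).
def top_i_weight(weights, i):
    """Weight of the top-i set: the total weight of the i heaviest items."""
    return sum(sorted(weights.values(), reverse=True)[:i])

def bottom_i_weight(weights, i):
    """Weight of the bottom-i set: the total weight of the i lightest items."""
    return sum(sorted(weights.values())[:i])

# Example: for J = {a: 5, b: 1, c: 3}, the top-2 weight is 8 and the
# bottom-2 weight is 4.
J = {"a": 5.0, "b": 1.0, "c": 3.0}
assert top_i_weight(J, 2) == 8.0
assert bottom_i_weight(J, 2) == 4.0
```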

We have access to weighted samples, where in each sample, the probability that an item is drawn is proportional to its weight. In the analysis and

Basic bounds for top-k sampling

When estimating a proportion, we use the fraction of positive examples in the sample as our estimator. Using the notation we introduced earlier, we can use the interval from $L(\hat{p},s,\delta)$ to $U(\hat{p},s,\delta)$ as a $2\delta$ confidence interval. It is also well understood how to obtain the number of samples needed for proportion estimation within some confidence and error bounds when the proportion is at least $p$.
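As one concrete realization of the proportion bounds $L(\hat{p},s,\delta)$ and $U(\hat{p},s,\delta)$, the sketch below uses exact binomial (Clopper-Pearson) bounds; this particular choice is our assumption and is not prescribed by the paper.

```python
# One possible realization of the proportion bounds L and U, assumed here to
# be exact binomial (Clopper-Pearson) bounds; other choices are possible.
from scipy.stats import beta  # assumes SciPy is available

def L(p_hat, s, delta):
    """(1 - delta)-confidence lower bound on a proportion, given the
    observed fraction p_hat in a sample of size s."""
    successes = round(p_hat * s)
    if successes == 0:
        return 0.0
    return float(beta.ppf(delta, successes, s - successes + 1))

def U(p_hat, s, delta):
    """(1 - delta)-confidence upper bound on a proportion."""
    successes = round(p_hat * s)
    if successes == s:
        return 1.0
    return float(beta.ppf(1.0 - delta, successes + 1, s - successes))

# The two one-sided bounds combine into a two-sided confidence interval.
print(L(0.3, 1000, 0.05), U(0.3, 1000, 0.05))
```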

When estimating the top-k weight from samples, we would like to derive confidence intervals and also to

The Naive confidence interval

Let $L_k(\hat{f},s,\delta)$ be the smallest $f$ such that there exists a distribution $F$, with top-k weight of $f$, such that when sampling $S$ of size $s$ we have $\mathrm{PROB}\{\overline{W}_k(S,F)\ge \hat{f}\}\ge\delta$. Similarly, let $U_k(\hat{f},s,\delta)$ be the largest $f$ such that there exists a distribution $F$ with top-k weight of $f$ such that when sampling $S$ of size $s$, $\mathrm{PROB}\{\overline{W}_k(S,F)\le \hat{f}\}\ge\delta$.

Let $S$ be a sample of $s$ items from a distribution $I$ and assume that $\hat{f}=\overline{W}_k(S,I)$. By a proof similar to that of Lemma 1 one can show that

Lemma 6

$U_k(\hat{f},s,\delta)$ is a $(1-\delta)$-confidence upper

Using a cumulative upper bound

The derivation of the CUB resembles that of the Naive bound. We use the same upper bound and the difference is in the computation of the lower bound. As with the Naive bound, we look for the distribution with the smallest top-k weight that is at least δ likely to produce the sampled top-k weight that we observe. The difference is that we do not use the sampled top-k weight only, but extract more information from the sample to further restrict the set of distributions we have to consider,

Cross-validation methods

In this section, we borrow some concepts from machine learning and use the method of cross-validation to obtain confidence bounds. Intuitively, the methods we present in this section are based on the following lemma.

Lemma 12

Let $S'$ and $S''$ be two subsets of the sample $S$ such that $S'\cap S''=\emptyset$. Let $T$ be the top-k subset in $S'$. Then the weight of $T$ in $S''$ is an unbiased estimate of the real weight, $w(T)$, of $T$.

Proof

In fact, $w(S'',T)$ is a random variable counting the fraction of successes when we toss a coin of bias $w(T)$, |S

Evaluation results

The algorithms were evaluated on all datasets, for top-100 and top-1, and confidence levels δ=0.1 and δ=0.01. In the evaluation we consider the tightness of the estimates and confidence intervals. For the heuristic r-fold with s we also consider correctness of the lower bounds.

Conclusion and future directions

We developed several rigorous methods to derive confidence intervals and estimators for approximate top-k weight and top-k set queries over a sample of the dataset. Our work provides basic statistical tools for applications that provide only sampled data. The methods we developed vary in the amount of computation required and in the tightness of the bounds. Generally, methods that are able to uncover and exploit more of the structure of the distribution which we sample provide tighter bounds,

References (21)

  • M. Charikar et al., Finding frequent items in data streams, Theor. Comput. Sci. (2004).

  • Y. Li et al., Improved bounds on the sample complexity of learning, J. Comput. Syst. Sci. (2001).

  • C. Barakat et al., Ranking flows from sampled traffic.

  • B. Babcock et al., Distributed top-k monitoring.

  • M. Charikar et al., Towards estimation error guarantees for distinct values.

  • E. Cohen et al., Processing top-k queries from samples.

  • G. Cormode et al., What’s hot and what’s not: tracking most frequent items dynamically.

  • P. Cao et al., Efficient top-k query calculation in distributed networks.

  • N. Duffield et al., Estimating flow distributions from sampled flow statistics.

  • B. Efron et al., An Introduction to the Bootstrap (1993).

Edith Cohen received her B.Sc. and M.Sc. degrees from Tel Aviv University and her Ph.D. degree in Computer Science from Stanford University in 1991. In 1991 she joined AT&T Labs-Research (then AT&T Bell Laboratories) and has remained there ever since. During 1997, she was a visiting professor at UC Berkeley. She serves on the editorial boards of JCSS and Algorithmica, and her research areas are the design and analysis of algorithms, combinatorial optimization, Web performance, networking, and data mining. She received the 2007 IEEE Communications Society William R. Bennett prize.

Nadav Grossaug received his B.Sc. in Computer Science from the Hebrew University in Jerusalem and his M.Sc. in Computer Science from Tel Aviv University. This article presents part of Nadav's work towards his M.Sc. degree. Nadav is working as a senior researcher and architect at Zoom Information Inc.

Haim Kaplan received his Ph.D. degree from Princeton University in 1997. He was a member of technical staff at AT&T Research from 1996 to 1999. Since 1999 he has been a Professor in the School of Computer Science at Tel Aviv University. His research interests are the design and analysis of algorithms and data structures.
