Approximating sliding windows by cyclic tree-like histograms for efficient range queries

doi:10.1016/j.datak.2010.05.002

Data & Knowledge Engineering

Volume 69, Issue 9, September 2010, Pages 979-997

https://doi.org/10.1016/j.datak.2010.05.002

Abstract

Providing fast approximate answers to range queries on sliding windows while consuming little storage space is one of the main challenges in the context of data streams. On the one hand, the importance of this class of queries is widely accepted: they are useful for computing aggregate information over the data stream, allowing us to extract more abstract knowledge from it than point queries do. On the other hand, synopsis techniques based on histograms, sketches, sampling, and so on make feasible those approaches that require multiple scans of the data, which would otherwise be computationally prohibitive. Among these techniques, histogram-based approaches are considered one of the most advantageous solutions, at least in the case of range queries, because histograms summarize data well while preserving quick and accurate answers to range queries. In this paper, we propose a novel histogram-based technique for reducing sliding windows that supports approximate answers to arbitrary range-sum queries. Our histogram, relying on a tree-based structure, directly supports hierarchical queries and, thus, drill-down and roll-up operations. In addition, the structure supports sliding window shifting and quick query answering, since it operates in time logarithmic in the sliding window size. A bit-saving approach to encoding tree nodes allows us to compress the sliding window at a small cost in accuracy. The contribution of this work is thus not only the proposal of a new specific technique to tackle an important problem but also an in-depth analysis of the advantages given by the hierarchical approach combined with the bit-saving strategy. A careful experimental analysis validates the method, showing its superiority w.r.t. the state of the art.

Introduction

Data streams are unbounded sequences of data which vary continuously in time, often very quickly. There are many application contexts characterized by the presence of data streams, and their on-line analysis may offer knowledge useful for strategic analyses and statistics. This is the case of network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and manufacturing, just to mention some examples. Data-stream analysis can be done by means of different types of queries, which can be classified as (1) point queries [28], (2) range queries [12], [50], [67], and (3) similarity queries [18]. Referring to a data stream of packets crossing a router, examples of the three query types are, respectively: (1) “Return the size of the k-th packet of the data stream”, (2) “Return the total amount of traffic crossing the router in a given time interval”, and (3) “Return true if a pattern similar to a given dangerous pattern occurs in the data stream”.
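The three query types can be illustrated on a small in-memory list of packet sizes. This is only a sketch under simplifying assumptions that are not in the paper: packets are indexed by arrival position rather than by timestamp, and "similar" is taken to mean that every packet of a contiguous run is within a fixed tolerance of the pattern. The function names are hypothetical.

```python
from typing import Sequence


def point_query(sizes: Sequence[int], k: int) -> int:
    """Size of the k-th packet (1-based, as in the text)."""
    return sizes[k - 1]


def range_query(sizes: Sequence[int], start: int, end: int) -> int:
    """Total traffic carried by packets start..end (1-based, inclusive)."""
    return sum(sizes[start - 1:end])


def similarity_query(sizes: Sequence[int], pattern: Sequence[int],
                     tolerance: int) -> bool:
    """True if some contiguous run of packets stays within `tolerance`
    of `pattern` at every position (an assumed notion of similarity)."""
    m = len(pattern)
    for i in range(len(sizes) - m + 1):
        if all(abs(sizes[i + j] - pattern[j]) <= tolerance for j in range(m)):
            return True
    return False


packet_sizes = [40, 1500, 1500, 60, 40, 1480, 1500]
print(point_query(packet_sizes, 2))
print(range_query(packet_sizes, 2, 4))
print(similarity_query(packet_sizes, [1490, 1490], 20))
```

All three functions here scan the raw data; the rest of the paper is concerned with answering the second kind of query from a compressed summary instead.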

Even though answering a query (of any type) in principle requires storing the entire data stream, there are many situations (even in the field of network analysis mentioned above) where analyzing the most recent part of the data stream is enough to give meaningful information. From a semantic point of view, this means giving more importance to recent knowledge than to past knowledge, assuming that recent information is more reliable and significant than older information [37]. Therefore, designing techniques able to process queries by considering a suitable portion of the data stream, called a sliding window, has assumed particular importance in the recent literature [2], [7], [13], [23], [38], [60], [61], [64].

However, for the sliding window itself to be significant, its size (i.e., the number of most recent elements kept at each instant) should be as large as possible. As a consequence, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is certainly relevant in the field of data streams [6]. Observe that reducing sliding windows also allows us to keep more than one approximate sliding window simultaneously, in order to implement similarity queries such as change mining queries [34], which are useful for trend analysis and, in general, for understanding the dynamics of the data stream itself. In sum, since only limited memory resources are available in a typical streaming environment [39], [59], reduction is a key factor allowing query processing also when multiple scans of the data are needed.

There are a number of properties that a sliding window reduction technique should satisfy. First, the reduced sliding window should maintain the semantic nature of the original data, in such a way that meaningful queries can be submitted to the reduced data in place of the original ones. Then, for a given kind of query, the accuracy of the reduced structure should be independent of the position where the query is applied in order to provide the user with an analysis tool supporting arbitrary queries. In addition, the reduction technique should support drill-down and roll-up operations.

Even though the general properties stated above are valid for each type of query, often the approximation techniques are designed for a specific kind of query for which they show a good behavior in terms of efficiency and precision. In this work we focus our attention on range queries. It is worth noting that this kind of query is particularly important from the application point of view. Consider for example an intrusion detection system whose sensors are located at choke points and capture all network traffic. The increase of the traffic coming from a range of source IPs or having a particular range of destination ports can be a sign of an attack. Moreover, similar statistics can be used by a network monitoring system to detect faults and congestion or to improve network load balancing.

In this paper, we propose a histogram-based technique for reducing sliding windows which supports arbitrary range queries and satisfies all the above properties. Our histogram, called c-Tree, is based, unlike traditional ones, on a hierarchical structure, in particular a tree. Its nodes contain, in an aggregation hierarchy, pre-computed range-sum queries, stored with a bit-saving encoding. For this reason, the structure directly supports the estimation of arbitrary range queries (in particular, range queries of type sum): a range query is either embedded in the histogram or derivable by linear interpolation. Reduction derives from both the aggregation implemented by the leaves of the tree (discretization) and the saving of bits obtained by representing range queries with fewer than 32 bits.
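The idea of a hierarchy of pre-computed sums combined with linear interpolation can be sketched as follows. This is not the c-Tree itself: it is a generic binary tree of bucket sums over a fixed window, it stores full-precision integers (the bit-saving encoding is omitted), and the class name `TreeHistogram`, the equal-width buckets, and the 0-based indexing are all assumptions made for illustration.

```python
from typing import List, Sequence


class TreeHistogram:
    """Binary tree of pre-computed sums over equal-width buckets."""

    def __init__(self, values: Sequence[int], bucket_width: int) -> None:
        if not values or bucket_width < 1 or len(values) % bucket_width:
            raise ValueError("window size must be a positive multiple of bucket_width")
        self.width = bucket_width
        self.size = len(values)
        self.b = len(values) // bucket_width  # number of leaves (buckets)
        # Heap layout: node i has children 2i and 2i+1; leaves start at index b.
        self.tree: List[int] = [0] * (2 * self.b)
        for k in range(self.b):
            lo = k * bucket_width
            self.tree[self.b + k] = sum(values[lo:lo + bucket_width])
        for node in range(self.b - 1, 0, -1):
            self.tree[node] = self.tree[2 * node] + self.tree[2 * node + 1]

    def _prefix_buckets(self, k: int) -> int:
        """Exact sum of the first k buckets, using O(log b) stored nodes."""
        total, left, right = 0, self.b, self.b + k
        while left < right:
            if left & 1:
                total += self.tree[left]
                left += 1
            if right & 1:
                right -= 1
                total += self.tree[right]
            left >>= 1
            right >>= 1
        return total

    def _prefix(self, p: int) -> float:
        """Estimated sum of the first p values of the window."""
        full, rem = divmod(p, self.width)
        estimate = self._prefix_buckets(full)
        if rem:
            # Linear interpolation inside the partially covered bucket.
            estimate += self.tree[self.b + full] * rem / self.width
        return estimate

    def range_sum(self, i: int, j: int) -> float:
        """Estimated sum of values[i..j] (0-based, inclusive)."""
        if not 0 <= i <= j < self.size:
            raise IndexError("range outside the window")
        return self._prefix(j + 1) - self._prefix(i)


hist = TreeHistogram([1, 5, 2, 2, 0, 8, 3, 3], bucket_width=2)
print(hist.range_sum(2, 5))  # aligned with bucket boundaries: exact
print(hist.range_sum(1, 2))  # cuts two buckets: interpolated estimate
```

A query aligned with bucket boundaries is answered exactly from stored nodes, while a query that cuts through a bucket is only estimated; in the example the second query returns 5.0 although the true sum is 7.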

Our approach relies on a previous proposal presented in [14] and [16] for persistent data. However, the histograms presented in those papers are not applicable to data streams because they do not take into account the continuous updating of data. In contrast, the structure proposed here is efficiently dynamic, in the sense that each update can be executed in logarithmic time (w.r.t. the window size). In addition, answering a range query requires at most logarithmic time too. Observe that the hierarchical structure directly supports querying at different abstraction levels, thus allowing drill-down and roll-up operations. Finally, bucket summarization smooths each data value by consulting the “neighborhood” values around it, thus helping to remove noise from the data. The main feature of our histogram, however, is its high accuracy. This is particularly important because, for the reduction technique to have a meaningful role in data analysis applications, the error should be either guaranteed or heuristically shown to be small (as in our case).
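The excerpt does not describe the update procedure, so the following is only a generic sketch of how a tree of bucket sums can be kept current in logarithmic time: the leaves are reused cyclically, the oldest bucket is overwritten when a new one starts, and only the nodes on one leaf-to-root path are touched per arriving value. The class `CyclicSumTree` and its parameters are hypothetical, and the covered window is only approximately fixed in size (between `(num_buckets - 1) * bucket_width + 1` and `num_buckets * bucket_width` values once the ring is full).

```python
from typing import List


class CyclicSumTree:
    """Sum tree whose leaves form a ring of buckets over the newest values."""

    def __init__(self, num_buckets: int, bucket_width: int) -> None:
        if num_buckets < 1 or bucket_width < 1:
            raise ValueError("num_buckets and bucket_width must be positive")
        self.b = num_buckets
        self.width = bucket_width
        # Heap layout: node i has children 2i and 2i+1; leaves start at index b.
        self.tree: List[int] = [0] * (2 * num_buckets)
        self.current = 0  # leaf currently being filled
        self.filled = 0   # number of values already stored in that leaf

    def _add_to_leaf(self, leaf: int, delta: int) -> None:
        """Add delta to a leaf and to all of its ancestors: O(log b)."""
        node = self.b + leaf
        while node >= 1:
            self.tree[node] += delta
            node //= 2

    def append(self, x: int) -> None:
        """Insert a new stream value, evicting the oldest bucket if needed."""
        if self.filled == self.width:
            # Current bucket is full: move to the next leaf in cyclic order
            # and clear it, which drops the oldest bucket from the window.
            self.current = (self.current + 1) % self.b
            self._add_to_leaf(self.current, -self.tree[self.b + self.current])
            self.filled = 0
        self._add_to_leaf(self.current, x)
        self.filled += 1

    def total(self) -> int:
        """Sum of all values currently covered by the ring."""
        return self.tree[1]


ring = CyclicSumTree(num_buckets=4, bucket_width=2)
for value in range(1, 10):  # stream 1, 2, ..., 9
    ring.append(value)
print(ring.total())  # sum of 3..9: the bucket holding 1 and 2 was evicted
```

Evicting a whole bucket at a time is what keeps each update to a single path in the tree; the price is that the window boundary moves in steps of one bucket rather than one value.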

Section snippets

Contributions and organization of the paper

The contributions of this work can be summarized as follows. We study a new hierarchical structure to approximate data streams by using a bit-saving encoding allowing good precision with little space consumption. This structure has been carefully analyzed in the paper by showing how its design is driven by theoretical considerations about the scaling error. Under this perspective, the paper gives interesting hints about the advantages of the hierarchical approach to summarizing data.

The second

Related work

Data stream reduction is a very important research issue and a large number of methods exploiting the sliding-window approach have been proposed. As motivated in the previous section, among the numerous papers existing in the literature in the field of data streams, we experimentally demonstrate the relevance of our approach by comparing it with a number of selected methods, namely [17], [41], [45], [48]. Therefore, we start by contextualizing these papers in the literature and then we briefly

Preliminaries

We use the following notation throughout the paper. We model a data stream D at the instant t as a finite data sequence x_1, …, x_t of integer values, where x_i with 1 ≤ i ≤ t is the value received at the instant i. Given an integer 1 ≤ w ≤ t, a sliding window of size w on D at the instant t is the sequence x_{t − w + 1}, …, x_t. Thus, a sliding window represents the sequence including only the w most recent values of the data stream. As in other approaches [17], [26], [27], [46], we assume that the sliding window
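The definition of a sliding window can be made concrete with an uncompressed reference implementation that keeps exactly the w most recent values and answers range-sum queries exactly. It is a baseline sketch only; the class name `SlidingWindow` and the use of 1-based stream positions i..j as query arguments are assumptions for illustration.

```python
from collections import deque


class SlidingWindow:
    """Exact sliding window x_{t-w+1}, ..., x_t over an integer stream."""

    def __init__(self, w: int) -> None:
        if w < 1:
            raise ValueError("window size must be at least 1")
        self.w = w
        self.values = deque(maxlen=w)  # silently drops the oldest value
        self.t = 0                     # number of values received so far

    def append(self, x: int) -> None:
        """Receive x_{t+1} and advance the window by one position."""
        self.values.append(x)
        self.t += 1

    def range_sum(self, i: int, j: int) -> int:
        """Exact sum of x_i, ..., x_j (1-based stream positions)."""
        first = self.t - len(self.values) + 1  # oldest position still stored
        if not first <= i <= j <= self.t:
            raise IndexError("range is not inside the current window")
        window = list(self.values)
        return sum(window[i - first:j - first + 1])


win = SlidingWindow(w=4)
for x in [10, 20, 30, 40, 50, 60]:
    win.append(x)
print(list(win.values))     # values at stream positions 3..6
print(win.range_sum(4, 5))  # x_4 + x_5
```

This baseline needs space linear in w and time linear in the query length, which is the cost a compressed summary aims to avoid.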

The c-Tree approach

We start the description of our proposal by illustrating briefly the architecture for continuous query processing over data streams we refer to. It is summarized in Fig. 1. Observe that this scheme is widely adopted in the literature [9], [31], [32], [42] and is composed of three modules.

The first one, named Synopsis Creator, receives elements from the data streams and has a limited amount of memory to maintain a concise synopsis for the last w points of each data stream. In contrast to

Advantages of the hierarchical approach

One of the contributions of this paper is showing how the hierarchical approach used in the literature in various forms and different contexts [14], [17], [43], [56] can be profitably adopted to approximate data streams. In this section, we analyze an important aspect related to the above issue, since we show that the hierarchical approach enhances the advantages given by the bit-saving approach w.r.t. flat histograms. In order to demonstrate the above claim, we compare an n-level c-Tree with a
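The excerpt does not give the encoding itself. One plausible bit-saving scheme, assumed here purely for illustration, stores a child's sum as a small integer code expressing its fraction of the parent's sum. Under that assumption the worst-case rounding error is proportional to the parent's sum, so nodes deeper in the hierarchy, whose parents cover less data, are stored with a finer absolute resolution for the same number of bits. The function names are hypothetical.

```python
def encode_child(child_sum: int, parent_sum: int, bits: int) -> int:
    """Quantize child_sum as a `bits`-bit fraction of parent_sum."""
    if parent_sum == 0:
        return 0
    levels = (1 << bits) - 1  # largest code representable with `bits` bits
    return round(child_sum * levels / parent_sum)


def decode_child(code: int, parent_sum: float, bits: int) -> float:
    """Recover an approximation of the child's sum from its code."""
    levels = (1 << bits) - 1
    return code * parent_sum / levels


def max_rounding_error(parent_sum: float, bits: int) -> float:
    """Worst-case absolute error of one encode/decode round trip."""
    levels = (1 << bits) - 1
    return parent_sum / (2 * levels)


code = encode_child(337, 1000, bits=6)
approx = decode_child(code, 1000, bits=6)
print(code, approx, abs(approx - 337))
print(max_rounding_error(1000, 6), max_rounding_error(100, 6))
```

In a full tree the decoded value of a node would itself be used as the parent sum for its children, so errors can accumulate along a root-to-leaf path; the sketch only shows a single level.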

Experiments

In this section we report the results of a substantial number of experiments executed on both synthetic and real-life data sets to evaluate the performance of c-Tree and to compare it with three selected techniques. The significance of this choice is motivated in Section 3.

Besides c-Tree the examined techniques are (notations used throughout the section are reported in Table 3):

• HIST: the optimal histogram construction algorithm of [49]. We recall (see Section 3) that HIST provides

Conclusion and future work

Data stream reduction is an important issue since it makes feasible approaches that require multiple scans of the data, which can thus be performed over one or more reduced sliding windows. In many cases, analysis requires estimating a range query involving data of the sliding window. In order to reach this goal, we designed a tree-like histogram used for reducing sliding windows and supporting fast approximate answers to arbitrary range queries. Our proposal has the important

Acknowledgment

This work was partially funded by the Italian Ministry of Research through the PRIN Project EASE (Entity Aware Search Engines).

Francesco Buccafurri is a full professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. He received his PhD degree in computer science from the University of Calabria in 1995. His research interests include deductive databases, knowledge representation and non-monotonic reasoning, model checking, information security, data compression, data streams, agents, and P2P systems. He has published several papers in top-level international journals and conference proceedings. He

References (71)

H. Akcan et al. Deterministic algorithms for sampling count data. Data Knowl. Eng. (2008)
F. Buccafurri et al. Fast range query estimation by n-level tree histograms. Data & Knowledge Engineering (2004)
E. Cohen et al. Maintaining time-decaying stream aggregates. J. Algorithms (2006)
G. Cormode et al. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms (2005)
A. Dobra et al. Multi-query optimization for sketch-based estimation. Information Systems (2009)
S. Guha et al. Xwave: approximate extended wavelets for streaming data
S. Guha et al. Rehist: relative error histogram construction algorithms
S. Hong et al. Histogram-by: a grouping operator for continuous domains. Data & Knowledge Engineering (2007)
H. Li et al. Mining non-derivable frequent itemsets over data stream. Data Knowl. Eng. (2009)
G.S. Manku et al. Approximate frequency counts over data streams
G. Xiaohu et al. On the testing for alpha-stable distributions of network traffic. Computer Communications (2004)
F. Yan et al. Selectivity estimation of range queries based on data density approximation via cosine series. Data & Knowledge Engineering (2007)
B. Yu et al. Processing partially specified queries over high-dimensional databases. Data & Knowledge Engineering (2007)
S. Acharya et al. Join synopses for approximate query answering
C. Aggarwal et al.
C.C. Aggarwal. On biased reservoir sampling in the presence of stream evolution
N. Alon et al. The space complexity of approximating the frequency moments
F. Altiparmak et al. Incremental maintenance of online summaries over multiple streams. IEEE Trans. on Knowl. and Data Eng. (2008)
B. Babcock et al. Models and issues in data stream system
B. Babcock et al. Sampling from a moving window over streaming data
S. Babu et al. Continuous queries over data stream. ACM SIGMOD Record (2001)
A. Bagchi et al. Deterministic sampling and range counting in geometric data streams. ACM Trans. Algorithms (2007)
BC — Ethernet Traces of LAN and WAN Traffic....

A.R. Bharambe et al. Mercury: supporting scalable multi-attribute range queries
V. Braverman et al. Optimal sampling from sliding windows
F. Buccafurri et al. Reducing data stream sliding windows by cyclic tree-like histograms
F. Buccafurri et al. Enhancing histograms by tree-like bucket indices. The Very Large Data Bases Journal (2008)
A. Bulut et al. SWAT: hierarchical stream summarization in large networks
A. Bulut et al. A unified framework for monitoring data streams in real time
A. Bulut et al. An adaptive and scalable middleware for distributed indexing of data streams
C. Busch et al. A deterministic algorithm for summarizing asynchronous streams over a sliding window
M. Charikar et al. Finding frequent items in data streams
S. Chaudhuri et al. On random sampling over joins
G. Cormode et al. Sketching streams through the net: distributed approximate query tracking
G. Cormode et al. Histograms and wavelets on probabilistic data

He is also on the editorial board of a number of international journals and has served as PC chair of some international conferences.

Gianluca Lax is an assistant professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. He received his PhD degree in computer science from the University of Calabria in 2005. His research interests include data reduction, data streams, user modelling, P2P systems, e-commerce and information security. He is also the author of a number of papers published in top-level international journals and conference proceedings.

^☆: A shorter abridged version of this paper appeared in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, J. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.): PKDD 2004, LNAI 3202, pp. 75–86, 2004. © Springer-Verlag Berlin Heidelberg 2004 [15].
