Approximating sliding windows by cyclic tree-like histograms for efficient range queries☆
Introduction
Data streams are unbounded sequences of data that vary continuously over time, often very quickly. Data streams arise in many application contexts, where their on-line analysis may offer knowledge useful for strategic analyses and statistics. This is the case for network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and manufacturing, to mention just a few examples. Data-stream analysis can be carried out by means of different types of queries, which can be distinguished into (1) point queries [28], (2) range queries [12], [50], [67], and (3) similarity queries [18]. Referring to a data stream of packets crossing a router, examples of the three query types are, respectively: (1) “Return the size of the k-th packet of the data stream”, (2) “Return the total amount of traffic crossing the router in a given time interval”, and (3) “Return true if a pattern similar to a given dangerous pattern occurs in the data stream”.
Even though answering a query (of any type) requires storing the entire data stream, there are many situations (even in the field of network analysis mentioned above) where the analysis of the most recent part of the data stream is enough to give meaningful information. From a semantic point of view, this means giving more importance to recent knowledge than to past knowledge, assuming that recent information is more reliable and significant than older information [37]. Therefore, designing techniques able to process queries by considering a suitable portion of the data stream, called a sliding window, has assumed particular importance in the recent literature [2], [7], [13], [23], [38], [60], [61], [64].
However, for the sliding window itself to be significant, its size (i.e., the number of most recent elements kept at each instant) should be as large as possible. As a consequence, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is certainly relevant in the field of data streams [6]. Observe that reducing sliding windows also allows us to keep more than one approximate sliding window simultaneously, in order to implement similarity queries such as change mining queries [34], which are useful for trend analysis and, in general, for understanding the dynamics of the data stream itself. In sum, since only limited memory resources are available in a typical streaming environment [39], [59], reduction is a key factor that allows query processing even when multiple scans on data are needed.
There are a number of properties that a sliding window reduction technique should satisfy. First, the reduced sliding window should maintain the semantic nature of the original data, in such a way that meaningful queries can be submitted to the reduced data in place of the original ones. Then, for a given kind of query, the accuracy of the reduced structure should be independent of the position where the query is applied in order to provide the user with an analysis tool supporting arbitrary queries. In addition, the reduction technique should support drill-down and roll-up operations.
Even though the general properties stated above are valid for each type of query, often the approximation techniques are designed for a specific kind of query for which they show a good behavior in terms of efficiency and precision. In this work we focus our attention on range queries. It is worth noting that this kind of query is particularly important from the application point of view. Consider for example an intrusion detection system whose sensors are located at choke points and capture all network traffic. The increase of the traffic coming from a range of source IPs or having a particular range of destination ports can be a sign of an attack. Moreover, similar statistics can be used by a network monitoring system to detect faults and congestion or to improve network load balancing.
In this paper, we propose a histogram-based technique for reducing sliding windows which supports arbitrary range queries and satisfies all the above properties. Our histogram, called c-Tree, is based on a hierarchical structure (specifically, a tree), unlike traditional histograms. Its nodes contain, in an aggregation hierarchy, pre-computed range-sum queries, stored using a bit-saving encoding. For this reason, the structure directly supports the estimation of arbitrary range queries (in particular, range queries of type sum). Indeed, range queries are either embedded in the histogram or derivable by linear interpolation. Reduction derives both from the aggregation implemented by the leaves of the tree (discretization) and from the bits saved by representing range queries with fewer than 32 bits.
Our approach builds on a previous proposal presented in [14] and [16] for persistent data. However, the histograms presented in those papers are not applicable to data streams because they do not take into account the continuous updating of data. In contrast, the structure proposed here is efficiently dynamic, in the sense that each update can be executed in logarithmic time (w.r.t. the window size). Answering a range query likewise requires at most logarithmic time. Observe that the hierarchical structure directly supports querying at different abstraction levels, thus allowing drill-down and roll-up operations. Finally, bucket summarization smooths each data value by consulting the “neighborhood” values around it, thus helping to remove noise from the data. The main feature of our histogram, however, is its high accuracy. This is particularly important because, for the reduction technique to play a meaningful role in data analysis applications, the error should be either guaranteed or heuristically shown to be small (and this is our case).
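As a rough illustration of the ideas outlined in this introduction (bucket aggregation, pre-computed range sums arranged in a tree, cyclic reuse of buckets, and linear interpolation inside a bucket), the following Python sketch may help. It is not the c-Tree itself: the bit-saving encoding is omitted, the tree is an ordinary array-based sum tree, and the class name, method names, and parameters (`width`, `buckets`) are hypothetical choices made only for this example. Positions are assumed to be 0-based arrival indices.

```python
class CyclicTreeHistogram:
    """Illustrative sketch only: sum tree over a cyclic array of buckets."""

    def __init__(self, width, buckets):
        self.width = width               # stream elements per bucket (assumed fixed)
        self.b = buckets                 # number of buckets kept in memory
        self.tree = [0] * (2 * buckets)  # tree[b + slot] holds the sum of one bucket
        self.n = 0                       # number of elements seen so far

    def _add(self, slot, delta):
        # Propagate a change from a leaf up to the root: O(log b).
        i = self.b + slot
        while i >= 1:
            self.tree[i] += delta
            i //= 2

    def _sum(self, left, right):
        # Exact sum of the bucket slots left..right (inclusive): O(log b).
        res, lo, hi = 0, left + self.b, right + self.b + 1
        while lo < hi:
            if lo & 1:
                res += self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                res += self.tree[hi]
            lo >>= 1
            hi >>= 1
        return res

    def add(self, value):
        slot = (self.n // self.width) % self.b
        if self.n % self.width == 0:
            # A new bucket starts: evict the oldest bucket stored in this slot.
            self._add(slot, -self.tree[self.b + slot])
        self._add(slot, value)
        self.n += 1

    def _partial(self, k, lo, hi, cur):
        # Linear interpolation: values are assumed uniform inside bucket k.
        start = k * self.width
        filled = self.n - start if k == cur else self.width
        end = start + filled - 1
        overlap = min(hi, end) - max(lo, start) + 1
        return self.tree[self.b + k % self.b] * overlap / filled

    def range_sum(self, lo, hi):
        """Estimate the sum of the elements with arrival index lo..hi."""
        cur = (self.n - 1) // self.width
        oldest = max(0, cur - self.b + 1) * self.width
        if not (oldest <= lo <= hi < self.n):
            raise ValueError("range is outside the current window")
        first_b, last_b = lo // self.width, hi // self.width
        # Boundary buckets are interpolated, inner buckets are read exactly.
        total = sum(self._partial(k, lo, hi, cur) for k in {first_b, last_b})
        if last_b - first_b > 1:
            a, c = (first_b + 1) % self.b, (last_b - 1) % self.b
            if a <= c:
                total += self._sum(a, c)
            else:  # the inner buckets wrap around the cyclic array
                total += self._sum(a, self.b - 1) + self._sum(0, c)
        return total


h = CyclicTreeHistogram(width=2, buckets=4)
for v in range(1, 11):        # stream 1, 2, ..., 10
    h.add(v)
print(h.range_sum(3, 6))      # estimate for the values 4, 5, 6, 7
```

In this sketch both `add` and `range_sum` touch a number of tree nodes that is logarithmic in the number of buckets. Because the slot of the oldest bucket is reused as soon as a new bucket starts, the window actually covered varies between `width * (buckets - 1) + 1` and `width * buckets` elements. This is a simplification made for the example.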
Section snippets
Contributions and organization of the paper
The contributions of this work can be summarized as follows. We study a new hierarchical structure to approximate data streams by using a bit-saving encoding allowing good precision with little space consumption. This structure has been carefully analyzed in the paper by showing how its design is driven by theoretical considerations about the scaling error. Under this perspective, the paper gives interesting hints about the advantages of the hierarchical approach to summarizing data.
The second
Related work
Data stream reduction is a very important research issue and a large number of methods exploiting the sliding-window approach have been proposed. As motivated in the previous section, among the numerous papers existing in the literature in the field of data streams, we experimentally demonstrate the relevance of our approach by comparing it with a number of selected methods, namely [17], [41], [45], [48]. Therefore, we start by contextualizing these papers in the literature and then we briefly
Preliminaries
We use the following notation throughout the paper. We model a data stream D at the instant t as a finite data sequence $x_1, \dots, x_t$ of integer values, where $x_i$, with $1 \le i \le t$, is the value received at the instant i. Given an integer $1 \le w \le t$, a sliding window of size w on D at the instant t is the sequence $x_{t-w+1}, \dots, x_t$. Thus, a sliding window represents the sequence including only the w most recent values of the data stream. As in other approaches [17], [26], [27], [46], we assume that the sliding window
The c-Tree approach
We start the description of our proposal by illustrating briefly the architecture for continuous query processing over data streams we refer to. It is summarized in Fig. 1. Observe that this scheme is widely adopted in the literature [9], [31], [32], [42] and is composed of three modules.
The first one, named Synopsis Creator, receives elements from the data streams and has a limited amount of memory to maintain a concise synopsis for the last w points of each data stream. In contrast to
Advantages of the hierarchical approach
One of the contributions of this paper is to show how the hierarchical approach, used in the literature in various forms and different contexts [14], [17], [43], [56], can be profitably adopted to approximate data streams. In this section, we analyze an important aspect of this issue: we show that the hierarchical approach enhances the advantages offered by the bit-saving approach w.r.t. flat histograms. In order to demonstrate the above claim, we compare an n-level c-Tree with a
Experiments
In this section we report the results of a substantial number of experiments executed on both synthetic and real-life data sets to evaluate the performance of c-Tree and to compare it with three selected techniques. The significance of this choice is motivated in Section 3.
Besides c-Tree the examined techniques are (notations used throughout the section are reported in Table 3):
- HIST: the optimal histogram construction algorithm of [49]. We recall (see Section 3) that HIST provides
Conclusion and future work
Data stream reduction is an important issue because it makes approaches that require multiple scans on data practical: such scans can be performed over one or more reduced sliding windows. In many cases, the analysis requires estimating a range query involving data of the sliding window. To reach this goal, we designed a tree-like histogram that reduces sliding windows and supports fast approximate answers to arbitrary range queries. Our proposal has the important
Acknowledgment
This work was partially funded by the Italian Ministry of Research through the PRIN Project EASE (Entity Aware Search Engines).
References (71)
- et al. Deterministic algorithms for sampling count data. Data & Knowledge Engineering (2008)
- et al. Fast range query estimation by n-level tree histograms. Data & Knowledge Engineering (2004)
- et al. Maintaining time-decaying stream aggregates. Journal of Algorithms (2006)
- et al. An improved data stream summary: the count-min sketch and its applications. Journal of Algorithms (2005)
- et al. Multi-query optimization for sketch-based estimation. Information Systems (2009)
- et al. Xwave: approximate extended wavelets for streaming data
- et al. Rehist: relative error histogram construction algorithms
- et al. Histogram-by: a grouping operator for continuous domains. Data & Knowledge Engineering (2007)
- et al. Mining non-derivable frequent itemsets over data stream. Data & Knowledge Engineering (2009)
- et al. Approximate frequency counts over data streams
- On the testing for alpha-stable distributions of network traffic. Computer Communications
- Selectivity estimation of range queries based on data density approximation via cosine series. Data & Knowledge Engineering
- Processing partially specified queries over high-dimensional databases. Data & Knowledge Engineering
- Join synopses for approximate query answering
- On biased reservoir sampling in the presence of stream evolution
- The space complexity of approximating the frequency moments
- Incremental maintenance of online summaries over multiple streams. IEEE Trans. on Knowl. and Data Eng.
- Models and issues in data stream system
- Sampling from a moving window over streaming data
- Continuous queries over data stream. ACM SIGMOD Record
- Deterministic sampling and range counting in geometric data streams. ACM Trans. Algorithms
- Mercury: supporting scalable multi-attribute range queries
- Optimal sampling from sliding windows
- Reducing data stream sliding windows by cyclic tree-like histograms
- Enhancing histograms by tree-like bucket indices. The Very Large Data Bases Journal
- SWAT: hierarchical stream summarization in large networks
- A unified framework for monitoring data streams in real time
- An adaptive and scalable middleware for distributed indexing of data streams
- A deterministic algorithm for summarizing asynchronous streams over a sliding window
- Finding frequent items in data streams
- On random sampling over joins
- Sketching streams through the net: distributed approximate query tracking
- Histograms and wavelets on probabilistic data
Francesco Buccafurri is a full professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 1995 he took the PhD degree in computer science at the University of Calabria. His research interests include deductive-databases, knowledge-representation and non-monotonic reasoning, model checking, information security, data compression, data streams, agents, P2P systems. He has published several papers in top-level international journals and conference proceedings. He serves as a referee for international journals and he is a member of a number of conference PCs.
He is also included in the editorial board of a number of international journals and played the role of PC chair in some international conferences.
Gianluca Lax is an assistant professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 2005 he took the PhD degree in computer science at the University of Calabria. His research interests include data reduction, data streams, user modelling, P2P systems, e-commerce and information security. He is also author of a number of papers published in top-level international journals and conference proceedings.
☆ An abridged version of this paper appeared in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, J. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.): PKDD 2004, LNAI 3202, pp. 75–86, 2004. © Springer-Verlag Berlin Heidelberg 2004 [15].