Elsevier

Data & Knowledge Engineering

Volume 69, Issue 9, September 2010, Pages 979-997
Approximating sliding windows by cyclic tree-like histograms for efficient range queries

https://doi.org/10.1016/j.datak.2010.05.002

Abstract

Providing fast approximate answers to range queries on sliding windows while consuming little storage space is one of the main challenges in the context of data streams. On the one hand, the importance of this class of queries is widely accepted: they are useful for computing aggregate information over the data stream, allowing us to extract more abstract knowledge from it than point queries do. On the other hand, synopsis techniques based on histograms, sketches, sampling, and so on make feasible those approaches that require multiple scans of the data, which would otherwise be computationally prohibitive. Among these techniques, histogram-based approaches are considered one of the most advantageous solutions, at least in the case of range queries, since histograms summarize data well while preserving quick and accurate answers to range queries. In this paper, we propose a novel histogram-based technique for reducing sliding windows that supports approximate arbitrary range-sum queries. Our histogram, which relies on a tree-based structure, directly supports hierarchical queries and, thus, drill-down and roll-up operations. In addition, the structure supports sliding window shifting and quick query answering, since both operate in time logarithmic in the sliding window size. A bit-saving approach to encoding tree nodes allows us to compress the sliding window at a small price in terms of accuracy. The contribution of this work is thus not only the proposal of a new specific technique to tackle an important problem but also an in-depth analysis of the advantages given by the hierarchical approach combined with the bit-saving strategy. A careful experimental analysis validates the method, showing its superiority w.r.t. the state of the art.

Introduction

Data streams are indefinite sequences of data which continuously vary in time, often very quickly. Many application contexts are characterized by the presence of data streams, and their on-line analysis may offer knowledge useful for strategic analyses and statistics. This is the case for network monitoring, sensor networks, financial applications, security, telecommunication data management, Web applications, and manufacturing, to mention just a few examples. Data-stream analysis can be carried out by means of different types of queries. Queries can be divided into (1) point queries [28], (2) range queries [12], [50], [67], and (3) similarity queries [18]. Referring to a data stream of packets crossing a router, examples of the three query types are, respectively: (1) “Return the size of the k-th packet of the data stream”, (2) “Return the total amount of traffic crossing the router in a given time interval”, and (3) “Return true if a pattern similar to a given dangerous pattern occurs in the data stream”.

Even though answering a query (of any type) exactly requires, in general, storing the entire data stream, there are a large number of situations (even in the field of network analysis mentioned above) where analyzing the most recent part of the data stream is enough to give meaningful information. From a semantic point of view, this means giving more importance to recent knowledge than to past knowledge, on the assumption that recent information is more reliable and significant than older information [37]. Therefore, designing techniques able to process queries by considering a suitable portion of the data stream, called a sliding window, has assumed particular importance in the recent literature [2], [7], [13], [23], [38], [60], [61], [64].

However, for the sliding window itself to be significant, its size (i.e., the number of most recent elements we keep at each instant) should be as large as possible. As a consequence, any technique capable of compressing sliding windows while maintaining a good approximate representation of the data distribution is certainly relevant in the field of data streams [6]. Observe that reducing sliding windows also allows us to keep more than one approximate sliding window simultaneously, in order to implement similarity queries such as change mining queries [34], which are useful for trend analysis and, in general, for understanding the dynamics of the data stream itself. In sum, since only limited memory resources are available in a typical streaming environment [39], [59], reduction is a key factor that allows query processing even when multiple scans of the data are required.

There are a number of properties that a sliding window reduction technique should satisfy. First, the reduced sliding window should maintain the semantic nature of the original data, in such a way that meaningful queries can be submitted to the reduced data in place of the original ones. Then, for a given kind of query, the accuracy of the reduced structure should be independent of the position where the query is applied in order to provide the user with an analysis tool supporting arbitrary queries. In addition, the reduction technique should support drill-down and roll-up operations.

Even though the general properties stated above are valid for each type of query, often the approximation techniques are designed for a specific kind of query for which they show a good behavior in terms of efficiency and precision. In this work we focus our attention on range queries. It is worth noting that this kind of query is particularly important from the application point of view. Consider for example an intrusion detection system whose sensors are located at choke points and capture all network traffic. The increase of the traffic coming from a range of source IPs or having a particular range of destination ports can be a sign of an attack. Moreover, similar statistics can be used by a network monitoring system to detect faults and congestion or to improve network load balancing.

In this paper, we propose a histogram-based technique for reducing sliding windows which supports arbitrary range queries and satisfies all the above properties. Our histogram, called c-Tree, unlike traditional ones, is based on a hierarchical structure, in particular a tree. Its nodes contain, in an aggregation hierarchy, pre-computed range-sum queries, stored using a bit-saving encoding. For this reason, the structure directly supports the estimation of arbitrary range queries (in particular, range queries of type sum). Indeed, range queries are either embedded in the histogram or derivable by linear interpolation. Reduction derives from both the aggregation implemented by the leaves of the tree (discretization) and the saving of bits obtained by representing range queries with fewer than 32 bits.
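To make the idea of pre-computed range sums arranged in an aggregation hierarchy more concrete, the following is a minimal sketch in Python. It is not the c-Tree itself: it omits both the cyclic sliding-window update and the bit-saving encoding, and the class name SumTree, its methods, and the sample data are all hypothetical names chosen for illustration. It only shows how a binary tree of bucket sums answers a range-sum query from O(log m) stored nodes, with linear interpolation inside a partially covered bucket.

```python
class SumTree:
    """Binary tree of bucket sums (illustrative sketch, not the paper's c-Tree)."""

    def __init__(self, data, bucket_size):
        assert len(data) % bucket_size == 0
        sums = [sum(data[i:i + bucket_size])
                for i in range(0, len(data), bucket_size)]
        m = len(sums)
        assert m & (m - 1) == 0, "number of buckets must be a power of two"
        self.bucket_size = bucket_size
        self.m = m
        # Heap layout: node k has children 2k and 2k+1; leaves sit at m..2m-1.
        self.tree = [0] * m + sums
        for k in range(m - 1, 0, -1):
            self.tree[k] = self.tree[2 * k] + self.tree[2 * k + 1]

    def _bucket_sum(self, lo, hi):
        """Exact sum of whole buckets lo..hi-1, visiting O(log m) nodes."""
        total = 0
        lo += self.m
        hi += self.m
        while lo < hi:
            if lo & 1:
                total += self.tree[lo]
                lo += 1
            if hi & 1:
                hi -= 1
                total += self.tree[hi]
            lo //= 2
            hi //= 2
        return total

    def _prefix(self, p):
        """Estimated sum of elements 0..p-1."""
        full, rest = divmod(p, self.bucket_size)
        est = self._bucket_sum(0, full)
        if rest:
            # Linear interpolation: assume values are uniform inside a bucket.
            est += self.tree[self.m + full] * rest / self.bucket_size
        return est

    def range_sum(self, i, j):
        """Estimated sum of elements i..j (0-based, inclusive)."""
        return self._prefix(j + 1) - self._prefix(i)


data = [4, 2, 6, 0, 1, 3, 5, 3]
tree = SumTree(data, bucket_size=2)
aligned = tree.range_sum(2, 5)    # bucket-aligned range: answered exactly
partial = tree.range_sum(1, 2)    # cuts two buckets: interpolated estimate
```

In this sketch a query whose endpoints fall on bucket boundaries is answered exactly, whereas a query that cuts through a bucket is only estimated; here the true sum of elements 1..2 is 8, while the interpolated estimate is 6.0.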

Our approach relies on a previous proposal presented in [14] and [16] for persistent data. However, the histograms presented in those papers are not applicable to data streams because they do not take into account the continuous updating of data. In contrast, the structure proposed here is efficiently dynamic, in the sense that each update can be executed in logarithmic time (w.r.t. the window size). In addition, answering a range query also requires at most logarithmic time. Observe that the hierarchical structure directly supports querying at different abstraction levels, thus allowing drill-down and roll-up operations. Finally, bucket summarization smooths each data value by consulting the “neighborhood” values around it, thus helping to remove noise from the data. The main feature of our histogram, however, is its high accuracy. This is particularly important since, in order for the reduction technique to have a meaningful role in data analysis applications, the error should be either guaranteed or heuristically shown to be small (and this is our case).

Contributions and organization of the paper

The contributions of this work can be summarized as follows. We study a new hierarchical structure to approximate data streams by using a bit-saving encoding allowing good precision with little space consumption. This structure has been carefully analyzed in the paper by showing how its design is driven by theoretical considerations about the scaling error. Under this perspective, the paper gives interesting hints about the advantages of the hierarchical approach to summarizing data.

The second

Related work

Data stream reduction is a very important research issue and a large number of methods exploiting the sliding-window approach have been proposed. As motivated in the previous section, among the numerous papers existing in the literature in the field of data streams, we experimentally demonstrate the relevance of our approach by comparing it with a number of selected methods, namely [17], [41], [45], [48]. Therefore, we start by contextualizing these papers in the literature and then we briefly

Preliminaries

We use the following notation throughout the paper. We model a data stream D at the instant t as a finite data sequence $x_1, \ldots, x_t$ of integer values, where $x_i$ with $1 \le i \le t$ is the value received at the instant i. Given an integer $1 \le w \le t$, a sliding window of size w on D at the instant t is the sequence $x_{t-w+1}, \ldots, x_t$. Thus, a sliding window represents the sequence including only the w most recent values of the data stream. Like in other approaches [17], [26], [27], [46], we assume that the sliding window
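As a point of reference for the definitions above, the following minimal Python sketch keeps an exact (uncompressed) sliding window of the w most recent values and answers a range-sum query on it. The class name SlidingWindow, its methods, and the sample stream are hypothetical and serve only to illustrate the notation; this is the baseline that a reduction technique approximates, not the reduced structure itself.

```python
from collections import deque


class SlidingWindow:
    """Keeps only the w most recent values of a stream (exact, uncompressed)."""

    def __init__(self, w):
        # A bounded deque drops the oldest value automatically when full.
        self.values = deque(maxlen=w)

    def push(self, x):
        """Receive the next stream value."""
        self.values.append(x)

    def range_sum(self, i, j):
        """Sum of window positions i..j (0-based, inclusive; 0 is the oldest)."""
        window = list(self.values)
        return sum(window[i:j + 1])


win = SlidingWindow(w=4)
for x in [5, 1, 7, 3, 2, 8]:      # stream values x_1..x_6
    win.push(x)
# The window now holds x_3..x_6, i.e. the 4 most recent values.
total = win.range_sum(1, 2)       # sum of the second and third window values
```

This exact representation needs space linear in w, which is precisely the cost that a compressed synopsis of the window is meant to reduce.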

The c-Tree approach

We start the description of our proposal by illustrating briefly the architecture for continuous query processing over data streams we refer to. It is summarized in Fig. 1. Observe that this scheme is widely adopted in the literature [9], [31], [32], [42] and is composed of three modules.

The first one, named Synopsis Creator, receives elements from the data streams and has a limited amount of memory to maintain a concise synopsis for the last w points of each data stream. In contrast to

Advantages of the hierarchical approach

One of the contributions of this paper is showing how the hierarchical approach, used in the literature in various forms and different contexts [14], [17], [43], [56], can be profitably adopted to approximate data streams. In this section, we analyze an important aspect of this issue: we show that the hierarchical approach enhances the advantages given by the bit-saving approach w.r.t. flat histograms. In order to demonstrate the above claim, we compare an n-level c-Tree with a

Experiments

In this section we report the results of a substantial number of experiments executed on both synthetic and real-life data sets to evaluate the performance of c-Tree, with the purpose of comparing it with three selected techniques. The significance of this choice is motivated in Section 3.

Besides c-Tree the examined techniques are (notations used throughout the section are reported in Table 3):

  • HIST: the optimal histogram construction algorithm of [49]. We recall (see Section 3) that HIST provides

Conclusion and future work

Data stream reduction is an important issue since it makes feasible approaches that require multiple scans of the data, which can then be performed over one or more reduced sliding windows. In many cases, the analysis requires estimating a range query involving data of the sliding window. In order to reach this goal, we designed a tree-like histogram used for reducing sliding windows and supporting fast approximate answers to arbitrary range queries. Our proposal has the important

Acknowledgment

This work was partially funded by the Italian Ministry of Research through the PRIN Project EASE (Entity Aware Search Engines).

References (71)

  • G. Xiaohu et al.

    On the testing for alpha-stable distributions of network traffic

    Computer Communications

    (2004)
  • F. Yan et al.

    Selectivity estimation of range queries based on data density approximation via cosine series

    Data & Knowledge Engineering

    (2007)
  • B. Yu et al.

    Processing partially specified queries over high-dimensional databases

    Data & Knowledge Engineering

    (2007)
  • S. Acharya et al.

    Join synopses for approximate query answering

  • C. Aggarwal et al.
  • C.C. Aggarwal

    On biased reservoir sampling in the presence of stream evolution

  • N. Alon et al.

    The space complexity of approximating the frequency moments

  • F. Altiparmak et al.

    Incremental maintenance of online summaries over multiple streams

    IEEE Trans. on Knowl. and Data Eng.

    (2008)
  • B. Babcock et al.

    Models and issues in data stream systems

  • B. Babcock et al.

    Sampling from a moving window over streaming data

  • S. Babu et al.

    Continuous queries over data streams

    ACM SIGMOD Record

    (2001)
  • A. Bagchi et al.

    Deterministic sampling and range counting in geometric data streams

    ACM Trans. Algorithms

    (2007)
  • BC — Ethernet Traces of LAN and WAN Traffic....
  • A.R. Bharambe et al.

    Mercury: supporting scalable multi-attribute range queries

  • V. Braverman et al.

    Optimal sampling from sliding windows

  • F. Buccafurri et al.

    Reducing data stream sliding windows by cyclic tree-like histograms

  • F. Buccafurri et al.

    Enhancing histograms by tree-like bucket indices

    The Very Large Data Bases Journal

    (2008)
  • A. Bulut et al.

    SWAT: hierarchical stream summarization in large networks

  • A. Bulut et al.

    A unified framework for monitoring data streams in real time

  • A. Bulut et al.

    An adaptive and scalable middleware for distributed indexing of data streams

  • C. Busch et al.

    A deterministic algorithm for summarizing asynchronous streams over a sliding window

  • M. Charikar et al.

    Finding frequent items in data streams

  • S. Chaudhuri et al.

    On random sampling over joins

  • G. Cormode et al.

    Sketching streams through the net: distributed approximate query tracking

  • G. Cormode et al.

    Histograms and wavelets on probabilistic data

    Francesco Buccafurri is a full professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 1995 he took the PhD degree in computer science at the University of Calabria. His research interests include deductive-databases, knowledge-representation and non-monotonic reasoning, model checking, information security, data compression, data streams, agents, P2P systems. He has published several papers in top-level international journals and conference proceedings. He serves as a referee for international journals and he is a member of a number of conference PCs.

    He is also included in the editorial board of a number of international journals and played the role of PC chair in some international conferences.

    Gianluca Lax is an assistant professor of computer science at the University “Mediterranea” of Reggio Calabria, Italy. In 2005 he took the PhD degree in computer science at the University of Calabria. His research interests include data reduction, data streams, user modelling, P2P systems, e-commerce and information security. He is also author of a number of papers published in top-level international journals and conference proceedings.

    A shorter abridged version of this paper appeared in Proceedings of the 8th European Conference on Principles and Practice of Knowledge Discovery in Databases, J. Boulicaut, F. Esposito, F. Giannotti, D. Pedreschi (Eds.): PKDD 2004, LNAI 3202, pp. 75–86, 2004. © Springer-Verlag Berlin Heidelberg 2004 [15].
