Dynamic adaptive data structures for monitoring data streams

https://doi.org/10.1016/j.datak.2007.12.006

Abstract

The monitoring of data streams is an important issue in many different areas. Aspects such as accuracy, speed of response, memory use and adaptability to the changing nature of the data may vary in importance depending on the situation. Web page access monitoring, approximate aggregation in relational queries and IP message routing are clear examples of the varied range of those needs.

There are different data structures that deal with this problem such as the counting bloom filters, the spectral bloom filters and the dynamic count filters. Those data structures range from static to complex dynamic representations of the data stream that keep an approximate count of the number of occurrences for each data value.

In this paper, we focus on three main aspects. First, we analyze the problem in perspective and review the existing static and dynamic solutions. Second, we propose and analyze in depth a simple yet powerful partitioning strategy that reinforces the advantages of the methods proposed up to now while solving most of their drawbacks. Finally, using real executions and mathematical models, we evaluate the existing methods alone and in combination with our partitioning strategy. We show that, with our partitioning strategy, it is possible to reduce the memory requirements and the average response time, improving the adaptiveness to changing data characteristics and leaving the accuracy of the partitioned dynamic data structures intact.

Introduction

Monitoring streamed data is very important in many different scenarios. Examples of such cases are: telecom companies or Internet service providers that need to collect exact or approximate statistics of the access to the services they provide; the large amount of streamed data collected in retail-chain transaction processes that has to be monitored for statistical records; and, in the database management context, cases where it is necessary to approximate aggregate answers to user queries over data stream windows [4], [16], [20], [19], or to estimate the data frequency peaks or join sizes [2], [3].

Thus, the massive streams of data generated by these applications sometimes need to be summarized through compact and fast data structures to support queries and mining. One of the typical uses of these summaries is to provide the number of occurrences per data item in the input stream, so that we can give fast approximate answers to different types of aggregate or join size queries. Moreover, the data structures used for this purpose have to be able to adapt to the heavy-tailed distributions that these streams usually present.

As a whole, this is a complex problem and several data structures have been proposed to keep approximate counts of items using the smallest amount of memory possible and a fast per-item response time. The data structures proposed in this area are the counting bloom filters (CBF) [17], and their dynamic extensions: the spectral bloom filters (SBF) [10] and the dynamic count filters (DCF) [1]. CBF are a counter-oriented extension of bloom filters [6], where each presence bit of a bloom filter is substituted by a fixed size bit-counter. CBF may saturate and fail in their mission of keeping an accurate track of counts with skewed data. As an alternative, the SBF were designed to dynamically adapt the size of the counters to the characteristics of the data. SBF are composed of variable-sized counters that allow for real adaptiveness but fail to provide fast access times because of the indexing structures they require. Finally, DCF allow for dynamic environments with unlimited size counters and fast access times. However, they waste a large percentage of the memory they allocate in the presence of data skew, and fail to provide fast average response times because of the heavy reconstruction phases that they require.
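The saturation problem of CBF described above can be illustrated with a minimal Python sketch. It is not the implementation evaluated in the paper: the class name, the parameters `num_counters`, `num_hashes` and `bits`, and the use of salted SHA-256 digests as hash functions are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Illustrative counting bloom filter with fixed-width, saturating counters."""

    def __init__(self, num_counters, num_hashes, bits):
        self.num_counters = num_counters
        self.num_hashes = num_hashes
        self.max_count = (1 << bits) - 1   # largest value a counter can hold
        self.counters = [0] * num_counters

    def _positions(self, item):
        # Assumption: hash functions are simulated by salting SHA-256.
        # A set avoids updating the same counter twice for one item.
        positions = set()
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            positions.add(int.from_bytes(digest[:8], "big") % self.num_counters)
        return positions

    def insert(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1    # saturate instead of wrapping around

    def delete(self, item):
        for pos in self._positions(item):
            # A saturated counter has lost information, so it is left untouched.
            if 0 < self.counters[pos] < self.max_count:
                self.counters[pos] -= 1

    def estimate(self, item):
        # The smallest counter is the least inflated by collisions.
        return min(self.counters[pos] for pos in self._positions(item))


narrow = CountingBloomFilter(num_counters=64, num_hashes=3, bits=2)
wide = CountingBloomFilter(num_counters=64, num_hashes=3, bits=8)
for _ in range(10):
    narrow.insert("hot-item")
    wide.insert("hot-item")
print(narrow.estimate("hot-item"))  # 2-bit counters saturate at 3
print(wide.estimate("hot-item"))    # 8-bit counters still hold the true count, 10
```

With 2-bit counters the frequent item is undercounted as soon as it appears more than three times, which is the behavior with skewed data that motivates the dynamic structures (SBF and DCF) discussed next.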

In this paper, apart from thoroughly reviewing and analyzing all the data structures mentioned above, we propose a partitioning strategy that, once applied to a dynamic approach such as SBF or DCF, solves all the problems mentioned above. On the one hand, it helps to make SBF and DCF insensitive to changes in the characteristics of the data for the following reasons. First, it minimizes the amount of memory required at any time without limiting the counting possibilities of the methods, facilitating the use of dynamic methods in hardware implementations where memory may be a restriction. Second, it reduces the average response time because it simplifies the costly reconstruction phases of the previous non-partitioned methods, reducing the worst-case response times and adapting better to high data stream rates. Third, it preserves and, under some circumstances, improves the accuracy of previous approaches, making it possible to use it in environments where accuracy is important but memory is again a restriction.

On the other hand, under stress situations, such as changes in the intensity of the data flow, our partitioning strategy allows for fast response and high accuracy. The degradation shown by SBF in such situations is reduced significantly by our partitioning strategy combined with DCF.

In order to make this paper self-contained and to perform a thorough analysis of the methods studied and proposed, we make the following contributions:

  • We analyze the streamed data monitoring problem in perspective, reviewing and understanding the previous literature on the topic, and in particular CBF, SBF and DCF. The analysis allows us to understand the benefits and drawbacks of these structures.

  • We propose a generic partitioning strategy that can be easily adapted to dynamic approaches in general at a low cost, significantly improving their characteristics as mentioned above. Our partitioning strategy tackles the drawbacks of SBF and DCF with clear benefits in all the aspects mentioned above.

  • We present mathematical models that allow for two different analyses: (i) a memory comparison of the four approaches and (ii) a comparison of the complexity of the insert/delete/query/rebuild operations for each of the data structures. The models that we propose help to show the most interesting features of each data structure and to understand the results obtained through our real tests.

  • We evaluate all the strategies analyzed in the paper, including the partitioned and non-partitioned SBF and DCF, which is the first in-depth evaluation of such monitoring data structures. We do so in a set of very different scenarios to give a complete view of their characteristics.

  • As a general view of our contributions, our results show that our partitioning strategy is robust because it satisfies the constraints imposed by data stream environments, improving significantly in terms of memory space, response time, accuracy and versatility compared to the use of SBF and DCF alone.

This paper is organized as follows: In Section 2 we enumerate the different structures used for monitoring data streams and explain a set of scenarios in which the techniques proposed and evaluated in this paper can be used. In Section 3 we describe the counter-based filter structures proposed in the literature. In Section 4, we describe the partitioning strategy that we propose, and its application to previous strategies. Later, in Section 5, we seek the optimal number of partitions for PSBF and PDCF. In Section 6 we model and compare PSBF, PDCF, SBF, and DCF. In Section 7, we present the experimental results for the different scenarios and, finally, we conclude in Section 8.

Section snippets

Applications and evolution of data set monitoring

Data set monitoring has a wide range of application areas. The need to detect the presence of data items in a set, or to approximate aggregate answers for a query, makes it necessary to use structures such as bloom filters [6] or the other counting structures analyzed here. In this section, we explain a set of different situations where the bloom filters and the counting structures are necessary.

The bloom filter has been widely used for membership monitoring purposes: in multi-join queries [9],

Data structures for data set monitoring

In this section, we review the data structures proposed for data set monitoring. In order to make the paper self-contained, we start with a short description of the bloom filters. Then, we continue with descriptions of the counting bloom filters, the spectral bloom filters and the dynamic count filters. Table 1 describes the notation for the common variables used for the different data structures that we explain in this paper. Note that, throughout the paper, we use the terminology data element

Partitioning the dynamic data structures

We propose a simple and powerful partitioning strategy for the dynamic data structures used in monitoring data streams. This is an interesting solution because any dynamic approach may benefit from this. Our strategy solves the space and time penalties of SBF and DCF. We achieve this goal by clustering the streamed data set into γ different partitions. This way, each insert, delete and query operation only interacts with one partition, as opposed to the whole data structure. In addition, only
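The idea that each operation interacts with a single partition can be sketched in Python as follows. This is only an illustration under stated assumptions, not the PSBF or PDCF design from the paper: choosing the partition with an extra hash, the names `Partition` and `PartitionedCounterFilter`, the parameter `size_per_partition`, and modeling a rebuild as a simple increase of the partition's counter width are all hypothetical.

```python
import hashlib


def _hash(item, salt, modulus):
    # Assumption: hash functions are simulated by salting SHA-256.
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % modulus


class Partition:
    """One independent group of counters sharing a common bit width."""

    def __init__(self, size):
        self.bits = 1                 # current counter width of this partition
        self.counters = [0] * size
        self.rebuilds = 0

    def add(self, pos, delta):
        value = max(self.counters[pos] + delta, 0)
        while value >= (1 << self.bits):
            # Stand-in for a rebuild: only this partition is widened.
            self.bits += 1
            self.rebuilds += 1
        self.counters[pos] = value


class PartitionedCounterFilter:
    def __init__(self, gamma, size_per_partition, num_hashes):
        self.partitions = [Partition(size_per_partition) for _ in range(gamma)]
        self.num_hashes = num_hashes

    def _locate(self, item):
        # Every counter of an item lives in one partition (illustrative choice).
        part = self.partitions[_hash(item, "partition", len(self.partitions))]
        positions = {_hash(item, i, len(part.counters))
                     for i in range(self.num_hashes)}
        return part, positions

    def insert(self, item):
        part, positions = self._locate(item)
        for pos in positions:
            part.add(pos, 1)

    def delete(self, item):
        part, positions = self._locate(item)
        for pos in positions:
            part.add(pos, -1)

    def estimate(self, item):
        part, positions = self._locate(item)
        return min(part.counters[pos] for pos in positions)


pcf = PartitionedCounterFilter(gamma=4, size_per_partition=32, num_hashes=3)
for _ in range(100):
    pcf.insert("hot-item")
print(pcf.estimate("hot-item"))
print([p.bits for p in pcf.partitions])  # only one partition has grown
```

In this sketch a single frequent item forces wider counters only in the partition it maps to, while the remaining partitions keep their narrow counters, which is the intuition behind the memory and rebuild-time savings claimed for the partitioned structures.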

Optimum number of partitions (γ)

At this point, it is important to define the trade-off between the amount of memory saved by the partitioning strategy, the space occupied by the PV and the amount of buffer area required during the rebuild operations. This will allow us to obtain the optimum number of partitions γ for a partitioned data structure.

For simplicity, we start assuming a uniform data distribution, and a perfect hash distribution of the data set among the m positions of the bit vectors. We also assume a total number

Comparing SBF, DCF, PSBF, and PDCF

We use models to compare the four data structures from different points of view: the memory resources needed, the time to access and update a counter and the time to rebuild the data structures when a counter overflows. We evaluate the practical issues of these implementations in Section 7. For the analysis, we use the same terminology defined in Section 3.

Experimental results

We evaluate and compare the techniques described and analyzed up to now. The objective is to shed light on the practical issues of SBF, DCF, PSBF, and PDCF. The goal of our tests is to show the response of each approach to important constraints in the data stream environment such as memory budget, per-item response time, accuracy, and versatility.

All the approaches have been programmed in C. The implementations of SBF and DCF have followed the exact specifications given in [10]

Conclusions

In this paper, we review different data structures for data set monitoring and propose a partitioning scheme oriented to improve the qualities of previously proposed data structures. Low memory budget, fast per-item response time, accuracy and versatility are typical constraints in many data stream environments. Our study allows us to thoroughly compare the existing solutions for the problem we address and shows that thanks to the partitioning scheme that we propose, it is possible to comply

Josep Aguilar-Saborit graduated at the Facultat d’Informatica de Barcelona in 2002 and obtained his PhD from Universitat Politècnica de Catalunya in 2006. He is at present working at the IBM Toronto Laboratory specialized in DB2 run-time query processing. His interests are in the area of query processing and data streams among others.

References (30)

  • J. Aguilar-Saborit et al., Dynamic count filters, SIGMOD Record (2006)
  • N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited storage, in: PODS’99:...
  • N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in: STOC’96: Proceedings...
  • S. Babu et al., Continuous queries over data streams, SIGMOD Record (2001)
  • P.A. Bernstein et al., Using semi-joins to solve relational queries, J. ACM (1981)
  • B.H. Bloom, Space/time trade-offs in hash coding with allowable errors, Commun. ACM (1970)
  • L. Breslau, P. Cao, F. Fan, G. Phillips, S. Shenker, Web caching and zipf-like distributions: evidence and...
  • A. Broder, M. Mitzenmacher, Network applications of bloom filters: a survey, in: Proceedings of the Allerton...
  • M.-S. Chen et al., On applying hash filters to improving the execution of multi-join queries, VLDB J. (1997)
  • S. Cohen, Y. Matias, Spectral bloom filters, in: SIGMOD’03: Proceedings of the ACM SIGMOD International Conference on...
  • G. Cormode, S. Muthukrishnan, Summarizing and mining skewed data streams,...
  • M.E. Crovella, M.S. Taqqu, A. Bestavros, Heavy-tailed probability distributions in the world wide web (1998)...
  • D.J. DeWitt, R.H. Gerber, G. Graefe, M.L. Heytens, K.B. Kumar, M. Muralikrishna, Gamma – a high performance dataflow...
  • D.J. DeWitt, S. Ghanderaizadeh, D. Schneider, A performance analysis of the gamma database machine, in: SIGMOD’88:...
  • S. Dharmapurikar, P. Krishnamurthy, D.E. Taylor, Longest prefix matching using bloom filters, in: SIGCOMM’03:...
    Pedro Trancoso received the PhD degree in computer science from the University of Illinois at Urbana-Champaign, Illinois, USA, in 1998. He is currently an Assistant Professor at the Department of Computer Science of the University of Cyprus, Nicosia, Cyprus. His research interest is in the area of computer architecture, with a focus on the memory hierarchy, architecture-aware optimizations for database workloads, multi-core architectures, and the use of graphics processors. He is a member of the HiPEAC Network of Excellence and of the Editorial Board for the International Journal of High-Performance System Architecture. He is the head of the CASPER (Computer Architecture and Systems Performance Evaluation Research) research group – <http://www.cs.ucy.ac.cy/carch/casper>.

    Victor Muntes-Mulero graduated at the Facultat d’Informatica de Barcelona in 2002 and obtained his PhD from Universitat Politècnica de Catalunya in 2007. His interests are in the area of optimization of large join queries, performance of DBMS, graph databases and data privacy among others.

    Josep L. Larriba-Pey graduated at the Facultat d’Informatica de Barcelona in 1989 and obtained his PhD from Universitat Politècnica de Catalunya in 1995. He is at present director of DAMA-UPC (www.dama.upc.edu), a research and technology transfer group working on data quality, relational database performance and data exploration over graph databases.