Dynamic adaptive data structures for monitoring data streams
Introduction
Monitoring streamed data is important in many scenarios. Examples include telecom companies or Internet service providers that need to collect exact or approximate statistics on access to the services they provide; retail chains, whose transaction processes generate large amounts of streamed data that must be monitored for statistical records; and, in the database management context, cases where it is necessary to approximate aggregate answers to user queries over data stream windows [4], [16], [20], [19], or to estimate data frequency peaks or join sizes [2], [3].
Thus, the massive data streams generated by these applications sometimes need to be summarized through compact and fast data structures that support queries and mining. A typical use of these summaries is to provide the number of occurrences of each data item in the input stream, so that fast approximate answers can be given to different types of aggregate or join size queries. Moreover, the data structures used for this purpose have to adapt to the heavy-tailed distributions that these streams usually present.
As a whole, this is a complex problem, and several data structures have been proposed to keep approximate counts of items using as little memory as possible while offering a fast per-item response time. The data structures proposed in this area are the counting bloom filters (CBF) [17] and their dynamic extensions: the spectral bloom filters (SBF) [10] and the dynamic count filters (DCF) [1]. CBF are a counter-oriented extension of bloom filters [6], where each presence bit of a bloom filter is replaced by a fixed-size bit counter. CBF may saturate and then fail to keep accurate counts when the data are skewed. As an alternative, SBF were designed to dynamically adapt the size of the counters to the characteristics of the data. SBF are composed of variable-sized counters that allow for real adaptiveness but fail to provide fast access times because of the indexing structures they require. Finally, DCF allow for dynamic environments with unlimited-size counters and fast access times. However, they waste a large percentage of the memory they allocate in the presence of data skew, and fail to provide fast average response times because of the heavy reconstruction phases that they require.
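To make the saturation problem of fixed-size counters concrete, the following is a minimal sketch of a CBF, not the implementation evaluated in the paper. The class name, the parameters `m`, `d` and `bits`, and the use of salted SHA-256 as the hash family are all assumptions made for illustration.

```python
import hashlib


class CountingBloomFilter:
    """Toy CBF: m counters of `bits` bits each and d hash functions.

    All names and parameter choices here are illustrative assumptions.
    """

    def __init__(self, m=64, d=3, bits=4):
        self.m = m
        self.d = d
        self.max_count = (1 << bits) - 1  # largest value a fixed-size counter holds
        self.counters = [0] * m

    def _positions(self, item):
        # Derive d positions by salting a SHA-256 digest (an arbitrary choice).
        for i in range(self.d):
            digest = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.m

    def insert(self, item):
        for pos in self._positions(item):
            if self.counters[pos] < self.max_count:
                self.counters[pos] += 1
            # A saturated counter stays at max_count: further inserts are lost.

    def query(self, item):
        # The minimum over the item's counters is the count estimate.
        return min(self.counters[pos] for pos in self._positions(item))


cbf = CountingBloomFilter(m=64, d=3, bits=4)
for _ in range(20):
    cbf.insert("hot-item")      # 20 occurrences of a frequent item
cbf.insert("rare-item")
print(cbf.query("hot-item"))    # 15: the 4-bit counters have saturated
```

With 4-bit counters the frequent item is reported as 15 although it occurred 20 times, which is the kind of inaccuracy under skew that the dynamic structures (SBF, DCF) are meant to avoid.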
In this paper, apart from thoroughly reviewing and analyzing all the data structures mentioned above, we propose a partitioning strategy that, once applied to a dynamic approach such as SBF or DCF, addresses the problems mentioned above. On one hand, it helps to make SBF and DCF insensitive to changes in the characteristics of the data, for the following reasons. First, it minimizes the amount of memory required at any time without limiting the counting capacity of the methods, which facilitates the use of dynamic methods in hardware implementations where memory may be a restriction. Second, it reduces the average response time because it simplifies the costly reconstruction phases of the previous non-partitioned methods, reducing the worst-case response times and adapting better to high data stream rates. Third, it preserves and, under some circumstances, improves the accuracy of previous approaches, making it possible to use it in environments where accuracy is important but memory is again a restriction.
On the other hand, under stress situations, like changes in the intensity of the data flux, our partitioning strategy allows for fast response and high accuracy. The degradation shown by SBF in such situations is reduced significantly by our partitioning strategy combined with DCF.
In order to make this paper self-contained and to perform a thorough analysis of the methods studied and proposed, we make the following contributions:
- We analyze the streamed data monitoring problem in perspective, reviewing the previous literature on the topic, and in particular CBF, SBF and DCF. The analysis allows us to understand the benefits and drawbacks of these structures.
- We propose a generic partitioning strategy that can be easily adapted to dynamic approaches in general at a low cost, significantly improving their characteristics. Our partitioning strategy tackles the drawbacks of SBF and DCF, with clear benefits in memory usage, response time and accuracy.
- We present mathematical models that allow for two different analyses: (i) a memory comparison of the four approaches and (ii) a comparison of the complexity of the insert/delete/query/rebuild operations for each of the data structures. The models that we propose help to show the most interesting features of each data structure and to understand the results obtained through our real tests.
- We evaluate all the strategies analyzed in the paper, including the partitioned and non-partitioned SBF and DCF, which is the first in-depth evaluation of such monitoring data structures. We do that in a set of very different scenarios to give a complete view of their characteristics.
- As a general view of our contributions, our results show that our partitioning strategy is robust because it satisfies the constraints imposed by data stream environments, improving significantly in terms of memory space, response time, accuracy and versatility compared to the use of SBF and DCF alone.
This paper is organized as follows. In Section 2, we enumerate the different structures used for monitoring data streams and explain a set of scenarios in which the techniques proposed and evaluated in this paper can be used. In Section 3, we describe the counter-based filter structures proposed in the literature. In Section 4, we describe the partitioning strategy that we propose and its application to previous strategies. In Section 5, we seek the optimal number of partitions for the partitioned versions of SBF and DCF (PSBF and PDCF). In Section 6, we model and compare PSBF, PDCF, SBF, and DCF. In Section 7, we present the experimental results for the different scenarios and, finally, we conclude in Section 8.
Section snippets
Applications and evolution of data set monitoring
Data set monitoring has a wide range of application areas. The need to detect the presence of data items in a set, or to approximate aggregate answers for a query, makes it necessary to use structures such as bloom filters [6] or the other counting structures analyzed here. In this section, we explain a set of different situations where bloom filters and counting structures are necessary.
The bloom filter has been widely used for membership monitoring purposes: in multi-join queries [9],
Data structures for data set monitoring
In this section, we review the data structures proposed for data set monitoring. In order to make the paper self-contained, we start with a short description of the bloom filters. Then, we continue with descriptions of the counting bloom filters, the spectral bloom filters and the dynamic count filters. Table 1 describes the notation for the common variables used for the different data structures that we explain in this paper. Note that, throughout the paper, we use the terminology data element
Partitioning the dynamic data structures
We propose a simple and powerful partitioning strategy for the dynamic data structures used in monitoring data streams. This is an interesting solution because any dynamic approach may benefit from this. Our strategy solves the space and time penalties of SBF and DCF. We achieve this goal by clustering the streamed data set into γ different partitions. This way, each insert, delete and query operation only interacts with one partition, as opposed to the whole data structure. In addition, only
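The routing idea described in this snippet, where each operation touches a single one of the γ partitions, can be sketched as follows. This is a hedged illustration only: the class name, the hash choice, and in particular the policy of widening the counters of just the overflowing partition are assumptions, not the PSBF or PDCF algorithms specified in the paper.

```python
import hashlib


def _h(salt, item, mod):
    # Salted SHA-256 reduced modulo `mod` (an arbitrary, illustrative hash family).
    digest = hashlib.sha256(f"{salt}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % mod


class PartitionedCounters:
    """Toy partitioned counter structure (all names are hypothetical)."""

    def __init__(self, gamma=4, slots=16, d=2, bits=2):
        self.gamma, self.slots, self.d = gamma, slots, d
        self.parts = [[0] * slots for _ in range(gamma)]
        self.bits = [bits] * gamma    # counter width, tracked per partition
        self.rebuilds = [0] * gamma   # how often each partition was widened

    def _locate(self, item):
        # One hash selects the partition; d more select slots inside it.
        p = _h("part", item, self.gamma)
        return p, [_h(i, item, self.slots) for i in range(self.d)]

    def insert(self, item):
        p, positions = self._locate(item)
        for pos in positions:
            if self.parts[p][pos] == (1 << self.bits[p]) - 1:
                # Assumed policy: on overflow, widen only this partition.
                self.bits[p] += 1
                self.rebuilds[p] += 1
            self.parts[p][pos] += 1

    def query(self, item):
        p, positions = self._locate(item)
        return min(self.parts[p][pos] for pos in positions)

    def memory_bits(self):
        # Bits needed if every partition stores `slots` counters of its own width.
        return sum(self.slots * b for b in self.bits)


pc = PartitionedCounters(gamma=4, slots=16, d=2, bits=2)
for _ in range(10):
    pc.insert("hot-item")
print(pc.bits, pc.memory_bits())
```

In this sketch only the partition that holds the frequent item grows its counters, so the other partitions keep their original width. A non-partitioned structure with a single global counter width would have to widen every counter instead.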
Optimum number of partitions (γ)
At this point, it is important to define the trade-off between the amount of memory saved by the partitioning strategy, the space occupied by the PV and the amount of buffer area required during the rebuild operations. This will allow us to obtain the optimum number of partitions γ for a partitioned data structure.
For simplicity, we start by assuming a uniform data distribution and a perfect hash distribution of the data set among the m positions of the bit vectors. We also assume a total number
Comparing SBF, DCF, PSBF, and PDCF
We use models to compare the four data structures from different points of view: the memory resources needed, the time to access and update a counter and the time to rebuild the data structures when a counter overflows. We evaluate the practical issues of these implementations in Section 7. For the analysis, we use the same terminology defined in Section 3.
Experimental results
We now evaluate and compare the techniques described and analyzed so far. The objective is to shed light on the practical issues of SBF, DCF, PSBF, and PDCF. The goal of our tests is to show how each approach responds to important constraints in the data stream environment, such as memory budget, per-item response time, accuracy, and versatility.
All the approaches have been programmed in C. The implementations of SBF and DCF have followed the exact specifications given in [10]
Conclusions
In this paper, we review different data structures for data set monitoring and propose a partitioning scheme oriented to improve the qualities of previously proposed data structures. Low memory budget, fast per-item response time, accuracy and versatility are typical constraints in many data stream environments. Our study allows us to thoroughly compare the existing solutions for the problem we address and shows that thanks to the partitioning scheme that we propose, it is possible to comply
References (30)
- et al., Dynamic count filters, SIGMOD Record (2006)
- N. Alon, P.B. Gibbons, Y. Matias, M. Szegedy, Tracking join and self-join sizes in limited storage, in: PODS’99:...
- N. Alon, Y. Matias, M. Szegedy, The space complexity of approximating the frequency moments, in: STOC’96: Proceedings...
- et al., Continuous queries over data streams, SIGMOD Record (2001)
- et al., Using semi-joins to solve relational queries, J. ACM (1981)
- Space/time trade-offs in hash coding with allowable errors, Commun. ACM (1970)
- L. Breslau, P. Cao, F. Fan, G. Phillips, S. Shenker, Web caching and zipf-like distributions: evidence and...
- A. Broder, M. Mitzenmacher, Network applications of bloom filters: a survey, in: Proceedings of the Allerton...
- et al., On applying hash filters to improving the execution of multi-join queries, VLDB J. (1997)
- S. Cohen, Y. Matias, Spectral bloom filters, in: SIGMOD’03: Proceedings of the ACM SIGMOD International Conference on...
Josep Aguilar-Saborit graduated at the Facultat d’Informatica de Barcelona in 2002 and obtained his PhD from Universitat Politècnica de Catalunya in 2006. He is at present working at the IBM Toronto Laboratory specialized in DB2 run-time query processing. His interests are in the area of query processing and data streams among others.
Pedro Trancoso received the PhD degree in computer science from the University of Illinois at Urbana-Champaign, Illinois, USA, in 1998. He is currently an Assistant Professor at the Department of Computer Science of the University of Cyprus, Nicosia, Cyprus. His research interest is in the area of computer architecture, with a focus on the memory hierarchy, architecture-aware optimizations for database workloads, multi-core architectures, and the use of graphics processors. He is a member of the HiPEAC Network of Excellence and of the Editorial Board for the International Journal of High-Performance System Architecture. He is the head of the CASPER (Computer Architecture and Systems Performance Evaluation Research) research group – <http://www.cs.ucy.ac.cy/carch/casper>.
Victor Muntes-Mulero graduated at the Facultat d’Informatica de Barcelona in 2002 and obtained his PhD from Universitat Politècnica de Catalunya in 2007. His interests are in the area of optimization of large join queries, performance of DBMS, graph databases and data privacy among others.
Josep L. Larriba-Pey graduated at the Facultat d’Informatica de Barcelona in 1989 and obtained his PhD from Universitat Politècnica de Catalunya in 1995. He is at present director of DAMA-UPC (www.dama.upc.edu), a research and technology transfer group working on data quality, relational database performance and data exploration over graph databases.