Computing discounted multidimensional hierarchical aggregates using modified Misra Gries algorithm
Introduction
Data stream processing is an emerging domain of applications where data are modeled not as persistent relations but rather as transient data streams. Examples of such applications include network monitoring, financial applications, telecommunications data management, sensor networks, web applications, large-volume datasets, and so on. The data from these sources are often continuous, rapid, time-varying, possibly unpredictable, and unbounded in nature. Such applications cannot afford to store or revisit the data and often require fast, real-time responses.
Many application domains have data that contain hierarchical attributes, such as Time (Year, Month, Hour, Minute, Second), Geographic Location (Continent, Country, State, City), and IP addresses (192.*.*.*, 192.168.*.*, 192.168.1.*, 192.168.1.1). Analyzing such data at multiple aggregation levels simultaneously, while the data arrive in a stream, is much more challenging (and meaningful) than analyzing flat data; this task is known as the Hierarchical Heavy Hitters (HHH) problem. Formally, we compute a φ-HHH summary as follows: consider a two-dimensional lattice (shown in Fig. 1) formed by creating nodes for each combination of source IP 1.2.3.4/32 and destination IP 5.6.7.8/32 addresses (see Section 2.2). For a given threshold φ, we report as heavy hitters only those nodes (prefixes of source and destination IPs) with frequency exceeding φN (for a stream of N elements), after removing the frequency of all descendant nodes (descendant prefixes of source and destination IPs) that are also heavy hitters. In other words, a reported φ-HHH element does not contain the frequency of any other descendant φ-HHH element, but may contain the frequencies of non-HHH elements.
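The discounted-frequency rule above can be illustrated offline on a one-dimensional IP-prefix hierarchy. The exact-count approach and function names below are illustrative only; the algorithms discussed in this paper compute the same summary approximately, in a single pass over the stream.

```python
from collections import Counter

def generalizations(ip):
    """All prefixes of a dotted IP, most specific first:
    1.2.3.4 -> 1.2.3.4, 1.2.3.*, 1.2.*.*, 1.*.*.*, *.*.*.*"""
    octets = ip.split(".")
    return [".".join(octets[:k] + ["*"] * (4 - k)) for k in range(4, -1, -1)]

def ancestors(prefix):
    """Strict generalizations of a prefix, e.g. 1.2.3.* -> 1.2.*.*, ..."""
    octets = prefix.split(".")
    k = sum(o != "*" for o in octets)
    return [".".join(octets[:j] + ["*"] * (4 - j)) for j in range(k - 1, -1, -1)]

def exact_hhh(stream, phi):
    """Offline phi-HHH: a prefix is reported iff its count, after
    discounting the counts of its descendant HHHs, reaches phi * N."""
    n = len(stream)
    counts = Counter()
    for ip in stream:
        for p in generalizations(ip):
            counts[p] += 1
    discounted = dict(counts)
    hhh = {}
    # Visit prefixes most-specific first, so each descendant is decided
    # (and its mass discounted from every ancestor) before its ancestors.
    for p in sorted(counts, key=lambda q: q.count("*")):
        if discounted[p] >= phi * n:
            hhh[p] = discounted[p]
            for anc in ancestors(p):
                discounted[anc] -= hhh[p]
    return hhh
```

For example, with a 10-element stream containing 5 copies of 1.2.3.4 and φ = 0.4, the node 1.2.3.4 is a φ-HHH, and its ancestor 1.2.3.* is not, because after discounting 1.2.3.4 its remaining frequency falls below φN.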
A HHH-summary of network traffic is of particular interest to a network monitoring team because it may reveal important patterns in the underlying data. For example, the department traffic may be composed of numerous peer-to-peer traffic connections (such as torrents or online gaming) that can be identified as one generalized traffic connection (composed of many lower level packets having a common pattern) in the HHH-summary.
The exact computation of HHH is not possible without space at least linear in the number of input elements [1], [2]; therefore, the paradigm of approximation is adopted in resource-constrained environments such as data streams. Consequently, researchers have recently focused on efficient computation (by minimizing the memory requirement) of HHH with an acceptable precision ε, where ε is a given error tolerance between 0 and 1. Algorithms for computing HHH are compared in terms of space usage and update cost, which are often bounded using the error that the proposed solution induces in the estimation.
The contributions of this paper are as follows: (1) we identify a coverage problem in existing HHH techniques (see Section 3); (2) to address the coverage problem, we propose an error-estimation technique; (3) we highlight the fundamental problem of using generalized error estimates for window-based HHH calculation; (4) addressing the above issues, we propose an efficient HHH algorithm using a modified Misra–Gries technique, which improves on the theoretical performance of existing HHH algorithms in terms of memory requirement and update cost; (5) experimental results using real IP datasets demonstrate that the proposed algorithm outperforms existing benchmark techniques on one-dimensional data and has comparable results on multidimensional data.
The remainder of this paper is organized as follows: Section 2 formulates the problem and explains the notation used throughout the paper. Section 3 discusses existing HHH algorithms (e.g., Full Ancestry and Partial Ancestry) and their coverage problem. In Section 4 we describe the HHH-MG algorithm and develop theoretical proofs on bounds for space and update costs. The implementation details and experimental results are provided in Section 5. Finally, Section 6 provides insights and analysis on our work and the paper concludes in Section 7.
Section snippets
Notation
Let D be a data space of d dimensions, where S is a continuous stream of records drawn from D. In the data space, each record is characterized by a set of attributes, such as (a1, a2, …, ad), which is referred to as the attribute space. Next, each attribute ai is drawn from a hierarchy Hi. In order to compute the φ-HHH summary of a data stream, following Cormode et al. [3], we conceive of d-dimensional data with hierarchical attributes arranged in a mathematical
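The lattice of generalizations of a d-dimensional record can be enumerated as the cross product of each attribute's hierarchy path. A minimal sketch, assuming the 5-level IP hierarchies from the introduction (helper names are illustrative):

```python
from itertools import product

def ip_generalizations(ip):
    """The 5-level hierarchy path of a dotted IP, specific to general."""
    octets = ip.split(".")
    return [".".join(octets[:k] + ["*"] * (4 - k)) for k in range(4, -1, -1)]

def lattice_nodes(record):
    """All lattice nodes generalizing a d-dimensional record: one node
    per combination of levels across the d attribute hierarchies."""
    return list(product(*(ip_generalizations(a) for a in record)))

# A 2-D (source IP, destination IP) record generalizes to 5 * 5 = 25
# lattice nodes, matching the two-dimensional lattice of Fig. 1.
nodes = lattice_nodes(("1.2.3.4", "5.6.7.8"))
```

Each arriving record therefore touches a number of lattice nodes that grows with the product of the hierarchy depths, which is why per-update cost matters so much in this setting.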
Full and Partial Ancestry HHH algorithms: addressing the coverage problem
In this section we describe the coverage problem of two existing HHH techniques (the Full Ancestry (FA) and Partial Ancestry (PA) algorithms) [3] by providing an example scenario where the algorithms fail to report some of the HHH. Next, we identify the cause of the problem, which is the use of an error value in the estimation of the frequency. Finally, we provide a solution that addresses the coverage problem. Both FA and PA maintain a trie data structure. Both the algorithms divide
Proposed algorithms
In the previous section we addressed the coverage issue of the FA and PA HHH algorithms. Although the FA and PA algorithms can now give accurate results using the proposed coverage solution, for many practical streaming applications their theoretical space and update costs remain high. To improve the space and time requirements, in this section we propose a new algorithm (HHH-MG) to compute HHH with better theoretical bounds on space and time complexity.
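HHH-MG builds on the Misra–Gries frequency summary. As a point of reference, the classic one-dimensional version (which the proposed algorithm modifies for hierarchical, multidimensional data) can be sketched as:

```python
def misra_gries(stream, k):
    """Classic Misra-Gries summary using at most k - 1 counters.
    Every element with true frequency greater than n/k survives, and
    each kept counter undercounts its true frequency by at most n/k."""
    counters = {}
    for x in stream:
        if x in counters:
            counters[x] += 1
        elif len(counters) < k - 1:
            counters[x] = 1
        else:
            # Decrement-all step: reduce every counter and drop zeros.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters
```

The summary uses O(k) space regardless of stream length, which is the property that the space and update-cost bounds developed in this section exploit.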
Implementation and evaluation
We have compared the HHH-MG algorithm with the existing FA (termed Full) and PA (termed Partial) algorithms on both one- and two-dimensional data for a range of parameters (i.e., the error tolerance ε, the threshold φ, and varying stream length N). We have implemented all algorithms using the Java programming language. The data structures used are based on hashing techniques, which require one hashing function to look up a particular record in the data structure.
Datasets: In our experiments we have used real Internet traffic
Related work
Finding hierarchical heavy hitters has seen a great deal of interest in the literature. Zhang et al. [12] and, more recently, Duffield et al. [13], [14] have introduced techniques for computing HHH that maintain a k-bit trie data structure, an extension of the multi-bit trie data structure. Each node of the trie has at most 2^k children, and for each incoming record a search (e.g., a static IP lookup) is initiated that finds the best matching node by traversing the trie from the
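As a sketch of this trie-based approach, the example below counts traffic at every prefix node along the insertion path, using a fixed 8-bit stride (one octet per level) rather than the configurable k-bit stride of the cited works; the names are illustrative.

```python
class TrieNode:
    __slots__ = ("children", "count")
    def __init__(self):
        self.children = {}   # one child per distinct next octet
        self.count = 0       # traffic aggregated at this prefix

def insert(root, ip):
    """Walk the trie one octet (8-bit stride) at a time, incrementing
    every prefix node on the path; nodes are created on demand."""
    node = root
    node.count += 1          # the root aggregates all traffic
    for octet in ip.split("."):
        node = node.children.setdefault(octet, TrieNode())
        node.count += 1

def prefix_count(root, prefix_octets):
    """Count of records matching a prefix, given as a list of octets."""
    node = root
    for octet in prefix_octets:
        if octet not in node.children:
            return 0
        node = node.children[octet]
    return node.count
```

With a k-bit stride the traversal visits fewer, wider nodes, trading memory per node for a shorter best-matching-node search.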
Conclusion
Constructing a multidimensional hierarchical summary is an important tool for emerging stream-processing applications, where data is modeled as transient data streams. Applications like network monitoring, financial applications, telecommunications data management, sensor networks, and web applications use such a summary for decision making and strategic planning. In this paper, we have proposed a new streaming ε-approximation algorithm (HHH-MG) for computing HHH based on a modified Misra–Gries
References (25)
- et al., Finding repeated elements, Sci. Comput. Program. (1982)
- et al., Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications
- et al., Finding hierarchical heavy hitters in data streams
- et al., New directions in traffic measurement and accounting, SIGCOMM Comput. Commun. Rev. (2002)
- et al., Finding frequent items in data streams
- et al., Finding hierarchical heavy hitters in streaming data, ACM Trans. Knowl. Discov. Data (2008)
- et al., Space-optimal heavy hitters with strong error bounds, ACM Trans. Database Syst. (2010)
- et al., Efficient computation of frequent and top-k elements in data streams
- et al., Space complexity of hierarchical heavy hitters in multi-dimensional data streams
- et al., Finding the frequent items in streams of data, Commun. ACM (2009)
- Exploiting hierarchical domain structure to compute similarity, ACM Trans. Inf. Syst. (TOIS)
Cited by (3)
- Top-k frequent items and item frequency tracking over sliding windows of any size, Information Sciences, 2019
- A Spatiotemporal Data Summarization Approach for Real-Time Operation of Smart Grid, IEEE Transactions on Big Data, 2020
- Computing hierarchical summary from two-dimensional big data streams, IEEE Transactions on Parallel and Distributed Systems, 2018
Zubair Shah received a B.S. degree in Computer Science from University of Peshawar (Pakistan) in 2005, and an M.S. degree in Computer System Engineering from Politecnico di Milano (Italy), in 2012. He is currently working toward a Ph.D. degree at the School of Engineering and IT (SEIT), University of New South Wales (Canberra, Australia). He had been a system engineer from 2006 to 2009 in NADRA (Pakistan). He worked as a lecturer from March 2013 to September 2014 in City University of Science and Technology (Peshawar, Pakistan). His research interests include data mining and summarization techniques for scalable network traffic analysis, big data analytics, and machine learning techniques.
Abdun Naser Mahmood received the Ph.D. degree from the University of Melbourne, Australia, in 2008; the M.Sc. degree in Computer Science and the B.Sc. degree in Applied Physics and Electronics from the University of Dhaka, Bangladesh, in 1999 and 1997, respectively. He has been working as an academic in Computer Science since 1999. He is currently in the School of Engineering and IT, University of New South Wales. Previously, he has been a lecturer since 2000, an Assistant Professor since 2003 at the University of Dhaka. Between 2008 and 2011, he has been a Postdoctoral Research Fellow at the Royal Melbourne Institute of Technology. His research interests include data mining techniques for scalable network traffic analysis, anomaly detection, and industrial SCADA security. He has published his work in various IEEE Transactions and A-tier international journals and conferences.
Spike (Michael) Barlow is a computer scientist. He received his Ph.D. from UNSW Australia in 1991. Following which he worked at the University of Queensland (Australia) and Nippon Telegraph & Telephone’s Human Communication Laboratories in Japan. In 1996 he returned to UNSW Canberra in a teaching and research role; where he still works today within the School of Engineering and Information Technology. His research interests straddle the areas of simulation, virtual environments, machine learning, serious games, and human computer interaction. He has produced more than 100 papers in quality international journals and conferences.