Performance Evaluation

Volume 91, September 2015, Pages 170-186

Computing discounted multidimensional hierarchical aggregates using modified Misra Gries algorithm

https://doi.org/10.1016/j.peva.2015.06.011

Abstract

Finding the “Top k” list or heavy hitters is an important function in many computing applications, including database joins, data warehousing (e.g., OLAP), web caching and hits, network usage monitoring, and detecting DDoS attacks. While most applications work on traditional “flat” data, many domains contain data whose attributes take values from hierarchies, e.g., time and geographic location, or source and destination IP addresses. Hierarchical heavy hitters generated from such hierarchical attributes offer an insightful and generalized view of the data. When data arrives in a stream, infrequent data elements must be deleted due to space constraints. However, unlike traditional heavy hitters, which can ignore deleted elements, hierarchical heavy hitters may be formed at a higher level of the hierarchy by some of these deleted elements. Furthermore, unlike traditional heavy hitters, hierarchical heavy hitters stand in ancestor–descendant relationships, which require the counts of descendant heavy hitters to be discounted at higher levels of the hierarchy. This is particularly challenging in a streaming environment, where the incoming data items cannot all be stored or revisited. The problem is generally addressed by accepting an error constraint ϵ on the precision of the count of hierarchical heavy hitters, since an exact count cannot be guaranteed under the memory constraints of the streaming environment. In this work, we propose a new streaming ϵ-approximation algorithm (HHH-MG) for computing Hierarchical Heavy Hitters based on a modified Misra–Gries heavy hitter algorithm. The proposed algorithm guarantees ϵ-approximation precision with improved worst-case time and space bounds compared to previous algorithms: it requires O(η/ϵ) space overall and O(η) updates per element of the data, where η is a small constant. We provide theoretical proofs of the space and time requirements, and we have experimentally compared the proposed algorithm with benchmark techniques. Experimental results demonstrate that the proposed algorithm requires fewer updates per element of data and, on average, less memory. For the experimental validation, we have used both synthetic data derived from an open source generator and real benchmark datasets from an international Internet Service Provider.

Introduction

Data stream processing is an emerging domain of applications in which data are modeled not as persistent relations but as transient data streams. Examples of such applications include network monitoring, financial applications, telecommunications data management, sensor networks, web applications, and large-volume datasets. The data from these sources are often continuous, rapid, time-varying, possibly unpredictable, and unbounded in nature. Such applications cannot afford to store or revisit the data and often require fast, real-time responses.

Many application domains have data that contain hierarchical attributes, such as Time (Year, Month, Hour, Minute, Second), Geographic Location (Continent, Country, State, City), and IP addresses (192.*.*.*, 192.168.*.*, 192.168.1.*, 192.168.1.1). Analyzing such data at multiple aggregation levels simultaneously, while the data arrive in a stream, is much more challenging (and meaningful) than analyzing flat data; this is known as the Hierarchical Heavy Hitters (HHH) problem. Formally, we compute a ϕ-HHH summary as follows: consider the two-dimensional lattice (shown in Fig. 1) formed by creating a node for each combination of prefixes of the source IP 1.2.3.4/32 and the destination IP 5.6.7.8/32 (see Section 2.2). For a given threshold ϕ, we report a node (a pair of source and destination IP prefixes) as a heavy hitter only if its frequency exceeds ϕN after removing the frequencies of all of its descendant nodes that are themselves heavy hitters. In other words, a reported ϕ-HHH element does not include the frequency of any descendant ϕ-HHH element, but may include the frequencies of non-HHH elements.
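
To make the definition concrete, the following minimal offline sketch (our illustration; the class and method names are ours, and it is not an algorithm from the paper) computes an exact one-dimensional ϕ-HHH summary over byte-granularity IPv4 prefixes by rolling counts up the hierarchy bottom-up and discounting every descendant that has already been reported:

```java
import java.util.HashMap;
import java.util.Map;

/** Offline illustration of the phi-HHH definition for one-dimensional,
 *  byte-granularity IPv4 prefixes. Not a streaming algorithm. */
public class ExactHHH {

    /** ipCounts maps full IPs ("1.2.3.4") to their exact frequencies;
     *  n is the stream length. Returns every prefix whose count, after
     *  discounting descendant HHHs, reaches phi * n. */
    public static Map<String, Long> compute(Map<String, Long> ipCounts,
                                            double phi, long n) {
        long threshold = (long) Math.ceil(phi * n);
        Map<String, Long> hhh = new HashMap<>();
        Map<String, Long> level = new HashMap<>(ipCounts); // 4-byte prefixes
        for (int len = 4; len >= 0; len--) {
            Map<String, Long> parent = new HashMap<>();
            for (Map.Entry<String, Long> e : level.entrySet()) {
                long c = e.getValue();
                if (c >= threshold) {          // heavy hitter at this level:
                    hhh.put(e.getKey(), c);    // report it, and
                    c = 0;                     // discount it from ancestors
                }
                if (len > 0)                   // roll the remainder up a level
                    parent.merge(truncate(e.getKey()), c, Long::sum);
            }
            level = parent;
        }
        return hhh;
    }

    /** Drop the last byte: "1.2.3.4" -> "1.2.3"; "1" -> "" (the root). */
    private static String truncate(String prefix) {
        int cut = prefix.lastIndexOf('.');
        return cut < 0 ? "" : prefix.substring(0, cut);
    }
}
```

A streaming algorithm cannot keep the exact per-IP counts this sketch starts from; that gap is precisely what the approximation algorithms discussed next address.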

A HHH-summary of network traffic is of particular interest to a network monitoring team because it may reveal important patterns in the underlying data. For example, the department traffic may be composed of numerous peer-to-peer traffic connections (such as torrents or online gaming) that can be identified as one generalized traffic connection (composed of many lower level packets having a common pattern) in the HHH-summary.

The exact computation of HHH is not possible without space at least linearly proportional to the number of input elements [1], [2]; therefore, the paradigm of approximation is adopted in resource-constrained environments such as data streams. Consequently, researchers have recently focused on computing HHH efficiently (i.e., with minimal memory) to within an acceptable precision ϵ, where ϵ is a given error tolerance between 0 and 1. For example, with ϕ = 0.1 and ϵ = 0.01, every prefix whose discounted frequency exceeds 0.1N must be reported, while no reported count may deviate from the true count by more than 0.01N. Algorithms for computing HHH are compared in terms of space usage and update cost, which are typically bounded in terms of the error ϵ that the solution induces in the estimates.

The contributions of this paper are as follows: (1) we identify a coverage problem in existing HHH techniques (see Section 3); (2) to address the coverage problem, we propose an error estimation technique; (3) we highlight the fundamental problem of using generalized error estimates for window-based HHH calculation; (4) to address the above issues, we propose an efficient HHH algorithm using a modified Misra–Gries technique, which improves on the theoretical performance of existing HHH algorithms in terms of memory requirement and update cost; and (5) experimental results using real IP datasets demonstrate that the proposed algorithm outperforms existing benchmark techniques on one-dimensional data and is comparable on multidimensional data.

The remainder of this paper is organized as follows: Section 2 formulates the problem and explains the notation used throughout the paper. Section 3 discusses existing HHH algorithms (the Full Ancestry and Partial Ancestry algorithms) and their coverage problem. Section 4 describes the HHH-MG algorithm and proves bounds on its space and update costs. Implementation details and experimental results are provided in Section 5. Finally, Section 6 provides further insights and analysis, and Section 7 concludes the paper.

Section snippets

Notation

Let D be a data space of d dimensions, and let S = {R1, R2, R3, …, RN} be a continuous stream of N records drawn from D. Each record Ri in D is characterized by a set of d attributes, Ri = {a1, a2, a3, …, ad}, referred to as the attribute space, and each attribute ai is drawn from a hierarchy hi. In order to compute the ϕ-HHH summary of a data stream, following Cormode et al. [3], we conceive of d-dimensional data with hierarchical attributes arranged in a mathematical lattice (see Fig. 1).
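
For intuition, the sketch below (names are ours, assuming byte-granularity IPv4 hierarchies as in Fig. 1) enumerates the 25 lattice nodes, one per combination of generalization levels, to which a single two-dimensional record contributes:

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates the lattice nodes (combinations of generalization levels)
 *  that one two-dimensional record of IPv4 addresses rolls up to. */
public class Lattice {

    /** Keep the first `level` bytes and wildcard the rest:
     *  prefix("1.2.3.4", 2) -> "1.2.*.*". */
    static String prefix(String ip, int level) {
        String[] bytes = ip.split("\\.");
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 4; i++) {
            sb.append(i < level ? bytes[i] : "*");
            if (i < 3) sb.append('.');
        }
        return sb.toString();
    }

    /** All 25 (source level, destination level) combinations. */
    public static List<String> generalizations(String src, String dst) {
        List<String> nodes = new ArrayList<>();
        for (int i = 0; i <= 4; i++)
            for (int j = 0; j <= 4; j++)
                nodes.add(prefix(src, i) + " -> " + prefix(dst, j));
        return nodes;
    }

    public static void main(String[] args) {
        Lattice.generalizations("1.2.3.4", "5.6.7.8").forEach(System.out::println);
    }
}
```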

Full and Partial Ancestry HHH algorithms: addressing the coverage problem

In this section we describe the coverage problems of two existing HHH techniques, the Full Ancestry (FA) and Partial Ancestry (PA) algorithms [3], by providing an example scenario in which the algorithms fail to report some of the HHH. Next, we identify the cause of the problem, namely the use of an error value in the estimation of the frequency. Finally, we provide a solution that addresses the coverage problem. Both FA and PA maintain a data structure called a trie, and both algorithms divide …
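
The eviction rules and per-node error bookkeeping of FA and PA are beyond this excerpt; purely to fix ideas, the minimal counting trie below (our sketch, not the authors' data structure) shows the basic shape: one node per prefix, each counting the records beneath it.

```java
import java.util.HashMap;
import java.util.Map;

/** Minimal counting trie over dotted IPv4 prefixes. Every node counts
 *  the records that fall under its prefix; the root is the fully general
 *  prefix *.*.*.*. FA/PA's eviction and error fields are omitted. */
public class PrefixTrie {
    private final Map<String, PrefixTrie> children = new HashMap<>();
    private long count = 0;

    /** Insert one record, incrementing every prefix node on its path. */
    public void insert(String ip) {
        PrefixTrie node = this;
        node.count++;
        for (String b : ip.split("\\.")) {
            node = node.children.computeIfAbsent(b, key -> new PrefixTrie());
            node.count++;
        }
    }

    /** Count for a prefix such as "1.2" ("" queries the root). */
    public long count(String prefix) {
        PrefixTrie node = this;
        if (!prefix.isEmpty()) {
            for (String b : prefix.split("\\.")) {
                node = node.children.get(b);
                if (node == null) return 0;
            }
        }
        return node.count;
    }
}
```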

Proposed algorithms

In the previous section we addressed the coverage issue of the FA and PA HHH algorithms. Although FA and PA can now give accurate results using the proposed coverage solution, for many practical streaming applications their theoretical space and update costs remain high. To improve on these requirements, in this section we propose a new algorithm (HHH-MG) for computing HHH with better theoretical bounds on space and time complexity.
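
For reference, here is a minimal sketch of the unmodified Misra–Gries summary on which HHH-MG builds (the paper's hierarchical modifications are not reproduced here). With k = ⌈1/ϵ⌉ it keeps at most k − 1 counters, and each tracked estimate undercounts the true frequency by at most N/k ≤ ϵN:

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

/** Classic Misra-Gries frequent-items summary. With k = ceil(1/epsilon)
 *  it stores at most k - 1 counters, and every tracked count undercounts
 *  the true frequency by at most N/k <= epsilon * N. */
public class MisraGries<T> {
    private final int k;
    private final Map<T, Integer> counters = new HashMap<>();

    public MisraGries(double epsilon) {
        this.k = (int) Math.ceil(1.0 / epsilon);
    }

    public void update(T item) {
        Integer c = counters.get(item);
        if (c != null) {
            counters.put(item, c + 1);           // already tracked
        } else if (counters.size() < k - 1) {
            counters.put(item, 1);               // a counter is free
        } else {
            // No free counter: decrement all and evict the zeros. The new
            // item's single occurrence is dropped along with the decrements.
            Iterator<Map.Entry<T, Integer>> it = counters.entrySet().iterator();
            while (it.hasNext()) {
                Map.Entry<T, Integer> e = it.next();
                if (e.getValue() == 1) it.remove();
                else e.setValue(e.getValue() - 1);
            }
        }
    }

    /** Lower bound on the true frequency of item (0 if untracked). */
    public int estimate(T item) {
        return counters.getOrDefault(item, 0);
    }
}
```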

Implementation and evaluation

We have compared the HHH-MG algorithm with the existing FA (termed Full) and PA (termed Partial) algorithms on both one- and two-dimensional data over a range of parameters (ϵ, ϕ, and stream length N). We have implemented all algorithms in the Java programming language. The data structures used are based on hashing techniques, which require a single hash lookup to find a particular record in the data structure.
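
As a toy usage example of the MisraGries sketch above (the stream values are invented), each update costs a single hash lookup except when the decrement step fires:

```java
/** Toy driver for the MisraGries sketch above; values are invented. */
public class Demo {
    public static void main(String[] args) {
        MisraGries<String> mg = new MisraGries<>(0.01);   // epsilon = 0.01
        String[] stream = {"1.2.3.4", "1.2.3.4", "5.6.7.8", "1.2.3.4"};
        for (String ip : stream) mg.update(ip);           // one lookup each
        // True count is 3; the estimate undercounts by at most epsilon * N.
        System.out.println(mg.estimate("1.2.3.4"));
    }
}
```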

Datasets: In our experiments we have used real Internet traffic traces from an international Internet Service Provider, as well as synthetic data derived from an open source generator.

Related work

Finding hierarchical heavy hitters has seen a great deal of interest in the literature. Zhang et al. [12] and, more recently, Duffield et al. [13], [14] have introduced techniques for computing HHH that maintain an m-bit trie, an extension of the multi-bit trie data structure. Each node of the trie has at most 2^m children, and for each incoming record a search (e.g., a static IP lookup) finds the best matching node by traversing the trie from the root.

Conclusion

Constructing a multidimensional hierarchical summary is an important tool for emerging stream processing applications, where data is modeled as transient data streams. Applications such as network monitoring, financial applications, telecommunications data management, sensor networks, and web applications use such summaries for decision making and strategic planning. In this paper, we have proposed a new streaming ϵ-approximation algorithm (HHH-MG) for computing HHH based on a modified Misra–Gries heavy hitter algorithm.

References (25)

  • J. Misra et al., Finding repeated elements, Sci. Comput. Program. (1982)
  • Y. Zhang et al., Online identification of hierarchical heavy hitters: algorithms, evaluation, and applications
  • G. Cormode et al., Finding hierarchical heavy hitters in data streams
  • C. Estan et al., New directions in traffic measurement and accounting, SIGCOMM Comput. Commun. Rev. (2002)
  • M. Charikar et al., Finding frequent items in data streams
  • G. Cormode et al., Finding hierarchical heavy hitters in streaming data, ACM Trans. Knowl. Discov. Data (2008)
  • R. Berinde et al., Space-optimal heavy hitters with strong error bounds, ACM Trans. Database Syst. (2010)
  • A. Metwally et al., Efficient computation of frequent and top-k elements in data streams
  • J. Hershberger et al., Space complexity of hierarchical heavy hitters in multi-dimensional data streams
  • G. Cormode et al., Finding the frequent items in streams of data, Commun. ACM (2009)
  • J. Micheel, I. Graham, N. Brownlee, The Auckland data set: an access link observed, in: Proc. of Access Networks and...
  • P. Ganesan et al., Exploiting hierarchical domain structure to compute similarity, ACM Trans. Inf. Syst. (TOIS) (2003)
Zubair Shah received a B.S. degree in Computer Science from the University of Peshawar (Pakistan) in 2005, and an M.S. degree in Computer System Engineering from Politecnico di Milano (Italy) in 2012. He is currently working toward a Ph.D. degree at the School of Engineering and IT (SEIT), University of New South Wales (Canberra, Australia). He was a system engineer at NADRA (Pakistan) from 2006 to 2009, and a lecturer at City University of Science and Technology (Peshawar, Pakistan) from March 2013 to September 2014. His research interests include data mining and summarization techniques for scalable network traffic analysis, big data analytics, and machine learning techniques.

Abdun Naser Mahmood received the Ph.D. degree from the University of Melbourne, Australia, in 2008, and the M.Sc. degree in Computer Science and the B.Sc. degree in Applied Physics and Electronics from the University of Dhaka, Bangladesh, in 1999 and 1997, respectively. He has been working as an academic in Computer Science since 1999 and is currently in the School of Engineering and IT, University of New South Wales. Previously, he was a Lecturer (from 2000) and then an Assistant Professor (from 2003) at the University of Dhaka, and between 2008 and 2011 he was a Postdoctoral Research Fellow at the Royal Melbourne Institute of Technology. His research interests include data mining techniques for scalable network traffic analysis, anomaly detection, and industrial SCADA security. He has published his work in various IEEE Transactions and A-tier international journals and conferences.

Spike (Michael) Barlow is a computer scientist. He received his Ph.D. from UNSW Australia in 1991, after which he worked at the University of Queensland (Australia) and at Nippon Telegraph & Telephone's Human Communication Laboratories in Japan. In 1996 he returned to UNSW Canberra in a teaching and research role, where he still works today within the School of Engineering and Information Technology. His research interests straddle the areas of simulation, virtual environments, machine learning, serious games, and human–computer interaction. He has produced more than 100 papers in quality international journals and conferences.
