A grand spread estimator using a graphics processing unit

doi:10.1016/j.jpdc.2013.10.007

Journal of Parallel and Distributed Computing

Volume 74, Issue 2, February 2014, Pages 2039-2047


Highlights

•
Spread can now be estimated on a PC with a GPU at a throughput above 100 Gbps.
•
An optimized grand spread estimator (GSE) outperforms traditional SRAM-based estimators.
•
A novel collision-tolerant hash table (CTH) filters out duplicate packets to relieve the PCIe bottleneck.
•
Spread estimation can now be done on an inexpensive commodity PC.

Abstract

The spread of a source is defined as the number of distinct destinations to which the source has sent packets during a measurement period. Spread estimation is essential in traffic monitoring, measurement, and intrusion detection, among other applications. To support high-speed networking, recent research suggests implementing a spread estimator in fast but small on-chip memory such as SRAM. A state-of-the-art estimator can hold succinct information about 10 million distinct packets using 1 MB of SRAM. This implies that a measurement period must restart whenever 10 million distinct packets fill up the SRAM. Spread estimation is challenging because two spread values from different measurement periods cannot be aggregated to derive the total value. Therefore, current spread estimators are seriously limited in the length of the measurement period, because at most a few megabytes of SRAM are available. In this paper, we propose a spread estimator that utilizes the large memory of a graphics processing unit (GPU) on a commodity PC. The proposed estimator utilizes 1 GB of memory, a hundred times larger than that of current spread estimators, while its throughput remains around 160 Gbps. According to our experiments, the proposed scheme can cover a measurement period of a few dozen hours, whereas the current state of the art can cover only one hour. To the best of our knowledge, no spread estimator has achieved this thus far.

Introduction

The spread of a source is the number of distinct destinations contacted by the source during a measurement period [6]. Spread estimation approximately counts the number of distinct destinations per source; it is an important tool for traffic monitoring, measurement, intrusion detection, assessment of content popularity, and big data applications, among others [13], [6], [1], [15], [12], [10]. A spread estimator is a software/hardware module on a router (or firewall) that inspects arriving packets and estimates the spread of each source. In this paper, we study the problem of implementing a spread estimator on a commodity PC that utilizes 1 GB of memory while sustaining a throughput of around 160 Gbps. To the best of our knowledge, no spread estimator has achieved this before.

We define a contact as a source–destination pair in which the source sends a packet to the destination. The source or destination can be an IP address, a port number, or a combination of these together with other fields in the packet header. Multiple packets may carry the same contact, as is the case in real traffic traces. Thus, the spread of a source is the number of distinct contacts involving that source during a measurement period. A spread estimator must implement two functions: (1) it stores the contact information extracted from arriving packets during a measurement period, and (2) it estimates the spread of each source on the basis of the collected information [13]. The first function must process every arriving packet, whereas the second is performed infrequently in batch mode. Therefore, the efficient design of the first function is the major challenge in spread estimation, and it is the function we study in this paper.
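To make the two functions concrete, here is a minimal exact baseline in Python. It is only an illustration under our own assumptions: it keeps a set of destinations per source (much like the per-source list mentioned for Snort), which is not how CSE or GSE store contacts, and the class and method names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Set, Tuple

# A contact is a (source, destination) pair extracted from a packet header.
Contact = Tuple[Hashable, Hashable]


class ExactSpreadCounter:
    """Exact, non-compact baseline: keeps every distinct destination per source.

    Memory grows with the number of distinct contacts, which is what compact
    estimators are designed to avoid.
    """

    def __init__(self) -> None:
        self._dests: Dict[Hashable, Set[Hashable]] = defaultdict(set)

    def record(self, contacts: Iterable[Contact]) -> None:
        """Function (1): per-packet work; repeated contacts are absorbed by the set."""
        for src, dst in contacts:
            self._dests[src].add(dst)

    def spread(self, src: Hashable) -> int:
        """Function (2): run infrequently, e.g. at the end of a measurement period."""
        return len(self._dests.get(src, ()))


# Four packets, but source "a" has only two distinct contacts.
counter = ExactSpreadCounter()
counter.record([("a", "x"), ("a", "x"), ("a", "y"), ("b", "x")])
print(counter.spread("a"), counter.spread("b"), counter.spread("c"))  # 2 1 0
```

The split between a cheap per-packet `record` and an occasional `spread` query mirrors the two functions above; only the storage behind them would change in a compact estimator.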

Link speeds have increased considerably, from hundreds of megabits per second to multiple gigabits and even terabits per second [6]. In such an environment, a spread estimator needs to process arriving packets at line speed. To meet this speed requirement, recent research suggests implementing a spread estimator in fast but small on-chip memory such as SRAM [13], [6], [1], [15], [12]. As only a few megabytes of SRAM are available, the estimator can store only a summary of the packets, or more precisely, of the contacts, seen during a measurement period. If the number of contacts exceeds a certain threshold, the memory becomes full and the estimation accuracy may deteriorate to the point where the results become useless. Therefore, the maximum measurement period is restricted by the number of distinct contacts it contains [13], [15].

A challenging problem is that the spread of a source cannot be estimated accurately if the source has sent packets over multiple measurement periods. Spread values measured in different periods cannot be aggregated to obtain the total value, because some contacts may appear in more than one period and would then be counted more than once. To the best of our knowledge, no spread estimator solves this problem of short measurement periods in high-speed networking. For example, the compact spread estimator (CSE) is the best estimator proposed so far [13], but its estimation becomes inaccurate if the packets in a given period contain more than 10 million distinct contacts when the SRAM size is 1 MB. Suppose that 10 million distinct contacts are collected within five minutes on average and that a source sends packets over two hours. In that case, the CSE will fail to estimate the spread of this source. Moreover, attackers may exploit this limitation of current spread estimators [6]. Suppose that an intrusion detection system raises an alarm when the spread of any source exceeds a threshold. Then, attackers can evade detection by sending attack packets to distinct destinations slowly and steadily over a long period, say, a few hours or even days.
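A small example makes the aggregation problem concrete. The two hypothetical traffic patterns below yield identical per-period spreads, yet different spreads over the combined period, so no rule that combines only the per-period numbers can be correct in general. The destination names are made-up placeholders.

```python
from typing import Hashable, List, Tuple

Contact = Tuple[Hashable, Hashable]


def spread(contacts: List[Contact], src: Hashable) -> int:
    """Exact spread of `src`: the number of distinct destinations it contacted."""
    return len({dst for s, dst in contacts if s == src})


# Hypothetical contacts of one source in two consecutive measurement periods.
period1 = [("s", "d1"), ("s", "d2"), ("s", "d3")]
period2 = [("s", "d2"), ("s", "d3"), ("s", "d4")]      # overlaps with period1
period2_alt = [("s", "d4"), ("s", "d5"), ("s", "d6")]  # no overlap

# Both pairs of periods report per-period spreads (3, 3) ...
print(spread(period1, "s"), spread(period2, "s"), spread(period2_alt, "s"))  # 3 3 3
# ... yet the spreads over the combined period differ, so summing (or any other
# function of the per-period values alone) cannot recover the total.
print(spread(period1 + period2, "s"), spread(period1 + period2_alt, "s"))  # 4 6
```

Summing would report 6 in both cases, which is correct only when the periods share no destinations.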

We observe that two requirements must be satisfied to extend the measurement period of spread estimation. First, a spread estimator needs a large memory space, at least hundreds of megabytes, to hold the summary. We call this the space requirement. Note that existing spread estimators use only a few megabytes of SRAM. Second, packets must be processed at high speed, which is why recent research relies on fast but small SRAM. In this paper, we set the target throughput to be greater than 100 Gbps, which we call the throughput requirement. Although this cannot support terabit-per-second links, we believe that these two requirements cover most edge links. Our target platform is a commodity PC, and recent software-based routers on commodity PCs achieve less than 100 Gbps throughput [2], so our target throughput is reasonable. To the best of our knowledge, no previous work has satisfied both the space and throughput requirements.
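The throughput requirement is stated in bits per second, and the per-packet time budget it implies depends on packet sizes, which this section does not specify. As a rough worst-case illustration only, assuming a link fully loaded with minimum-size 64-byte Ethernet frames plus 8 bytes of preamble/SFD and a 12-byte inter-frame gap, the arithmetic looks like this:

```python
# Assumed framing for minimum-size Ethernet frames (illustrative worst case,
# not the traffic mix used in the paper's experiments).
MIN_FRAME_BYTES = 64       # minimum Ethernet frame, including FCS
PREAMBLE_SFD_BYTES = 8     # preamble + start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12  # minimum inter-frame gap


def worst_case_packet_rate(link_bps: float) -> float:
    """Packets per second on a fully loaded link carrying only minimum-size frames."""
    bits_per_frame = 8 * (MIN_FRAME_BYTES + PREAMBLE_SFD_BYTES + INTERFRAME_GAP_BYTES)
    return link_bps / bits_per_frame


for gbps in (100, 160):
    pps = worst_case_packet_rate(gbps * 1e9)
    print(f"{gbps} Gbps: {pps / 1e6:.1f} Mpps, {1e9 / pps:.2f} ns per packet")
# 100 Gbps: 148.8 Mpps, 6.72 ns per packet
# 160 Gbps: 238.1 Mpps, 4.20 ns per packet
```

Real traces have a larger average packet size, so actual packet rates are lower; the point is only that each packet leaves a few nanoseconds of work at these speeds.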

In this paper, we propose a spread estimator that satisfies both the space and throughput requirements. We are particularly interested in storing contacts. Rather than designing a new estimator from scratch, we use the data structure of CSE from [13]. We first implement it in software on a commodity PC. With 1 GB of DRAM assigned, the software version achieves 60 Gbps throughput, far below the throughput requirement. Next, we implement CSE on a graphics processing unit (GPU). The summary of contacts is held in 1 GB of GPU memory, and the many cores of the GPU process packets in parallel, which raises the throughput to 100 Gbps. We further optimize the GPU configuration, such as the cache size, which raises the throughput to 150 Gbps. Finally, we devise a novel filter that runs in SRAM and prevents duplicate contacts from being processed by the GPU, which increases the throughput to 160 Gbps. According to our experiments, the proposed scheme extends the measurement period by at least a few dozen times, which means that the new spread estimator can cover a few dozen hours without interruption. In terms of network security, this means that we can measure the spread of a port scanner that operates slowly and steadily for more than a few dozen hours on 160 Gbps links without relying on other tricks such as sampling. No previous work has achieved this. The main contributions of this paper are summarized as follows:

•
For the first time, we fit a spread estimator in 1 GB of GPU memory while maintaining a throughput of up to 160 Gbps, which no existing spread estimator achieves.
•
We devise a new filtering module, the collision-tolerant hash table (CTH), that prevents duplicate contacts from entering the GPU. This relieves the main bottleneck at PCIe, a problem commonly observed in GPU-based networking applications. The filter can be used for other general purposes as well.
•
We implement the spread estimator on a commodity PC with a GPU. This is completely different from the traditional approach, in which small but fast SRAM is used. We do not require any special hardware; our scheme is software-based and runs on a PC that can be purchased for less than 2000 US dollars.
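This excerpt does not describe the internals of CTH, so the sketch below is only our own illustration of how a "collision-tolerant" duplicate filter could behave; the class name, the direct-mapped layout and the eviction policy are hypothetical and not taken from the paper. The assumed idea is that a collision may let a duplicate through (harmless if the downstream estimator only sets bits), but a new contact is never dropped, so the filter cuts traffic to the GPU without changing the result.

```python
from typing import Hashable, List, Optional, Tuple

Contact = Tuple[Hashable, Hashable]


class DirectMappedDuplicateFilter:
    """Hypothetical duplicate filter placed in front of the estimator.

    Each slot remembers the last contact hashed to it. A hash collision just
    overwrites the slot, so some duplicates may slip through; that is harmless
    when the downstream structure only sets bits. A new contact is never
    dropped, because the full contact is compared, not only its hash.
    """

    def __init__(self, num_slots: int) -> None:
        self._slots: List[Optional[Contact]] = [None] * num_slots

    def admit(self, contact: Contact) -> bool:
        """Return True if `contact` should be forwarded (e.g. over PCIe)."""
        i = hash(contact) % len(self._slots)
        if self._slots[i] == contact:
            return False          # recently seen: drop the duplicate here
        self._slots[i] = contact  # first sighting or evicted: remember and forward
        return True

    def filter(self, contacts: List[Contact]) -> List[Contact]:
        return [c for c in contacts if self.admit(c)]


stream = [("a", "x")] * 5 + [("a", "y")]
f = DirectMappedDuplicateFilter(num_slots=1024)
print(f.filter(stream))  # [('a', 'x'), ('a', 'y')]
```

Comparing full keys trades some memory per slot for the guarantee of no false drops; a hardware-friendly variant would store fixed-width keys instead of Python tuples.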

The rest of this paper is organized as follows. Section 2 discusses related work. The motivation of this work is given in Section 3. Section 4 describes our spread estimator. Section 5 shows the experimental evaluation, and Section 6 draws conclusions.

Section snippets

Related work

A simple method to implement a spread estimator is to maintain a list of destinations per source, an approach adopted by Snort [8]. However, this method does not scale to high-speed links. Recent research suggests that a spread estimator be implemented in fast but small on-chip memory to support today’s line speeds [13], [6], [1], [15]. Yoon et al. propose a compact spread estimator that provides good accuracy in a small memory. The accuracy and efficiency of this estimator are the

Motivation

In this section, we present the motivation for a grand spread estimator (GSE) that satisfies both the space and throughput requirements. As GSE is closely related to CSE [13], we first briefly review CSE and then discuss the motivation for GSE.
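As a reference point for the review of CSE, here is a simplified Python sketch of the virtual-bit-vector idea behind it, written from our understanding of [13] rather than from this paper. The assumed structure: one shared physical bit array of m bits; each source owns a virtual vector of s bits scattered pseudo-randomly over that array; a contact sets one bit of its source's virtual vector; and the spread is estimated as s(ln V_m − ln V_s), where V_m and V_s are the fractions of zero bits in the whole array and in the source's virtual vector. The SHA-256 hashing, class name and parameter values are illustrative assumptions.

```python
import hashlib
import math
from typing import Hashable


def _h(*parts: object) -> int:
    """Deterministic 64-bit hash of the given parts (illustrative choice)."""
    data = "|".join(map(str, parts)).encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


class VirtualBitVectorEstimator:
    """Sketch of a CSE-style estimator: one shared array of m bits, and a
    virtual vector of s bits per source mapped onto that array."""

    def __init__(self, m: int, s: int) -> None:
        self.m, self.s = m, s
        self.bits = bytearray(m)  # one byte per bit, for readability

    def _pos(self, src: Hashable, i: int) -> int:
        # Physical index of bit i of src's virtual vector.
        return _h("v", src, i) % self.m

    def record(self, src: Hashable, dst: Hashable) -> None:
        # Per-packet work: set the bit of src's virtual vector chosen by dst.
        # Setting a bit twice has no effect, so duplicate contacts are absorbed.
        i = _h("d", dst) % self.s
        self.bits[self._pos(src, i)] = 1

    def estimate(self, src: Hashable) -> float:
        # Batch work: V_m corrects for bits set by other sources that share
        # physical positions with src's virtual vector.
        v_m = max(self.bits.count(0), 1) / self.m
        zeros = sum(1 for i in range(self.s) if not self.bits[self._pos(src, i)])
        v_s = max(zeros, 1) / self.s  # avoid log(0) when the vector is full
        return self.s * (math.log(v_m) - math.log(v_s))


est = VirtualBitVectorEstimator(m=1 << 16, s=512)
for j in range(200):
    est.record("10.0.0.1", f"dst-{j}")
    est.record("10.0.0.1", f"dst-{j}")  # duplicate contact: no effect
print(round(est.estimate("10.0.0.1")))  # an estimate close to 200, not exact
```

Sharing one array among all sources is what lets such a structure stay compact; the cost is the noise that the V_m term compensates for on average.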

Design of grand spread estimator (GSE)

We present a new spread estimator, the grand spread estimator (GSE), which satisfies both the space and throughput requirements. We improve its throughput by optimizing the GPU configuration, and we devise a new filtering module to further accelerate processing.
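One reason bit-array estimators map well onto many GPU cores is that setting bits is commutative and idempotent: the final array does not depend on the order in which packets are processed or on how they are divided among threads. The Python sketch below simulates this with a strided split of a batch and a bitwise-OR merge; it is our illustration, not the GSE kernel. On a real GPU, threads updating shared 32-bit words would typically use an atomic OR such as CUDA's `atomicOr`; whether GSE does so is not stated in this excerpt. Function names and bit positions are hypothetical.

```python
from functools import reduce
from typing import List, Sequence


def apply_batch(num_words: int, positions: Sequence[int]) -> List[int]:
    """Set the bits at `positions` in a bit array stored as 32-bit words."""
    words = [0] * num_words
    for p in positions:
        # A GPU thread would need an atomic OR here, since words are shared.
        words[p // 32] |= 1 << (p % 32)
    return words


def apply_in_parallel(num_words: int, positions: Sequence[int], workers: int) -> List[int]:
    """Simulate `workers` threads, each handling a strided share of the batch,
    then merge their partial arrays with bitwise OR."""
    partials = [apply_batch(num_words, positions[w::workers]) for w in range(workers)]
    return [reduce(lambda a, b: a | b, col) for col in zip(*partials)]


batch = [(i * 7919) % 1024 for i in range(5000)]  # hypothetical bit positions
print(apply_in_parallel(32, batch, workers=8) == apply_batch(32, batch))  # True
```

Because duplicates and reordering leave the array unchanged, threads need no coordination beyond the atomic word update.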

Experiments

We evaluate GSE through experiments. We implement GSE on a commodity PC with a GPU, and we develop a software-based CSE and different versions of GSE for comparison. Real Internet traffic traces are used in our experiments, and the results show that GSE can accommodate 1 GB of memory while its throughput improves to up to 160 Gbps, even better than the best performance of CSE in Fig. 5.

Conclusion

We propose a spread estimator that provides good accuracy and a long estimation period using a large memory. The proposed scheme provides a throughput of up to 160 Gbps while being allocated as much as 1 GB of memory. No existing spread estimator achieves this combination of accuracy and throughput, since only a few megabytes of SRAM are available to them. This improvement comes from a new design and implementation of a spread estimator based on a modern GPU. According to our experiments, the

Acknowledgments

This research was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MEST) (NRF-2011-0016246). This research was supported by the MOTIE (The Ministry of Trade, Industry, and Energy), Korea, under the Global Collaborative R&D program supervised by the KIAT (M002300089).

Seon-Ho Shin received his B.S. and M.S. degrees from the Department of Computer Engineering, Kookmin University, Seoul, Korea, in 2011 and 2012, respectively. Currently, he is a Ph.D. student at the same university under the guidance of Dr. MyungKeun Yoon. His research areas include network algorithms and security.

References (15)

C. Estan et al., Bitmap algorithms for counting active flows on high-speed links, IEEE/ACM Trans. Netw. (2006).
S. Han, K. Jang, K. Park, S. Moon, PacketShader: a GPU-accelerated software router, in: Proc. of ACM SIGCOMM, August,...
...
...
D. Kirk et al., Programming Massively Parallel Processors (2010).
T. Li, S. Chen, W. Luo, M. Zhang, Scan detection in high-speed networks based on optimal dynamic bit sharing, in: Proc....
NVIDIA CUDA C Programming Guide Version 4.0, 6,...

There are more references available in the full text version of this article.

Cited by (22)

VATE: A trade-off between memory and preserving time for high accurate cardinality estimation under sliding time window
2019, Computer Communications
Host cardinality is one of the most important attributes in the field of network research. The cardinality estimation under sliding time window has become a research hotspot in recent years. This kind of algorithms preserve the time information of sliding time window by introducing more powerful counters. The more counters used in these algorithms, the higher the estimation accuracy of these algorithms will be. However, the available number of sliding counters is limited due to their large memory footprint or long state-preserving time. To solve this problem, a new sliding counter, asynchronous time stamp (AT), is designed in this paper. AT has the advantages of small memory consumption and low state-preserving time. It can directly replace the counter used in the existing algorithms. On the same platform, higher accuracy can be achieved by adopting more AT. Furthermore, this paper designs a new per host cardinality estimation algorithm, virtual AT estimator (VATE), based on AT. VATE is also a parallel algorithm that can be deployed on GPU. With the parallel processing capability of GPU, VATE can estimate cardinalities of hosts in a 40 Gb/s high-speed network in real time at the time granularity of 1 s. In our experiments, VATE increases the state-preserving speed by 4 to 400 times at the cost of 11.11% more memory compared with a state-of-the-art algorithm.
High accuracy network cardinalities estimation by step sampling revision on GPU
2020, Computers, Materials and Continua
Identification of Network-wide Super Nodes in High Speed Networks
2019, 2019 4th IEEE International Conference on Big Data Analytics, ICBDA 2019
SRLA: A Real Time Sliding Time Window Super Point Cardinality Estimation Algorithm for High Speed Network Based on GPU
2019, Proceedings - 20th International Conference on High Performance Computing and Communications, 16th International Conference on Smart City and 4th International Conference on Data Science and Systems, HPCC/SmartCity/DSS 2018
GPU based Real-time Super Hosts Detection at Distributed Edge Routers
2019, arXiv
A Super Point Detection Algorithm under Sliding Time Windows Based on Rough and Linear Estimators
2019, IEEE Access


Eun-Jin Im is a professor in the Department of Computer Engineering at Kookmin University, Korea. She received B.S. and M.S. degrees in computer engineering from Seoul National University, Korea, in 1991 and 1993, respectively. She received her Ph.D. degree in electrical engineering and computer science from the University of California at Berkeley in 2000. Her research interests include high performance computing, embedded systems, and performance tuning.

MyungKeun Yoon is an assistant professor in the Department of Computer Engineering at Kookmin University, Korea. He received the B.S. and M.S. degrees in computer science from Yonsei University, Korea, in 1996 and 1998, respectively. He received the Ph.D. degree in computer engineering from the University of Florida in 2008. He worked for the Korea Financial Telecommunications and Clearings Institute from 1998 to 2010. His research interests include computer and network security, network algorithms, cloud computing, and mobile networks.
