Elsevier

Computer Networks

Volume 212, 20 July 2022, 109006
Scheduling of multiple network packet processing applications using Pythia

https://doi.org/10.1016/j.comnet.2022.109006

Abstract

Modern commodity computing systems are composed of several heterogeneous processing units, each with its own performance and energy characteristics. However, the majority of current network packet processing frameworks target only a specific processing unit (either the CPU or an accelerator), leaving the remaining computational resources under-utilized or even idle.

In this paper, we propose an adaptive scheduling approach for network packet processing applications that supports any heterogeneous and asymmetric architecture found in a commodity high-end hardware setup. Our scheduler not only distributes the workloads to the appropriate devices in the system to achieve the desired performance results, but also enables the multiplexing of diverse network packet processing applications that execute concurrently, eliminating the interference effects introduced at runtime. The evaluation results show that our scheduler is able to tackle interference in the shared hardware resources as well as to respond quickly to dynamic fluctuations (e.g., application overloads, traffic bursts, and infrastructural changes) that may occur in real time.

Introduction

The advent of high-end commodity heterogeneous systems (i.e., systems that utilize multiple processing units, typically CPUs along with different types of GPUs) has motivated the networking community to exploit alternative architectures. In fact, many recent approaches utilize them to build high-performance, parallel network packet processing systems [1], [2], [3], as well as power-efficient systems [4]. Unfortunately, the majority of these approaches target a single computational device, such as a multi-core main processor or a powerful high-end GPU, leaving the remaining devices completely idle. Developing a network processing application framework that can utilize every available device effectively, efficiently and consistently, across a wide range of diverse workloads running concurrently, is highly challenging.

Heterogeneous systems that consist of multiple devices typically provide system designers with different optimization opportunities, which in turn introduce inherent constraints and trade-offs between energy consumption and other performance characteristics (in our case, forwarding throughput and latency). The challenge in fully utilizing a heterogeneous system is to map the requested computations to the processing devices that interfere the least, and to do so in the most automated way possible. Previous works focused on developing load-balancing frameworks that automatically partition the workload across the devices [5], [6]. These approaches either assume that all devices can provide equal performance [5] or perform a series of small processing trials to determine their relative performance [6]. The major disadvantage of these approaches is that they have been designed for solo applications, i.e., only a single application is executing at a time. However, this is not the case for networking middleboxes, whose complexity constantly increases, requiring more and more networking applications to execute at the same time. In addition, the majority of these approaches take as input a constant stream of data, a limiting factor that forces them to adapt poorly when the input stream rate varies. This makes them hard to apply to network environments, where traffic variability [7], [8] and overloads [9] can significantly affect the utilization and performance of network applications.
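
To make the trial-based idea concrete, the following minimal sketch splits a batch of packets across devices in proportion to the throughput measured in short trial runs. The function name, the device names and the throughput figures are hypothetical; the code illustrates the general technique only and is not the method of [5] or [6].

```python
def partition_batch(batch_size, trial_throughput):
    """Split batch_size packets in proportion to per-device trial throughput.

    trial_throughput maps a device name to the packets/sec measured in a
    short trial run (hypothetical values). The returned shares always sum
    to batch_size.
    """
    total = sum(trial_throughput.values())
    if total <= 0:
        raise ValueError("at least one device must have positive throughput")

    shares = {}
    assigned = 0
    for device, tput in trial_throughput.items():
        # Round down so that we never hand out more packets than we have.
        shares[device] = int(batch_size * tput / total)
        assigned += shares[device]

    # Give the rounding remainder to the fastest device.
    fastest = max(trial_throughput, key=trial_throughput.get)
    shares[fastest] += batch_size - assigned
    return shares


if __name__ == "__main__":
    print(partition_batch(1000, {"cpu": 1.0, "igpu": 1.0, "dgpu": 2.0}))
```

A static split of this kind is computed once for a single application and a steady input rate, which is exactly the limitation discussed above.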

In this paper, we describe Pythia [10], a scheduling approach for network packet processing applications that execute concurrently on a highly heterogeneous, commodity-based system. More specifically, our proposed scheduler is designed to explicitly address the heterogeneity introduced by (i) the underlying hardware architectures, (ii) the applications and (iii) the input network traffic rate. The scheduler responds dynamically to performance fluctuations that can occur at any time during runtime, such as traffic bursts, overloads and system changes. Finally, we note that our scheduler is device-agnostic; even though we use a fixed set of processors and co-processors in this paper (which are nonetheless representative of a typical heterogeneous production system), our system can similarly operate when any other processors or co-processors are present (e.g., FPGAs, NPUs, or DSPs).

The contributions of this work are the following:

  • We extensively characterize the performance of typical software network packet processing applications, as well as the interference effects when multiple instances are executed in parallel on a variety of heterogeneous, off-the-shelf hardware devices. We show that performance varies widely across diverse types of network processing applications. In several cases, a specific device is the best fit for one application type and, at the same time, the worst choice for another.

  • Motivated by the current gap in the state-of-the-art, we present a scheduling approach that, given a set of network packet processing applications, can effectively and efficiently utilize the most appropriate device or group of devices, based on the current system and network conditions, using a predefined policy that specifies the performance goal. The scheduler is able to dynamically respond to system and performance fluctuations and provide consistently good performance for concurrently running applications.

  • We propose optimization strategies to scale up the architectural design and sustain energy-efficient line traffic rates at low latency, even for highly intensive network applications or when network traffic characteristics exhibit large fluctuations.
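
As an illustration of the policy-driven selection described in the second contribution, the sketch below picks a device combination from a table of per-combination measurements. It is a simplified example under stated assumptions: the `Profile` fields, the policy names and all numbers are hypothetical, and the selection rule is not taken from Pythia itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    devices: tuple          # device combination, e.g. ("cpu", "igpu")
    throughput_gbps: float  # sustained throughput of this combination
    latency_ms: float
    power_w: float


# Hypothetical measurements for a single application.
PROFILES = [
    Profile(("cpu",), 4.0, 0.5, 60.0),
    Profile(("igpu",), 6.0, 1.5, 30.0),
    Profile(("dgpu",), 20.0, 3.0, 180.0),
    Profile(("cpu", "igpu"), 9.0, 1.6, 85.0),
]

# Each policy maps a profile to a cost; the lowest cost wins.
POLICIES = {
    "min_power": lambda p: p.power_w,
    "min_latency": lambda p: p.latency_ms,
    "max_throughput": lambda p: -p.throughput_gbps,
}


def select(profiles, input_rate_gbps, policy):
    """Return the best profile that sustains the input rate under the policy."""
    cost = POLICIES[policy]
    feasible = [p for p in profiles if p.throughput_gbps >= input_rate_gbps]
    if not feasible:
        # Nothing keeps up with the traffic: fall back to the fastest option.
        return max(profiles, key=lambda p: p.throughput_gbps)
    return min(feasible, key=cost)


if __name__ == "__main__":
    for rate in (3.0, 8.0, 50.0):
        print(rate, select(PROFILES, rate, "min_power").devices)
```

Expressing each policy as a cost function keeps the selection logic independent of the performance goal, so a new goal only needs a new entry in `POLICIES`.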

Section snippets

Background

Traditional commodity hardware setups offer a three-level heterogeneity: (i) the x86 CPU architecture, (ii) the integrated GPU that is packed on the same processor chip, and (iii) a discrete high-end GPU. All these different hardware architectures offer unique performance rankings and diverse energy characteristics. CPU cores perform overall better under branch-intensive workloads, while discrete GPUs perform efficiently in data-parallel tasks. An integrated GPU offers low power consumption

System setup

In this section we describe the hardware platform that we use in our measurements, as well as our power consumption profiling tool. In addition, we discuss the network packet processing applications that we use in this work.

Architecture

In this section we describe two different architectural models for our system. The main difference between the two models is the way each one handles the incoming network traffic and distributes it to the computational devices for further processing.

Implementation

To uniformly execute all the implemented packet processing applications (Section 3.4) across every device in our testbed machine, we take advantage of the OpenCL framework. Our system runs Linux 4.19.34-1-lts. We use the Intel OpenCL 2.1 SDK for the Intel devices (i.e. the i7-8700K CPU and the UHD Graphics 630 GPU) and the OpenCL from the NVIDIA CUDA Toolkit 10.1 for the GTX 1080 Ti GPU.

Each of the three applications is implemented as a unique kernel. In OpenCL, an instance of a kernel is

Real-time scheduling

Our scheduling system consists of two phases: offline analysis and online adaptive scheduling.
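
The interplay of the two phases can be sketched as follows. This is a minimal illustration, not Pythia's algorithm, whose details are not included in this excerpt: the class name, the tolerance parameter, the configuration names and the numbers are all assumptions. An offline table supplies the expected capacity and power of each configuration, and an online step corrects the table whenever the measured throughput falls short of the expectation.

```python
class AdaptiveScheduler:
    """Two-phase sketch: an offline table plus an online correction step."""

    def __init__(self, offline_table, tolerance=0.1):
        # offline_table: config name -> (throughput, power), hypothetical units.
        self.capacity = {name: t for name, (t, _) in offline_table.items()}
        self.power = {name: p for name, (_, p) in offline_table.items()}
        self.tolerance = tolerance
        self.current = None

    def _pick(self, input_rate):
        # Cheapest configuration (in power) believed to sustain the rate.
        feasible = [c for c, cap in self.capacity.items() if cap >= input_rate]
        if feasible:
            return min(feasible, key=self.power.get)
        return max(self.capacity, key=self.capacity.get)

    def step(self, input_rate, achieved=None):
        """Periodic call with the offered rate and the throughput achieved by
        the current configuration (None on the first call)."""
        if self.current is not None and achieved is not None:
            expected = min(input_rate, self.capacity[self.current])
            if achieved < (1 - self.tolerance) * expected:
                # Interference or overload: trust the measurement over the
                # offline profile and lower the believed capacity.
                self.capacity[self.current] = achieved
        self.current = self._pick(input_rate)
        return self.current


if __name__ == "__main__":
    table = {"igpu": (6.0, 30.0), "cpu+igpu": (9.0, 85.0), "dgpu": (20.0, 180.0)}
    sched = AdaptiveScheduler(table)
    print(sched.step(5.0))                # initial choice from the offline table
    print(sched.step(5.0, achieved=3.0))  # shortfall observed, so re-select
```

In this simplified form a lowered capacity is never raised again; a practical scheduler would also need a way to recover once the interference disappears.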

Evaluation

In this section, we evaluate the performance of our scheduling algorithm, using the packet processing applications described in Section 3.4. Specifically, due to space constraints, we evaluate our scheduler using two diverse applications, AES and DPI. In Figs. 5(a), 5(b), 5(c) and 5(d) we present the input rates, the achieved throughput, and the power consumption of the device combination selected by the scheduler under dynamic conditions, i.e., (i) fluctuating incoming network traffic rate and (ii) policy

Scaling up strategies

Even though our scheduler is able to adapt to dynamic fluctuations that may occur in real time (such as application overloads, traffic bursts, and infrastructural changes), the hardware setup that we used for the evaluation is not efficient at handling small-sized packets. To address these limitations, in this section we demonstrate further strategies that can scale up the architectural design that has been described in Section 4. These strategies are based on hardware features found in server

Related work

Recently, GPUs have become very popular due to the substantial performance boost that they provide to many individual network traffic inspection applications, including intrusion detection [2], [11], [36], [37], [38], [39], [40], cryptography [41], and IP routing [1]. In addition, several programmable network traffic processing frameworks have been proposed, such as Snap [29] and GASPP [13], [30], that manage to simplify the development of GPU-accelerated network traffic processing

Conclusions

In this paper we propose an adaptive and highly dynamic scheduling solution tailored explicitly for network packet processing applications. Our approach enables real-time application multiplexing across heterogeneous and asymmetric architectures that can be found on commodity, off-the-shelf hardware setups. In this work, we manage to improve the overall efficiency of the tested applications, since our scheduler is able to choose the configuration that results in the optimal performance each

References (53)

  • S. Han, K. Jang, K. Park, S. Moon, PacketShader: a GPU-accelerated software router, in: Proceedings of SIGCOMM,...
  • G. Vasiliadis et al., MIDeA: A multi-parallel intrusion detection architecture
  • L. Rizzo, netmap: A novel framework for fast packet I/O, in: Proceedings of the 2012 USENIX Annual Technical...
  • L. Niccolini, G. Iannaccone, S. Ratnasamy, J. Chandrashekar, L. Rizzo, Building a power-proportional software router,...
  • J. Kim et al., Achieving a single compute device image in OpenCL for multiple GPUs
  • M. Boyer, K. Skadron, S. Che, N. Jayasena, Load balancing in a changing world: dealing with heterogeneity and...
  • G. Maier, A. Feldmann, V. Paxson, M. Allman, On dominant characteristics of residential broadband internet traffic, in:...
  • T. Benson et al., Understanding data center traffic characteristics, SIGCOMM CCR (2010)
  • S.A. Crosby, D.S. Wallach, Denial of service via algorithmic complexity attacks, in: Proceedings of the 12th Conference...
  • G. Giakoumakis et al., Pythia: Scheduling of concurrent network packet processing applications on heterogeneous devices
  • G. Vasiliadis et al., Gnort: High performance network intrusion detection using graphics processors
  • M. Jamshed et al., Kargus: a highly-scalable software-based intrusion detection system
  • G. Vasiliadis, L. Koromilas, M. Polychronakis, S. Ioannidis, GASPP: A GPU-accelerated stateful packet processing...
  • OpenCL, Available:...
  • A.V. Aho et al., Efficient string matching: an aid to bibliographic search, Commun. ACM (1975)
  • The Snort IDS/IPS, Available:...
  • A. Anand, A. Gupta, A. Akella, S. Seshan, S. Shenker, Packet caches on routers: the implications of universal redundant...
  • B. Aggarwal, A. Akella, A. Anand, A. Balachandran, P. Chitnis, C. Muthukrishnan, R. Ramjee, G. Varghese, EndRE: an...
  • OpenSSL Project, Available:...
  • P. Kulkarni, F. Douglis, J. LaVoie, J.M. Tracey, Redundancy elimination within large collections of files, in:...
  • M. Dobrescu, N. Egi, K. Argyraki, B.-G. Chun, K. Fall, G. Iannaccone, A. Knies, M. Manesh, S. Ratnasamy, RouteBricks:...
  • E. Papadogiannaki et al., Efficient software packet processing on heterogeneous and asymmetric hardware architectures, IEEE/ACM Trans. Netw. (2017)
  • J. Shen, J. Fang, H. Sips, A.L. Varbanescu, Performance traps in OpenCL for CPUs, in: Proceedings of the 2013 21st...
  • Optimizing Applications for NUMA,...
  • White paper: NVIDIA Turing GPU Architecture, Available:...
  • Intel Data Direct I/O Technology (Intel DDIO): A Primer, Available:...
    Giannis Giakoumakis is a Cybersecurity Research and Development Engineer at ICS-FORTH. He received his Bachelor’s and Master’s degrees in computer science from the University of Crete.

    Mrs. Evangelia Papadogiannaki is currently a Graduate Research Fellow at the Distributed Computing Systems laboratory in the Institute of Computer Science of the Foundation for Research and Technology, Hellas. She is currently pursuing a Ph.D. at the Computer Science Department of the University of Crete. Eva holds B.Sc. and M.Sc. degrees from the same department.

    Dr. Giorgos Vasiliadis has been elected Assistant Professor at the Hellenic Mediterranean University and he is also a collaborating researcher at FORTH-ICS. Before that, he was a Scientist at Qatar Computing Research Institute (2016–2017), and a research intern at Symantec Research Labs, USA (2013). He received his B.S. (’06), M.Sc. (’08), and Ph.D. (’15) degrees in Computer Science from the University of Crete, Greece. He is the recipient of the Symantec Research Labs Graduate Fellowship and the Maria M. Manassaki Bequest Scholarship.

    Dr. Sotiris Ioannidis is an Associate Professor at the School of Electrical and Computer Engineering of the Technical University of Crete (TUC) and a collaborating researcher at FORTH-ICS. Before that, he was a Research Director at FORTH-ICS. He received a B.Sc. degree in Mathematics and an M.Sc. degree in Computer Science from the University of Crete in 1994 and 1996 respectively. In 1998 he received an M.Sc. degree in Computer Science from the University of Rochester and in 2005 he received his Ph.D. from the University of Pennsylvania.
