Analyzing performance and power efficiency of network processing over 10 GbE

https://doi.org/10.1016/j.jpdc.2012.02.016

Abstract

Ethernet continues to be the most widely used network architecture today, owing to its low cost and backward compatibility with the existing Ethernet infrastructure. Driven by the increasing networking demands of cloud workloads, network speeds are rapidly migrating from 1 Gbps to 10 Gbps and beyond. Ethernet’s ubiquity and its continuously increasing rates motivate us to fully understand high speed network processing performance and its power efficiency.

In this paper, we begin with a per-packet processing overhead breakdown on Intel Xeon servers with 10 GbE networking. We find that besides data copy, the driver and buffer release unexpectedly take 46% of the processing time for large I/O sizes and as much as 54% for small I/O sizes. To further understand these overheads, we manually instrument the 10 GbE NIC driver and the OS kernel along the packet processing path using hardware performance counters (PMU). Our fine-grained instrumentation pinpoints performance bottlenecks that have not been reported before. In addition to the detailed performance analysis, we also examine the power consumption of network processing over 10 GbE using a power analyzer. We then use an external Data Acquisition System (DAQ) to obtain a breakdown of power consumption for individual hardware components such as the CPU, memory and NIC, and make several interesting observations. Our detailed performance and power analysis guides the design of a more processing- and power-efficient server I/O architecture for high speed networks.

Highlights

• Our per-packet processing overhead breakdown points out three major overheads.
• Our fine-grained instrumentation in the OS locates bottlenecks and reveals several observations.
• We find that unlike 1 GbE NICs (∼1 W), 10 GbE NICs have almost 10 W idle power dissipation.
• Our measurement shows that network processing over 10 GbE has high power consumption.
• The power breakdown reveals that the major contributor is the CPU, followed by main memory.

Introduction

Ethernet continues to be the most widely used network architecture today, owing to its low cost and backward compatibility with the existing Ethernet infrastructure. It dominates in large-scale data centers and is even competing with specialized fabrics such as InfiniBand [13], Quadrics [25], Myrinet [5] and Fiber Channel [7] in high performance computers. As of 2011, Gigabit Ethernet-based clusters make up 44.2% of the top-500 supercomputers [30]. Driven by the increasing networking demands of cloud workloads such as video servers, Internet search and web hosting, network speeds are rapidly migrating from 1 Gbps to 10 Gbps and beyond [9]. Ethernet’s ubiquity and its continuously increasing rates require us to clearly understand the performance and power efficiency of network processing (i.e., TCP/IP packet processing) over high speed networks.

It is well recognized that network processing consumes a significant amount of time in network servers, particularly in high speed networks [4], [17], [18], [19], [22], [28], [33]. It has been reported that network processing on the receiving side over 10 Gbps Ethernet (10 GbE) easily saturates two cores of an Intel Xeon Quad-Core processor [18]. Assuming ideal scalability over multiple cores, network processing over the upcoming 40 GbE and 100 GbE will saturate 8 and 20 cores, respectively. Over the past decade, a wide spectrum of research has been done on this topic to understand its overheads [17], [18], [19], [22], [23], [28], [33]. Nahum et al. [23] used a cache simulator to study the cache behavior of the TCP/IP protocol and showed that the instruction cache has the greatest effect on network performance. Zhao et al. [33] revealed that packets and DMA descriptors exhibit no temporal locality. Makineni and Iyer [22] conducted an architectural characterization of TCP/IP processing on the Pentium M with 1 GbE. However, these studies were built on cache simulators or used low speed networks, and did not conduct a system-wide architectural analysis of high speed network processing on mainstream server platforms.

Besides the lack of detailed performance analysis, high speed network processing has not been investigated carefully from the power perspective. As power and energy management in data centers with thousands of interconnected servers has attracted great interest in recent years [1], [11], [27], it also becomes critical to understand the power efficiency of network processing over high speed networks like 10 GbE on mainstream platforms.

In this paper, we begin with a per-packet processing overhead breakdown obtained by running a network benchmark over 10 GbE on Intel Xeon Quad-Core processor based servers. We find that besides data copy, the driver and buffer release unexpectedly take 46% of the processing time for large I/O sizes and as much as 54% for small I/O sizes. To understand these overheads, we manually instrument the driver and the OS kernel along the packet processing path using hardware performance counters (PMU) [14]. Unlike existing profiling tools, which attribute CPU costs such as retired cycles or cache misses to functions [24], our instrumentation works at a finer granularity and can pinpoint the data that incurs the cost. Through these studies, we obtain several new findings: (1) the major network processing bottlenecks lie in the driver (>26%), data copy (up to 34% depending on I/O size) and buffer release (>20%), rather than in the TCP/IP protocol itself; (2) in contrast to the generally accepted notion that long-latency Network Interface Card (NIC) register accesses cause the driver overhead [3], [4], our results show that the overhead comes from memory stalls on network buffer data structures; (3) releasing network buffers in the OS results in memory stalls on in-kernel page data structures, contributing to the buffer release overhead; (4) besides memory stalls on packets, data copy, implemented as a series of load/store instructions, also spends significant time on L1 cache misses and instruction execution. Moreover, keeping packets in the caches after the copy, where they will not be reused, pollutes the caches. Prevailing platform optimizations for data copy, such as Direct Cache Access (DCA), are insufficient to address the copy issue.
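
The instrumentation itself is embedded in the driver and kernel sources and is not reproduced here. To illustrate the bracketing pattern it relies on, the following C sketch shows a rough user-space analogue (not the actual tool): a time-stamp counter is read before and after a processing stage and the elapsed cycles are accumulated per stage, with a memcpy standing in for the copy stage. All names such as stage_cycles are illustrative assumptions.

/* Sketch: per-stage cycle accounting in the style of the paper's
 * driver/kernel instrumentation.  User-space analogue only; it
 * brackets a stage (here, a data copy standing in for the
 * copy-to-user stage) with time-stamp counter reads. */
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <x86intrin.h>          /* __rdtsc() */

enum stage { STAGE_DRIVER, STAGE_PROTOCOL, STAGE_COPY, STAGE_FREE, NSTAGES };
static uint64_t stage_cycles[NSTAGES];   /* accumulated cycles per stage */
static uint64_t stage_calls[NSTAGES];    /* number of bracketed calls    */

static inline uint64_t stage_begin(void) { return __rdtsc(); }

static inline void stage_end(enum stage s, uint64_t start)
{
    stage_cycles[s] += __rdtsc() - start;
    stage_calls[s]++;
}

int main(void)
{
    enum { PKT = 1500, NPKTS = 100000 };
    char *src = malloc(PKT), *dst = malloc(PKT);
    memset(src, 0xab, PKT);

    for (int i = 0; i < NPKTS; i++) {
        /* In the paper the bracketed region is a driver or kernel
         * function (e.g. buffer allocation, protocol processing,
         * copy, buffer release); here a memcpy stands in for the
         * copy stage only. */
        uint64_t t = stage_begin();
        memcpy(dst, src, PKT);
        stage_end(STAGE_COPY, t);
    }

    printf("copy: %.1f cycles/packet over %llu packets\n",
           (double)stage_cycles[STAGE_COPY] / stage_calls[STAGE_COPY],
           (unsigned long long)stage_calls[STAGE_COPY]);
    free(src);
    free(dst);
    return 0;
}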

In addition to the above anatomized performance analysis, we also examine the power consumption of network processing over 10 GbE across a range of I/O sizes on Intel Xeon platforms using a power analyzer [26]. To gain further insight into the power consumption, we set up our experimental environment with an external Data Acquisition System (DAQ) [8] and obtain the power consumption of individual hardware components (e.g. CPU, main memory, NIC). We find that unlike 1 GbE NICs, which have a typical power consumption of about 1 W, 10 GbE NICs have almost 10 W idle power dissipation. Our measurements show that network processing over 10 GbE has significant dynamic power consumption: up to 23 W and 25 W are dissipated in the receiving and transmitting processes, respectively, without any computation from the application. The power breakdown demonstrates that the CPU is the largest contributor to the power consumption and that its power consumption decreases as the I/O size increases. Memory is the second largest contributor, but its power consumption grows as the I/O size increases. Compared to the CPU and memory, the NIC has small dynamic power consumption. All of this points out that: (1) improving the CPU efficiency of network processing has the highest priority, particularly for small I/O sizes; (2) a rate-adaptive energy management scheme is needed for modern high speed NICs.
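
The component-level numbers come from DAQ samples taken on the individual supply rails. As a minimal sketch of the arithmetic involved, assuming the common setup in which the DAQ records the voltage drop across a small sense (shunt) resistor inserted in a rail, the C example below converts samples into average idle, loaded and dynamic power. The resistor value and the sample values are hypothetical, not measurements from the paper.

/* Sketch: turning DAQ samples into component power numbers.
 * Per-sample power is the rail voltage times the current inferred
 * from the shunt drop: P = V_rail * (V_shunt / R_shunt).
 * Dynamic power is the loaded average minus the idle average. */
#include <stddef.h>
#include <stdio.h>

static double avg_power(const double *v_shunt, size_t n,
                        double v_rail, double r_shunt)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += v_rail * (v_shunt[i] / r_shunt);
    return sum / (double)n;
}

int main(void)
{
    /* Hypothetical shunt-drop samples (volts), not measured data. */
    double idle[]   = { 0.0082, 0.0081, 0.0083, 0.0082 };
    double loaded[] = { 0.0105, 0.0104, 0.0106, 0.0105 };
    const double r_shunt = 0.010;  /* 10 mOhm sense resistor (assumed)      */
    const double v_rail  = 12.0;   /* component fed from 12 V rail (assumed) */

    size_t n_idle   = sizeof(idle) / sizeof(idle[0]);
    size_t n_loaded = sizeof(loaded) / sizeof(loaded[0]);

    double p_idle   = avg_power(idle, n_idle, v_rail, r_shunt);
    double p_loaded = avg_power(loaded, n_loaded, v_rail, r_shunt);

    printf("idle: %.2f W, loaded: %.2f W, dynamic: %.2f W\n",
           p_idle, p_loaded, p_loaded - p_idle);
    return 0;
}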

The remainder of this paper is organized as follows. We revisit network processing in Section 2 and then present a detailed processing performance overhead analysis over 10 GbE in Section 3, followed by detailed power studies in Section 4. Finally, we cover related work and conclude the paper in Sections 5 and 6, respectively.

Section snippets

Network processing

Unlike traditional CPU-intensive applications, network processing is I/O-intensive. It involves several platform components (e.g. NIC, PCI-E, I/O controller or bridge, memory, CPU) and system components (e.g. NIC driver, OS). In this section, we explain how network processing works on mainstream platforms. A high-level overview of the network receiving process is illustrated in Fig. 1. On the receiving side, packet processing begins when the NIC hardware receives an Ethernet frame from the…
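
Fig. 1 and the rest of this section walk through that path step by step. As a rough, self-contained sketch of the flow the figure depicts (an illustrative simulation, not kernel or driver code from the paper), the C example below models the main stages: the NIC DMAs a frame into a pre-posted ring buffer and raises an interrupt, the driver's poll routine drains the ring, the payload is copied toward the application, and the kernel buffer is released.

/* Sketch of the receive path of Fig. 1, modeled as plain C so the
 * control flow is explicit.  Comments name the real mechanisms
 * (DMA, interrupt, NAPI-style poll, data copy, buffer release);
 * the code itself is an illustrative simulation. */
#include <stdio.h>
#include <string.h>

#define RING_SIZE 8
#define PKT_LEN   1500

struct rx_desc {            /* simplified receive descriptor/buffer pair */
    char data[PKT_LEN];     /* NIC DMAs the frame here                   */
    int  ready;             /* set by "hardware" when the frame lands    */
};

static struct rx_desc rx_ring[RING_SIZE];

/* Step 1: the NIC receives a frame, DMAs it into a pre-posted receive
 * buffer, and raises an interrupt (coalesced in practice). */
static void nic_dma_frame(int idx)
{
    memset(rx_ring[idx].data, 0xab, PKT_LEN);
    rx_ring[idx].ready = 1;
}

/* Steps 2-4: the interrupt handler schedules a poll; the poll routine
 * (the driver part measured in Section 3) walks the ring, builds a
 * packet buffer, and hands it to the protocol stack, which eventually
 * copies the payload into the application's buffer and releases the
 * kernel buffer (the "buffer release" overhead). */
static int poll_ring(char *user_buf)
{
    int handled = 0;
    for (int i = 0; i < RING_SIZE; i++) {
        if (!rx_ring[i].ready)
            continue;
        /* driver: map descriptor -> packet buffer (skb in Linux)    */
        /* stack: IP/TCP processing, enqueue on the socket           */
        memcpy(user_buf, rx_ring[i].data, PKT_LEN);  /* data copy    */
        rx_ring[i].ready = 0;                        /* buffer free  */
        handled++;
    }
    return handled;
}

int main(void)
{
    static char user_buf[PKT_LEN];
    nic_dma_frame(0);
    nic_dma_frame(1);
    printf("polled %d packets\n", poll_ring(user_buf));
    return 0;
}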

Understanding network processing overhead over 10 GbE

In this section, we analyze the prevalent I/O architecture and present the detailed CPU/NIC interaction in mainstream servers. Then we conduct extensive experiments to obtain a per-packet processing overhead breakdown on Intel Xeon servers over 10 GbE across a range of I/O sizes in Section 3.2. In Section 3.3, we manually instrument the 10 GbE NIC driver and OS kernel along the packet processing path to locate the processing bottlenecks. Since the processing on the receiving side is more complicated than…

Understanding power efficiency of network processing

The preceding section presented a detailed performance analysis of network processing on mainstream servers over 10 GbE. In this section, we extend our study to another important aspect: power consumption. In Section 4.1, we present our experimental methodology for measuring the power consumption of network processing. In Section 4.2, we show extensive power results.

Related work

It is well documented that Internet servers spend a significant portion of their time processing packets [4], [17], [18], [19], [22], [28], [33]. Over the past decade, a wide spectrum of research has been done on network processing to uncover its characteristics and to optimize its processing efficiency [2], [3], [4], [12], [17], [18], [19], [22], [23], [28], [32], [33]. Nahum et al. [23] used a cache simulator to study the cache behavior of the TCP/IP protocol and showed that the instruction cache has the…

Conclusion

As Ethernet becomes ubiquitous and its speed continues to grow rapidly, it becomes critical to carefully study high speed network processing on mainstream servers from two important perspectives: processing performance and power consumption. In this paper, we first studied the per-packet processing overhead on mainstream servers with 10 GbE and pinpointed three major performance overheads: data copy, the driver and buffer release. Then, we carefully instrumented the driver and OS…

References (33)

  • Dennis Abts, Mike Marty, Philip Wells, et al. Energy-proportional datacenter networks, in: International Symposium on...
  • Accelerating high-speed networking with Intel I/O Acceleration Technology....
  • N.L. Binkert, L.R. Hsu, A.G. Saidi, et al. Performance analysis of system overheads in TCP/IP workloads, PACT...
  • N.L. Binkert, A.G. Saidi, S.K. Reinhardt, Integrated network interfaces for high-bandwidth TCP/IP, in: ASPLOS...
  • N.J. Boden, D. Cohen, R.E. Felderman, et al. Myrinet: a gigabit-per-second local area network, IEEE Micro...
  • J. Bonwick, The slab allocator: an object-caching kernel memory allocator, in: USENIX Technical Conference,...
  • L. Cherkasova, V. Kotov, T. Rokicki, et al. Fiber channel fabrics: evaluation and design, in: 29th...
  • Data acquisition equipment, Fluke....
  • S. GadelRab, 10-gigabit Ethernet connectivity for computer servers, IEEE Micro (2007)
  • L. Grossman, Large receive offload implementation in Neterion 10 GbE Ethernet driver, in: Ottawa Linux Symposium,...
  • James Hamilton, Cost of power in large-scale data centers, in....
  • R. Huggahalli, R. Iyer, S. Tetrick, Direct cache access for high bandwidth network I/O, ISCA,...
  • InfiniBand Trade Association....
  • Inside Intel Core microarchitecture: setting new standards for energy-efficient performance....
  • Intel 82597....
  • Iperf....

    Guangdeng Liao received the B.E. and M.E. degrees in Computer Science and Engineering from Shanghai Jiaotong University, and the Ph.D. degree in Computer Science from the University of California, Riverside. He currently works as a Research Scientist, at Intel Labs. His research interests include I/O architecture, operating system, virtualization and data centers.

    Laxmi Bhuyan is a Professor of Computer Science and Engineering at the University of California, Riverside. His research interests include network processor architecture, Internet routers, and parallel and distributed processing. Bhuyan has a Ph.D. from Wayne State University. He is a Fellow of the IEEE, the ACM, and the American Association for the Advancement of Science.
