Survey on link layer congestion management of lossless switching fabric

https://doi.org/10.1016/j.csi.2017.11.002

Highlights

  • Summary of the standard congestion management mechanisms of current lossless switching fabrics such as enhanced Ethernet, InfiniBand and Fibre Channel, including their evolution history and enabling technologies.

  • A comparative discussion of the methods used to prevent packet loss at the link layer.

  • A list of the complementary congestion management mechanisms that address the problems accompanying the loss-prevention methods.

Abstract

With the I/O bottleneck mitigated by solid-state disks and in-memory computing, and with the continuing growth of network bandwidth, packet processing in the traditional TCP/IP stack has become a new bottleneck due to its large processing latency and CPU consumption. To reduce both, it has become popular to access the network interface cards directly via kernel bypass and the Remote Direct Memory Access (RDMA) technique. Accordingly, congestion management is expected to be deployed at the link layer. In addition, it is simpler and more efficient to deploy a unified link layer congestion management scheme that ensures low latency and zero packet loss for block-based storage traffic, InterProcess Communication traffic, and applications such as e-commerce platforms, in-memory key-value stores and web object caching in datacenters and clusters. In this paper, we review the existing standard link layer congestion management mechanisms of lossless switching fabrics, including enhanced Ethernet, InfiniBand and Fibre Channel. We focus on their evolution history and enabling technologies, and discuss current challenges and opportunities. The summary and comparison of these standard link layer congestion management mechanisms may serve as a foundation for future research in this area.

Introduction

Recently, the 40/100 Gbps Ethernet standard has been released [1] and the 400 Gbps Ethernet standard is under development [2]. At the same time, the classic disk I/O bottleneck has been greatly alleviated by solid-state disks [3] and in-memory computing [4]. With the increase in bandwidth and the improvement of disk I/O, the traditional problems of the TCP/IP stack, namely large CPU consumption and packet processing latency, become significant [5], [6], and cannot be tolerated by current latency-sensitive applications such as trading systems, e-commerce platforms, in-memory key-value stores and web object caching. To reduce the packet processing latency and CPU consumption, as well as to take advantage of emerging hardware such as multi-core processors and multi-queue network adapters [5], [7], it has become popular to access the Network Interface Cards (NICs) directly via kernel bypass and the Remote Direct Memory Access (RDMA) technique [8], [9]. Accordingly, congestion management is expected to be handled by the network adapter, i.e., deployed at the link layer. Note that RDMA is designed to work in a lossless environment, and thus requires lossless packet delivery at the link layer [10].

At the same time, different types of traffic are carried by different switching fabrics. Traditionally, block-based storage traffic, which requires zero packet loss, is often carried by Fibre Channel. InterProcess Communication (IPC) traffic, which requires extremely low latency, is typically carried over InfiniBand. Ethernet is used to carry TCP/IP-based communication traffic. As a result, each device may need several network adapters, and designing and managing these networks together becomes costly and complex [11]. It would be simpler and more efficient to deploy a single high-speed unified switching fabric that carries all of these traffic types simultaneously [12]. Correspondingly, a link layer congestion management scheme is indispensable to meet the low-latency and zero-packet-loss requirements.

In this paper, we review the existing link layer congestion management mechanisms of lossless switching fabrics. Specifically, we present the standard congestion management mechanisms of the current mainstream switching fabrics, including Ethernet, InfiniBand and Fibre Channel. We focus on their evolution history and enabling technologies, and discuss current challenges and opportunities. We do not cover the large body of prior work on congestion control, for either the Internet or data centers, because link layer congestion management differs from traditional transport layer congestion control in the following respects.

  • There is no per-packet acknowledgment at the link layer. Thus, the round trip time (RTT) is unknown and delay-based congestion control algorithms cannot be employed at the link layer. Moreover, a link layer congestion control algorithm is also hard to self-clock in the way TCP does.

  • The traffic is bursty; the injection rate can potentially be as high as the line rate of the NICs.

  • Switch buffers are shallow compared with router buffers.

  • The special requirement of zero packet loss makes loss-based congestion control algorithms unusable at the link layer.

Although existing congestion control algorithms have already been summarized in [13], [14], those surveys focus on the detailed mechanisms rather than on how packet loss is prevented. In contrast, we focus on congestion management as a whole, including both the mechanisms that guarantee zero packet loss and the congestion control algorithms that cooperate with them. We believe the summary and comparison of these standard link layer congestion management mechanisms of lossless switching fabrics serve as a foundation for future research in this area.

Section snippets

Evolution and enabling technologies

Historically, congestion control has been placed in TCP, following the end-to-end argument, and Ethernet originally ran without any congestion control at the link layer.

Since memory was expensive a decade ago, the Pause mechanism was developed by the IEEE 802.3x working group to prevent packets from being dropped due to transient congestion [15], so that memory-constrained switches could be built at a lower cost. In the Pause mechanism, a switch or receiving server notifies its previous hop to stop injecting
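
The following sketch (not taken from the paper) illustrates the Pause behavior described above: the receiving side monitors its ingress buffer and asks the previous hop to stop or resume transmission. The XOFF/XON threshold names and the frame-based counting are illustrative assumptions, not values from the standard text.

```python
# Minimal sketch of IEEE 802.3x-style Pause behavior: a receiver watches its
# ingress buffer and pauses/resumes the upstream hop. Thresholds are assumed.

class PauseReceiver:
    def __init__(self, xoff_threshold, xon_threshold):
        self.xoff = xoff_threshold   # occupancy (frames) that triggers a Pause
        self.xon = xon_threshold     # occupancy at which the sender is resumed
        self.occupancy = 0
        self.paused = False          # whether the upstream link is currently paused

    def enqueue(self, frames=1):
        """Frames arriving from the previous hop; never dropped (lossless)."""
        self.occupancy += frames
        if not self.paused and self.occupancy >= self.xoff:
            self.paused = True
            self.send_pause(pause_time=0xFFFF)   # ask upstream to stop injecting

    def dequeue(self, frames=1):
        """Frames drained toward the next hop or the local host."""
        self.occupancy = max(0, self.occupancy - frames)
        if self.paused and self.occupancy <= self.xon:
            self.paused = False
            self.send_pause(pause_time=0)        # zero pause time resumes the sender

    def send_pause(self, pause_time):
        # In real hardware this emits a MAC control frame to the previous hop;
        # here we only record the intent. The XOFF threshold must leave headroom
        # for frames already in flight, which is what keeps the link lossless.
        print(f"PAUSE frame sent, pause_time={pause_time:#06x}")
```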

Evolution and enabling technologies

Fibre Channel was developed in 1988 and standardized in 1994 by an ANSI-accredited standards committee, the International Committee for Information Technology Standards (INCITS) [36]. Fibre Channel was primarily used in supercomputers, but has become a common connection type for storage area networks (SANs) in enterprise storage.

At the link layer, Fibre Channel employs the following Credit-Based Flow Control (CBFC) mechanism [37]. As illustrated in Fig. 4, the sender maintains two variables:
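
The paragraph above is truncated in this preview; in common Fibre Channel terminology the transmitter-side state consists of the negotiated BB_Credit and a running BB_Credit_CNT, with R_RDY primitives returning credits. The sketch below assumes that reading and illustrates the principle only, not the normative algorithm.

```python
# Minimal sketch of Fibre Channel-style buffer-to-buffer credit flow control.
# Variable names (BB_Credit, BB_Credit_CNT, R_RDY) follow common FC usage and
# are assumed here, since the surveyed text is truncated at this point.

class CreditSender:
    def __init__(self, bb_credit):
        self.bb_credit = bb_credit    # credits granted at login = receiver buffers
        self.bb_credit_cnt = 0        # frames sent but not yet acknowledged by R_RDY

    def can_send(self):
        # A frame may be transmitted only while outstanding frames < granted
        # credits, so the receiver never has to drop for lack of buffer space.
        return self.bb_credit_cnt < self.bb_credit

    def send_frame(self):
        if not self.can_send():
            raise RuntimeError("no credit: transmitter must wait for R_RDY")
        self.bb_credit_cnt += 1

    def on_r_rdy(self):
        # The receiver returns one R_RDY per freed buffer, replenishing a credit.
        self.bb_credit_cnt = max(0, self.bb_credit_cnt - 1)

# Usage: with bb_credit = 2 the third frame must wait for a credit return.
tx = CreditSender(bb_credit=2)
tx.send_frame()
tx.send_frame()
assert not tx.can_send()
tx.on_r_rdy()
assert tx.can_send()
```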

Evolution and enabling technologies

Combining the research results of the Future I/O and Next Generation I/O projects, the IBTA produced the first version of the InfiniBand specification in 2000 [14], [39]. InfiniBand combines the bottom four layers of the OSI model, i.e., the physical, data link, network and transport layers, into a single architecture, and implements them in Channel Adapters.

To control the traffic between the Channel Adapters of nodes (either servers or switches), the buffer-to-buffer CBFC mechanism is
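
As a rough illustration of how InfiniBand's buffer-to-buffer CBFC differs from Fibre Channel's incremental credit returns, the sketch below uses an absolute credit limit advertised per virtual lane. The counter names (FCTBS, FCCL) follow the IBA specification, but the 64-byte block units and the modulo wraparound arithmetic are deliberately simplified, so this is an assumption-laden sketch rather than the specified procedure.

```python
# Simplified sketch of absolute-credit-limit flow control in the spirit of
# InfiniBand's per-virtual-lane CBFC. FCTBS/FCCL names follow the IBA spec;
# block units and 12-bit wraparound handling are omitted for clarity.

class VirtualLaneSender:
    def __init__(self):
        self.fctbs = 0   # Flow Control Total Blocks Sent on this virtual lane
        self.fccl = 0    # latest Flow Control Credit Limit advertised by the receiver

    def on_flow_control_packet(self, fccl):
        # The receiver periodically advertises an absolute limit (blocks received
        # so far plus the buffer blocks it can still accept), rather than
        # returning credits incrementally as Fibre Channel's R_RDY does.
        self.fccl = fccl

    def can_send(self, blocks):
        # Transmit only if total sent blocks would stay within the advertised limit.
        return self.fctbs + blocks <= self.fccl

    def send(self, blocks):
        if not self.can_send(blocks):
            raise RuntimeError("insufficient credit on this virtual lane")
        self.fctbs += blocks


class VirtualLaneReceiver:
    def __init__(self, free_blocks):
        self.blocks_received = 0
        self.free_blocks = free_blocks   # buffer space currently available

    def advertise(self):
        # Absolute credit limit: everything received so far plus remaining space.
        return self.blocks_received + self.free_blocks
```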

Summary and comparison

Nowadays, Ethernet, Fibre Channel and InfiniBand are, respectively, the dominating switching fabrics in TCP/IP networks, storage area networks and High Performance Computing clusters. They all employ link layer congestion management schemes to guarantee sophisticated quality-of-service requirements such as low latency and zero packet loss. Among them, enhanced Ethernet is currently the strongest candidate for the unified switching fabric.

In general, two mechanisms are developed to ensure

Conclusion

In this work, we present the existing mainstream link layer congestion management mechanisms of wired networks, including the PFC and QCN mechanisms of Ethernet, the CBFC mechanism of Fibre Channel and InfiniBand, and the CCA of InfiniBand, as well as the corresponding challenges and opportunities. We hope the summary and comparison of these link layer congestion management mechanisms serve as a foundation for future research in this area.

Acknowledgments

The authors gratefully acknowledge the anonymous reviewers for their constructive comments. This work is supported in part by the National Natural Science Foundation of China (NSFC) under Grant No. 61502539, the China Postdoctoral Science Foundation under Grant Nos. 2015M582344 and 2016T90761, and the Projects of Hunan Province Science and Technology Plan in China under Grant No. 2016JC2009.

References (48)

  • IEEE, 802.3ba 40Gb/s and 100Gb/s Ethernet Task Force, ...
  • IEEE, 400 Gb/s Ethernet Study Group, ...
  • H. Lim et al., SILT: a memory-efficient, high-performance key-value store, SOSP, 2011.
  • J. Ousterhout et al., The case for RAMClouds: scalable high-performance storage entirely in DRAM, SIGOPS Operating Systems Review, 2010.
  • H. Lim et al., MICA: a holistic approach to fast in-memory key-value storage, NSDI, 2014.
  • I. Marinos et al., Network stack specialization for performance, SIGCOMM, 2014.
  • T. Marian et al., NetSlices: scalable multi-core packet processing in user-space, ANCS, 2012.
  • A. Kalia et al., Using RDMA efficiently for key-value services, SIGCOMM, 2014.
  • A. Dragojević et al., FaRM: fast remote memory, 11th USENIX Symposium on Networked Systems Design and Implementation (NSDI 14), 2014.
  • Y. Zhu et al., Congestion control for large-scale RDMA deployments, SIGCOMM, 2015.
  • White paper, Unified fabric: Cisco's innovation for data center networks, 2009, ...
  • Data Center Bridging Task Group, ...
  • H. Mliki et al., A comprehensive survey on carrier Ethernet congestion management mechanism, J. Netw. Comput. Appl., 2015.
  • InfiniBand Trade Association, InfiniBand architecture specification: release 1.3, ...
  • R. Seifert et al., The Complete Guide to LAN Switching Technology, 2008.
  • F.D. Neeser, N.I. Chrysos, M. Gusat, R. Clauberg, C. Minkenberg, K.M. Valk, C. Basso, Occupancy sampling for terabit CEE switches resolving output and fabric-internal congestion, IEEE Hot Interconnects, 2012.
  • IEEE, IEEE 802.1Qbb: priority-based flow control, working draft, ...
  • IEEE, IEEE 802.1Qau: end-to-end congestion management, working draft, ...
  • Fulcrum, FocalPoint FM6000 series, product brief, ...
  • Juniper, QFX3500 switch, datasheet, ...
  • D. Zats et al., DeTail: reducing the flow completion time tail in datacenter networks, SIGCOMM, 2012.
  • D. Crisan et al., Got loss? Get zOVN!, SIGCOMM, 2013.
  • InfiniBand Trade Association, InfiniBand architecture specification, version 1.3, Annex A16: RoCE, 2010, ...
  • C. DeSanti et al., FCoE in perspective, International Conference on Advanced Infocomm Technology, 2008.