Adaptive load balancing based on accurate congestion feedback for asymmetric topologies
Introduction
Datacenter networks typically adopt multi-rooted topologies, such as fat-tree and leaf-spine, to provide high bisection bandwidth. The multiple paths in these topologies offer several alternative routes between any two end hosts attached to different switches. Balancing load across these paths to fully utilize network resources can improve throughput and reduce latency for datacenter applications. However, various uncertainties, such as traffic dynamics and topology asymmetries, pose great challenges for designing efficient load balancing schemes. Production datacenters exhibit dynamic traffic [1], carrying both bandwidth-sensitive applications (e.g. MapReduce) and flow-completion-time-sensitive applications (e.g. Memcached). Asymmetry is also common in datacenter networks [2], arising from rack additions, heterogeneous network devices, link cuts and switch malfunctions [3], [4]. An efficient load balancing mechanism must adapt to these uncertainties: it should accurately detect path conditions and distribute traffic across multiple paths accordingly.
However, Equal-Cost Multi-Path (ECMP) forwarding [5], the standard load balancing strategy in today's datacenter networks, performs poorly. It assigns each flow to a path permanently, according to a hash of certain tuples from the packet header. Because it accounts for neither path conditions nor flow size, it can waste over 50% of the bisection bandwidth [6].
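As an illustration of why ECMP is congestion-oblivious, the following sketch hashes the five-tuple to pin a flow to one of several equal-cost paths. The function name and the use of SHA-1 are our choices for clarity; real switches use fixed hardware hash functions.

```python
import hashlib

def ecmp_path(src_ip, dst_ip, src_port, dst_port, proto, num_paths):
    """Hash the five-tuple to pick one of num_paths equal-cost next hops.

    All packets of a flow share the same tuple, so the flow always takes
    the same path, regardless of path load or flow size.
    """
    key = f"{src_ip}|{dst_ip}|{src_port}|{dst_port}|{proto}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths
```

Because the mapping never changes for a live flow, two large flows that hash to the same path keep colliding even when other paths sit idle, which is exactly the waste the measurement in [6] quantifies.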
Therefore, prior solutions (e.g. CONGA [7], CLOVE-ECN [8], FlowBender [9], Hermes [2]) have gone to great lengths to improve performance, but they still have drawbacks. Distributed load balancing schemes residing in custom switches (e.g. CONGA, HULA [10], LetFlow [11]) achieve significant throughput and latency improvements, but are hard to deploy in general datacenter networks. Centralized solutions (e.g. Hedera [6]) schedule large flows globally by collecting network information in a controller, but their long scheduling intervals cannot keep up with the traffic volatility of datacenter networks and are harmful to small flows.
The last category of solutions (e.g. CLOVE-ECN, Hermes) is deployed at the network edge (e.g. the hypervisor) or at end hosts to remain practical. Some of these are congestion-aware, but they rely too heavily on coarse congestion feedback (e.g. ECN and coarse-grained RTT measurements). CLOVE-ECN learns path congestion from ECN signals and uses a weighted round-robin (WRR) algorithm to dynamically route flowlets [12] over multiple paths. Hermes likewise exploits ECN signals and coarse-grained RTT measurements to choose a flow's path at the host side. RTT measurements lump together the latencies of both directions along a network path; to use RTTs to capture forward-path congestion, prior mechanisms (e.g. Hermes, TIMELY [13]) place the pure ACK packets of the reverse path into a higher-priority queue. Even with excellent scheduling algorithms, inaccurate ECN signals and coarse-grained RTT measurements degrade the performance gains in asymmetric topologies. ECN-based congestion detection cannot accurately characterize the relative congestion of multiple paths in an asymmetric network, because its feedback is inherently oversimplified. Coarse-grained RTT measurement includes end-host network stack delay, so a sample is trustworthy only when a sufficiently small RTT is observed; it therefore cannot accurately represent the degree of path congestion. Furthermore, ECN is a passive and delayed mechanism for signaling the congestion level of multiple paths, so it can hardly support timely load balancing.
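To make the ECN-driven weighting concrete, here is a simplified sketch of how an edge-based WRR scheme might shift weight away from a path whose ACKs come back ECN-marked. The `decrease` fraction and the even redistribution rule are our assumptions for illustration, not CLOVE-ECN's published algorithm; note how a single-bit mark forces a fixed-size reaction regardless of how congested the path actually is.

```python
def update_weights(weights, path, ecn_marked, decrease=0.3):
    """Shift WRR weight away from a path whose ACKs carry ECN marks.

    weights: dict mapping path id -> WRR weight (sums to 1.0).
    On an ECN-marked ACK, a fixed fraction of the marked path's weight
    is redistributed evenly among the other paths.  The binary mark
    carries no notion of *how* congested the path is, so the step size
    cannot adapt to the degree of congestion.
    """
    if not ecn_marked or len(weights) < 2:
        return weights
    moved = weights[path] * decrease
    weights[path] -= moved
    share = moved / (len(weights) - 1)
    for p in weights:
        if p != path:
            weights[p] += share
    return weights
```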
In fact, end-to-end latency is an effective indicator of path congestion. Fortunately, with the rapid growth of cloud computing and network functions virtualization (NFV), advances in widely used NIC hardware and efficient packet I/O frameworks (e.g. DPDK [14]) have made it possible to measure end-to-end latency with microsecond accuracy. Latency-based implicit feedback is accurate enough to reveal path congestion [15]. DPDK now supports all major CPU architectures and NICs from multiple vendors (e.g. Intel, Emulex, Mellanox and Cisco). A tuned DPDK solution (e.g. TRex [16]) introduces only 5–10 µs of overhead [17]. With the help of DPDK, end-to-end latency can therefore be measured precisely enough to sense path conditions. Several latency-based congestion control protocols for datacenter networks have emerged (e.g. TIMELY, DX [15]), but latency-based implicit feedback has hardly been applied to load balancing schemes.
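To illustrate why microsecond timestamps suffice even without synchronized clocks, the sketch below tracks the growth of a path's one-way delay over its observed minimum; a constant clock offset between sender and receiver cancels out in the difference, leaving an estimate of forward-path queueing delay. This is a generic technique stated under our assumptions (negligible clock drift over the measurement window), not ALB's exact measurement pipeline.

```python
class PathLatencyMonitor:
    """Track relative one-way delay per path from packet timestamps.

    The sender stamps each packet with its TX time and the receiver
    records the RX time.  The raw difference (rx - tx) includes the
    unknown clock offset between the two hosts, but the *increase*
    over the minimum raw value seen on a path cancels that constant
    offset and tracks queueing delay along the forward path.
    """
    def __init__(self):
        self.base = {}   # path id -> minimum raw (rx - tx) seen so far

    def update(self, path, tx_ts_us, rx_ts_us):
        raw = rx_ts_us - tx_ts_us                      # offset + delay
        self.base[path] = min(self.base.get(path, raw), raw)
        return raw - self.base[path]                   # queueing estimate (us)
```

Because the estimate is a per-path scalar in microseconds rather than a single mark bit, paths in an asymmetric topology can be compared directly by their current queueing delay.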
Moreover, current load balancing solutions also introduce new sources of inaccurate congestion feedback into transport protocols. End hosts in present datacenters commonly run ECN-based transport protocols (e.g. DCTCP [18]), whose congestion control algorithms adjust a flow's rate (window) based on the congestion state of its current path. When a rerouting event happens, outdated ACKs from the previous path that carry no ECE mark may improperly increase the sending rate (window), while those carrying an ECE mark will mistakenly decrease it. This problem hinders link utilization especially under asymmetric topologies, because asymmetry makes network conditions diverge more easily across routing paths.
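The interaction can be made concrete with a DCTCP-style window update that simply ignores ACKs covering data sent before the last reroute. This is our simplified illustration of the idea behind ACK correction, not ALB's actual mechanism; the timestamps and the per-ACK additive increase are illustrative assumptions.

```python
def window_update(cwnd, alpha, ece, data_tx_time, reroute_time):
    """DCTCP-like per-ACK window adjustment that discards stale feedback.

    cwnd:  congestion window in segments.
    alpha: DCTCP's smoothed fraction of ECN-marked packets.
    An ACK covering data sent before the last reroute describes the
    *old* path's congestion, so it neither cuts nor grows the window.
    """
    if data_tx_time < reroute_time:
        return cwnd                                  # stale: no adjustment
    if ece:
        return max(1.0, cwnd * (1 - alpha / 2))      # DCTCP-style cut
    return cwnd + 1.0 / cwnd                         # additive increase
```

Without the staleness check, an ECE-marked ACK from a congested old path would cut the window even though the flow has already moved to an uncongested path, which is precisely the mis-adjustment described above.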
Based on the above observations, we find that inaccurate congestion feedback causes both inaccurate detection of path conditions and incorrect flow rate adjustment, which is bound to hurt load balancing performance (Sections 2.2 and 2.3). We therefore ask: can we design a congestion-aware load balancing scheme that achieves accurate congestion feedback while remaining practical? We present ALB, an adaptive load balancing solution implemented at end hosts, to answer this question. ALB employs accurate latency-based measurement to detect network path congestion, which enables it to reroute flows accurately, and it uses an ACK correction method to avoid blindly adjusting the flow rate at source hosts.
We make the following contributions in this paper:
- We analyze how inaccurate congestion feedback degrades load balancing performance under asymmetry.
- We present ALB, an adaptive load balancing mechanism based on accurate congestion feedback running at end hosts, which is resilient to asymmetry and readily deployable with commodity switches in large-scale datacenters.
- In large-scale simulations, we show that ALB achieves up to 13% and 48% better flow completion time than CONGA and CLOVE-ECN, respectively, under asymmetry. Under dynamic network changes, ALB improves the overall average FCT by 5–42% compared to CLOVE-ECN, and it maintains the best and most stable performance for small flows under highly bursty traffic. Compared with Hermes, ALB requires no complicated parameter settings and provides competitive performance.
Some preliminary results of this paper were published in the Proceedings of the IEEE/ACM International Symposium on Quality of Service (IWQoS, 2018) [19]. In this paper, we describe our motivation with more detailed theoretical and empirical analyses, improve the latency-based congestion detection mechanism (Section 3.2) and extend the evaluations for dynamic datacenter network changes (Section 4.2.2).
The rest of this paper is organized as follows. In the next section, we introduce the background and motivation for designing ALB. We detail the design of ALB in Section 3 and evaluate it against other solutions in Section 4. Finally, we briefly review related work in Section 5 and conclude in Section 6.
Background and motivation
In this section, we describe how network asymmetries and traffic dynamics pose challenges to load balancing, and how inaccurate congestion feedback exacerbates the resulting performance loss. These problems motivate the design of ALB.
Overview
We present ALB’s framework in Fig. 5. ALB contains two modules: MDCTCP and the ALB core. MDCTCP is an ECN-based transport protocol that we design by slightly modifying DCTCP. Three other functions, namely source routing, latency-based congestion detection and accurate flowlet switching, reside in the ALB core. The ALB core is implemented in software in the hypervisor vSwitch (e.g. Open vSwitch), which current multi-tenant datacenters commonly use to manage numerous virtual machines.
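As a sketch of the flowlet switching performed in the ALB core, the following gap-based table starts a new flowlet, and re-selects a path, whenever a flow's inter-packet gap exceeds a threshold; packets within a flowlet stay on one path, so no reordering is introduced. The 500 µs gap value and the `pick_path` hook (which would consult the latency-based congestion detector) are illustrative assumptions, not ALB's actual parameters.

```python
FLOWLET_GAP_US = 500   # assumed inter-packet gap threshold (microseconds)

class FlowletTable:
    """Gap-based flowlet detection and path assignment.

    A packet arriving more than FLOWLET_GAP_US after the flow's previous
    packet starts a new flowlet, which may safely be routed on a
    different (currently least-congested) path: the gap is large enough
    that earlier packets have drained, so no reordering occurs.
    """
    def __init__(self, pick_path):
        self.last_seen = {}        # flow id -> (last timestamp, path)
        self.pick_path = pick_path # callback choosing the best path

    def route(self, flow, now_us):
        entry = self.last_seen.get(flow)
        if entry is None or now_us - entry[0] > FLOWLET_GAP_US:
            path = self.pick_path()    # new flowlet: re-choose path
        else:
            path = entry[1]            # same flowlet: keep current path
        self.last_seen[flow] = (now_us, path)
        return path
```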
Evaluation
We evaluate ALB via the discrete-event network simulator NS3 [22]. Our evaluation seeks to answer the following questions:
How does each design component contribute to performance? (Section 4.1) ALB implements a novel latency-based congestion detection and an ACK correction method at end hosts. We evaluate the benefits of these two methods separately; results show that each contributes around 10% of the overall performance improvement under heavy loads.
How does ALB perform under
Related work
We briefly discuss related work that has informed and inspired our design.
Hedera [6], MicroTE [28] and FastPass [29] use a centralized scheduler to monitor global network state and schedule flows evenly across multiple paths. They cannot react in time to latency-sensitive application requests and have difficulty handling traffic volatility.
Presto [24], DRB [23] and Flowbender [9] are per-flowcell/packet/flow based, congestion-oblivious load balancing solutions. They cannot effectively
Conclusion
We propose ALB, an adaptive load balancing mechanism based on accurate congestion feedback running at end hosts with commodity switches, which is resilient to asymmetry. ALB leverages latency-based congestion detection to precisely route flowlets onto lightly loaded paths, and an ACK correction method to avoid inaccurate flow rate adjustment. We evaluate ALB through large-scale simulations. Our results show that compared to schemes which require custom switch hardware for implementation, ALB
Acknowledgments
This work is supported in part by NSFC No. 61772216, National Defense Preliminary Research Project (31511010202), the National High Technology Research and Development Program (863 Program) of China under Grant no. 2013AA013203; Hubei Province Technical Innovation Special Project (2017AAA129), Wuhan Application Basic Research Project (2017010201010103), Project of Shenzhen Technology Scheme (JCYJ20170307172248636), Fundamental Research Funds for the Central Universities. This work is also
References (29)

- et al., Network traffic characteristics of data centers in the wild, Proceedings of the ACM IMC, 2010.
- et al., Resilient datacenter load balancing in the wild, Proceedings of the ACM SIGCOMM, 2017.
- et al., Understanding network failures in data centers: measurement, analysis, and implications, Proceedings of the ACM SIGCOMM, 2011.
- C. Guo, L. Yuan, D. Xiang, Y. Dang, R. Huang, D. Maltz, Z. Liu, V. Wang, B. Pang, H. Chen, Z.-W. Lin, V. Kurien, ...
- Analysis of an equal-cost multi-path algorithm, RFC 2992, 2000.
- et al., Hedera: dynamic flow scheduling for data center networks, Proceedings of the USENIX NSDI, 2010.
- et al., CONGA: distributed congestion-aware load balancing for datacenters, Proceedings of the ACM SIGCOMM, 2014.
- et al., Clove: congestion-aware load balancing at the virtual edge, Proceedings of the ACM CoNEXT, 2017.
- et al., Flowbender: flow-level adaptive routing for improved latency and throughput in datacenter networks, Proceedings of the ACM CoNEXT, 2014.
- et al., HULA: scalable load balancing using programmable data planes, Proceedings of the ACM SOSR, 2016.
- Let it flow: resilient asymmetric load balancing with flowlet switching, Proceedings of the USENIX NSDI, 2017.
- Dynamic load balancing without packet reordering, ACM SIGCOMM Comput. Commun. Rev., 2007.
- TIMELY: RTT-based congestion control for the datacenter, Proceedings of the ACM SIGCOMM, 2015.
Qingyu Shi He received the BE degree in computer science and technology from the Huazhong University of Science and Technology (HUST), Wuhan, China, in 2014. He is currently a Ph.D. student majoring in Computer Architecture in Wuhan National Laboratory for Optoelectronics (WNLO). His current research interests include software-defined networking and load balancing for datacenter networks. He has a publication in international conference: IWQoS.
Fang Wang She received her BE degree and Master degree in computer science in 1994, 1997, and Ph.D. degree in computer architecture in 2001 from Huazhong University of Science and Technology (HUST), China. She is a professor of computer science and engineering at HUST. Her interests include distribute file systems, parallel I/O storage systems and graph processing systems. She has more than 50 publications in major journals and international conferences, including FGCS, ACM TACO, SCIENCE CHINA Information Sciences, Chinese Journal of Computers and HiPC, ICDCS, HPDC, ICPP.
Dan Feng She received the BE, ME, and Ph.D. degrees in Computer Science and Technology in 1991, 1994, and 1997, respectively, from Huazhong University of Science and Technology (HUST), China. She is a professor and vice dean of the School of Computer Science and Technology, HUST. Her research interests include computer architecture, massive storage systems, and parallel file systems. She has more than 100 publications in major journals and international conferences, including IEEE-TC, IEEE-TPDS, ACM-TOS, JCST, FAST, USENIX ATC, ICDCS, HPDC, SC, ICS, IPDPS, and ICPP. She serves on the program committees of multiple international conferences, including SC 2011, 2013 and MSST 2012. She is a member of IEEE and a member of ACM.
Weibin Xie He received the BE degree in Energy and Power Engineering from the China University of Mining and Technology (CUMT), Xuzhou, China, in 2011. He is currently a Ph.D. student majoring in Computer Architecture in HUST. His current research interests include Computer networks and protocols and distributed storage systems. He has several publications in major journals and international conferences, including CN and IWQoS.