Elsevier

Future Generation Computer Systems

Volume 127, February 2022, Pages 126-141
Load balancing with traffic isolation in data center networks

https://doi.org/10.1016/j.future.2021.09.002

Highlights

  • We conduct extensive simulation-based studies to show how flow collisions degrade the performance of both short and long flows.

  • We propose an isolation-based load balancing scheme, called ILB, which avoids flow collisions to provide low latency for short flows and sustained throughput for long flows simultaneously.

  • Using large-scale NS3 simulations, we show that ILB reduces the average and tail flow completion times of short flows by up to 30% and 70%, respectively, and increases the throughput of long flows by about 1.68x over state-of-the-art datacenter load balancing schemes.

Abstract

Current data center networks are typically built as multi-rooted trees (e.g., leaf–spine) that offer many parallel paths between any pair of hosts. Recent work has shown that effective load balancing can exploit these parallel paths to speed up data transfer. Nonetheless, existing load balancing designs are agnostic to the heterogeneity of datacenter traffic, in which a large number of delay-sensitive short flows mix with a handful of throughput-oriented long flows, and they reroute flows without regard to path conditions, which results in frequent flow collisions. Short flows suffer large queuing delays when they collide with long flows, while frequent collisions between long flows lead to low network utilization and serious throughput degradation. To address these inefficiencies, we propose an isolation-based load balancing scheme, ILB, which perceives flow collisions and dynamically isolates long flows from short flows. Specifically, when a long flow collides with short flows on the same path, it immediately switches to another path unused by short flows so that the short flows on the previous path can complete quickly. When the short flows disappear, the long flow quickly occupies all of its available paths to achieve high throughput. Moreover, ILB is deployed only on the leaf switches and requires no modification to end hosts or spine switches. NS3 simulations show that ILB reduces the average and tail flow completion times of short flows by up to 30% and 70%, respectively, and increases the throughput of long flows by about 1.68x over state-of-the-art datacenter load balancing schemes.

Introduction

With the rapid development of cloud computing and big data, modern data centers have become the cornerstone of computing infrastructure and host a great diversity of distributed applications, including web search, advertising, social collaboration, and recommendation systems [1], [2], [3]. These applications generate heterogeneous traffic (i.e., a mix of short and long flows) in the data center network (DCN). Most of these flows are short flows that require low latency to provide soft real-time performance to users, while the remaining ones are usually throughput-sensitive long flows that deliver large amounts of data [3], [4], [5], [6]. Both long and short flows should therefore be transmitted efficiently in the DCN, providing small, predictable latency for short flows and large, sustained throughput for long flows simultaneously [1], [3], [7], [8], [9], [10], [11], [12], [13].

Modern DCNs are typically organized in multi-rooted tree topologies such as leaf–spine, which provide multiple paths between any pair of hosts [14], [15], [16], [17], [18]. Recent work has shown that an effective load balancing scheme is a promising way to meet the above requirements. Existing load balancing schemes strive to spread data flows across all the parallel paths. Among them, Equal Cost MultiPath (ECMP) [19] uses a hash over packet header fields to assign flows to paths, and it has become the standard load balancing mechanism in production data centers due to its simplicity. However, it suffers from the well-known hash collision problem and cannot reroute traffic flexibly. To address these issues, researchers have proposed finer-grained mechanisms: per-packet and per-flowlet/flowcell solutions.
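The hash collision problem can be illustrated with a minimal sketch. It assumes a CRC32 hash over a textual 5-tuple as a stand-in for the switch hash; real switches use vendor-specific hash functions, and the flow tuples and the `ecmp_path` helper below are hypothetical.

```python
import zlib
from collections import Counter


def ecmp_path(five_tuple, num_paths):
    """Pick a path index by hashing the flow's 5-tuple (fixed for the flow's lifetime)."""
    key = "|".join(str(field) for field in five_tuple).encode()
    # CRC32 is only an illustrative stand-in for a switch's hash function.
    return zlib.crc32(key) % num_paths


# Hypothetical flows: (src_ip, dst_ip, src_port, dst_port, protocol)
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 80, 6) for i in range(8)]
num_paths = 4

# Every packet of a flow maps to the same path, whatever the load on that path.
assignment = {flow: ecmp_path(flow, num_paths) for flow in flows}
load = Counter(assignment.values())  # number of flows per path index
print(dict(load))
```

With eight flows and four paths, at least two flows must share a path, and an unlucky hash can skew the split further. Because the mapping is static, a long flow that lands on the same path as short flows stays there until it finishes.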

For example, LetFlow [15] and Presto [20] switch paths at the granularity of flowlets and flowcells, respectively. Although flows can use more parallel paths without causing serious packet reordering, both schemes are still not flexible enough when rerouting, which leads to link under-utilization. Random Packet Spraying (RPS) [21], DRILL [22] and Hermes [23] split and reroute traffic at the packet level, significantly improving link utilization in symmetric network topologies. Nonetheless, production DCNs face many uncertainties, such as dynamic traffic and link or switch failures [23], which inevitably turn a symmetric topology into an asymmetric one. Consequently, these packet-level schemes suffer from serious packet reordering, which leads to non-trivial degradation of network performance.
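Flowlet-level switching can be sketched as follows. This is a simplified illustration in the spirit of flowlet schemes such as LetFlow, not the published implementation: the `FlowletTable` class, the timeout value, and the uniform random path choice are assumptions made for the example.

```python
import random

FLOWLET_TIMEOUT = 0.0005  # seconds; hypothetical inactivity gap


class FlowletTable:
    """Maps each flow to a path; the path may change only after an idle gap."""

    def __init__(self, num_paths, timeout=FLOWLET_TIMEOUT, rng=None):
        self.num_paths = num_paths
        self.timeout = timeout
        self.rng = rng or random.Random()
        self.entries = {}  # flow_id -> (path, time of last packet)

    def route(self, flow_id, now):
        entry = self.entries.get(flow_id)
        if entry is None or now - entry[1] > self.timeout:
            # First packet, or the gap exceeded the timeout: start a new flowlet.
            path = self.rng.randrange(self.num_paths)
        else:
            # Same flowlet: keep the path so packets are not reordered.
            path = entry[0]
        self.entries[flow_id] = (path, now)
        return path


table = FlowletTable(num_paths=4, timeout=0.5, rng=random.Random(1))
first = table.route("flow-a", now=0.0)
second = table.route("flow-a", now=0.1)   # within the gap: same path as first
third = table.route("flow-a", now=1.0)    # gap exceeded: path is re-drawn
```

A flow that sends continuously never produces an idle gap, so it is never rerouted. This is one way to read the inflexibility noted above.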

Moreover, none of the above solutions is aware that abundant delay-sensitive short flows and a few throughput-oriented long flows are mixed and transmitted over the same paths. They reroute these heterogeneous flows without regard to path conditions, which leads to frequent flow collisions. As a result, both short and long flows suffer from large queuing delay, packet reordering, and low link utilization, which severely damage network performance.

In this paper, we propose a load balancing scheme, ILB, to address the above inefficiencies. ILB perceives flow collisions and dynamically assigns paths to long flows to avoid collisions with short flows. When short flows emerge, the long flows immediately change their transmission paths to free up valuable bandwidth for the short flows. When the short flows disappear, the long flows quickly occupy all the available paths to achieve high link utilization. In this way, ILB greatly reduces the queuing delay of short flows while achieving high throughput for long flows.
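The isolation idea can be sketched as a toy model of leaf-switch state. This is not the authors' algorithm, whose details are not given in this excerpt: the `IsolatingBalancer` class, the per-path sets of short flows, and the fallback to the least-crowded path are all assumptions made for illustration.

```python
class IsolatingBalancer:
    """Toy leaf-switch state: which short flows are active on each uplink path."""

    def __init__(self, num_paths):
        self.num_paths = num_paths
        self.short_flows = [set() for _ in range(num_paths)]

    def add_short(self, flow_id, path):
        self.short_flows[path].add(flow_id)

    def remove_short(self, flow_id):
        for flows in self.short_flows:
            flows.discard(flow_id)

    def paths_for_long(self):
        """Paths a long flow may use: those currently unused by short flows."""
        free = [p for p in range(self.num_paths) if not self.short_flows[p]]
        if free:
            return free
        # Assumed fallback when every path carries short flows: least crowded.
        return [min(range(self.num_paths), key=lambda p: len(self.short_flows[p]))]


balancer = IsolatingBalancer(num_paths=3)
all_paths = balancer.paths_for_long()   # no short flows yet: [0, 1, 2]
balancer.add_short("s1", 0)
isolated = balancer.paths_for_long()    # path 0 is left to the short flow: [1, 2]
balancer.remove_short("s1")
restored = balancer.paths_for_long()    # short flow finished: [0, 1, 2]
```

In this sketch a long flow vacates a path as soon as a short flow appears on it and regains every path once the short flows finish, which mirrors the two behaviours described above.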

The rest of the paper is organized as follows. Section 2 investigates the problems of load balancing under flow collisions and summarizes our contributions. Section 3 presents the basic idea, overview, design details, algorithm and model analysis of ILB. Section 4 evaluates the performance of ILB with numerous NS3 simulation tests and real experiments, and Section 5 discusses related work. Finally, Section 6 offers concluding remarks.

Section snippets

Design motivation

In this section, we first investigate the impact of flow collisions between mixed heterogeneous flows under representative datacenter load balancing schemes. Then, we summarize the causes of performance degradation and present our design objectives.

ILB design

In this section, we first present the basic insight and overview of ILB. Then, we elaborate on its design details and algorithm, and discuss why ILB is effective. Finally, we build a mathematical model to analyze how ILB benefits from avoiding flow collisions and how to determine its parameters.

Evaluation

In this section, we conduct numerous NS3 simulation tests to evaluate the performance of ILB. First, we repeat the micro-benchmark in Section 2.2 to observe whether ILB performs as expected. Then, we evaluate the performance of ILB in the asymmetric scenario. After that, we construct a large-scale simulated scenario and run several typical and realistic datacenter workloads for a comprehensive evaluation [24], [36]. Finally, we investigate the implementation overhead of ILB based on

Related work

In recent years, although various transport control protocols [1], [3], [4], [7], [8], [9], [10], [11], [13], [29], [42], [43], [44], [45], [46], [47] have been proposed to reduce flow completion time, they fail to make full use of network bandwidth resources and inevitably degrade network performance. Therefore, researchers have designed various load balancing mechanisms for data center networks and wireless networks [48], [49], [50] to facilitate parallel data transmission across

Conclusion

This work presents the design of ILB, an isolation-based load balancing scheme that avoids collisions between mixed heterogeneous datacenter flows. By identifying flow types, ILB perceives flow collisions and dynamically assigns different paths to long and short flows. When tiny flows collide with long flows on the same path, ILB immediately forces the latter to reroute to other paths so that the former can complete quickly. When tiny flows disappear, the long flows quickly occupy

CRediT authorship contribution statement

Tao Zhang: Designs the whole system, Conducts the experiment while writing the research paper. Qianqiang Zhang: Designs and conducts the experiment. Yasi Lei: Designs and conducts the experiment. Shaojun Zou: Designs the algorithm, Writes the research paper. Juan Huang: Designs and conducts the experiment. Fangmin Li: Designs the whole system and experiment while writing the research paper.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work is supported in part by the National Natural Science Foundation of China under Grants 61872403, 61772088, and 62102047; in part by the Hunan Province Key Laboratory of Industrial Internet Technology and Security, China under Grant 2019TP1011; in part by the Natural Science Foundation of Hunan Province, China under Grant 2020JJ6064.

Tao Zhang received his Ph.D. degree in the School of Computer Science and Engineering, Central South University, China. He is now an associate professor in Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, China. His research interests include congestion control, load balancing, performance modeling, analysis, and data center networking.

References (55)

  • S. Zou, et al., Achieving high utilization of flowlet-based load balancing in data center networks, Future Gener. Comput. Syst. (2020)
  • G. Kumar, N. Dukkipati, K. Jang, H.M.G. Wassel, X. Wu, B. Montazeri, Y. Wang, K. Springborn, C. Alfeld, M. Ryan, D....
  • Y. Jiang, L. Sivalingam, S. Nath, R. Govindan, Webperf: evaluating what-if scenarios for cloud-hosted web applications,...
  • M. Alizadeh, A. Greenberg, D.A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan, Data Center TCP...
  • H. Xu, B. Li, RepFlow: minimizing flow completion times with replicated flows in data centers, in: Proc. IEEE INFOCOM,...
  • T. Benson, A. Akella, D. Maltz, Network traffic characteristics of data centers in the wild, in: Proc. ACM IMC, 2010,...
  • W. Bai, et al., PIAS: Practical information-agnostic flow scheduling for commodity data centers, IEEE/ACM Trans. Netw. (2017)
  • S. Hu, W. Bai, G. Zeng, Z. Wang, B. Qiao, K. Chen, K. Tan, Y. Wang, Aeolus: A building block for proactive transport in...
  • A. Saeed, V. Gupta, P. Goyal, M. Sharif, R. Pan, M. Ammar, E. Zegura, K. Jang, M. Alizadeh, A. Kabbani, A. Vahdat,...
  • G. Zeng, W. Bai, G. Chen, K. Chen, D. Han, Y. Zhu, L. Cui, Congestion control for cross-datacenter networks, in: Proc....
  • R. Mittal, V.T. Lam, N. Dukkipati, E. Blem, H. Wassel, M. Ghobadi, A. Vahdat, Y. Wang, D. Wetherall, D. Zats, TIMELY:...
  • C. Lee, et al., DX: Latency-based congestion control for datacenters, IEEE/ACM Trans. Netw. (2017)
  • J. Zhang, W. Bai, K. Chen, Enabling ECN for datacenter networks with RTT variations, in: ACM CoNEXT, 2019, pp....
  • C. Raiciu, S. Barre, C. Pluntke, A. Greenhalgh, Improving datacenter performance and robustness with multipath TCP, in:...
  • M. Alizadeh, T. Edsall, S. Dharmapurikar, R. Vaidyanathan, K. Chu, A. Fingerhut, V.T. Lam, F. Matus, R. Pan, N. Yadav,...
  • E. Vanini, R. Pan, M. Alizadeh, P. Taheri, T. Edsall, Let it flow: resilient asymmetric load balancing with flowlet...
  • J. Huang, et al., Mitigating packet reordering for random packet spraying in data center networks, IEEE/ACM Trans. Netw. (2021)
  • J. Hu, et al., CAPS: Coding-based adaptive packet spraying to reduce flow completion time in data center, IEEE/ACM Trans. Netw. (2019)
  • C.E. Hopps, Analysis of an equal-cost multi-path algorithm, in: RFC...
  • K. He, E. Rozner, K. Agarwal, W. Felter, J. Carter, A. Akella, Presto: edge-based load balancing for fast datacenter...
  • A. Dixit, P. Prakash, Y.C. Hu, R.R. Kompella, On the impact of packet spraying in data center networks, in: Proc. IEEE...
  • S. Ghorbani, Z. Yang, P. Godfrey, Y. Ganjali, A. Firoozshahian, DRILL: micro load balancing for low-latency data center...
  • H. Zhang, J. Zhang, W. Bai, K. Chen, M. Chowdhury, Resilient datacenter load balancing in the wild, in: Proc. ACM...
  • I. Cho, K. Jang, D. Han, Credit-scheduled delay-bounded congestion control for datacenters, in: Proc. ACM SIGCOMM,...
  • J. Ye, L. Ma, C. Pang, Q. Xiao, W. Jiang, Inferring coflow size based on broad learning system in data center network,...
  • D. Gross, J.F. Shortle, J.M. Thompson, C.M. Harris, Fundamentals of Queueing Theory, in: Proc. Wiley-Interscience,...
  • M. Mathis, J. Semke, J. Mahdavi, T. Ott, The macroscopic behavior of the TCP congestion avoidance algorithm, in: Proc....

    Qianqiang Zhang is currently working toward the B.Sc. degree in the School of Computer Engineering and Applied Mathematics, Changsha University, China. His current research interests include load balancing and data center networks.

    Yasi Lei is currently working toward the B.Sc. degree in the School of Computer Engineering and Applied Mathematics, Changsha University, China. Her current research interests include network programming, load balancing, and data center networks.

    Shaojun Zou received his Ph.D. degree in the School of Computer Science and Engineering, Central South University, China. His current research interests include congestion control, load balancing, and data center networks.

    Juan Huang received the M.Sc. degree from Central South University, Changsha, China, majoring in computer science. She is currently a lecturer in the School of Computer Engineering and Applied Mathematics, Changsha University, China. Her research interests include programming, load balancing, and data center networks.

    Fangmin Li received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 1990, the M.Sc. degree from the National University of Defense Technology, Changsha, China, in 1997, and the Ph.D. degree from Zhejiang University, Hangzhou, China, in 2001, all in computer science. He is currently the chair of and a professor in Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, China. His current research interests include congestion control, data center networking, wireless communications and network security, computer systems and architectures, and embedded systems.
