Load balancing with traffic isolation in data center networks
Introduction
With the rapid development of cloud computing and big data, modern data centers have become the cornerstones of the computing infrastructure and host a great diversity of distributed processing applications, including web search, advertising, social collaboration, and recommendation systems [1], [2], [3]. These applications generate heterogeneous traffic (i.e., a mix of short and long flows) in the data center network (DCN). Most of these flows are short flows that require low latency to provide soft real-time performance to users, while the remaining ones are usually throughput-sensitive long flows that deliver large amounts of data [3], [4], [5], [6]. Both long and short flows should therefore be transmitted efficiently in the DCN, with small, predictable latency for short flows and large, sustained throughput for long flows at the same time [1], [3], [7], [8], [9], [10], [11], [12], [13].
Modern DCNs are typically organized in multi-rooted tree topologies such as leaf–spine, which provide multiple paths between any pair of hosts [14], [15], [16], [17], [18]. Recent progress has demonstrated that designing an effective load balancing scheme is a promising way to meet the above challenges. Existing load balancing schemes strive to spread data flows across all the parallel paths. Among them, Equal Cost MultiPath (ECMP) [19] uses a hash taken over packet headers to assign flows to different paths, and has been used as the standard load-balancing mechanism in production data centers due to its simplicity. However, it suffers from the well-known hash collision problem and cannot reroute traffic flexibly. To address these issues, researchers have proposed more fine-grained mechanisms: per-packet and per-flowlet/flowcell solutions.
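The hash collision problem can be illustrated with a minimal sketch. It assumes a hypothetical `ecmp_path` helper that hashes a flow's 5-tuple with SHA-256 and takes the result modulo the number of paths; real switches use their own hardware hash functions, so the hash choice, the addresses and the port numbers here are illustrative assumptions only.

```python
import hashlib


def ecmp_path(five_tuple, num_paths):
    """Map a flow to a path index by hashing its header fields.

    Hypothetical helper: SHA-256 stands in for a switch's hash function.
    """
    key = "|".join(str(field) for field in five_tuple).encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:4], "big") % num_paths


# Eight made-up flows between the same host pair, differing in source port.
flows = [("10.0.0.1", "10.0.1.1", 40000 + i, 80, "tcp") for i in range(8)]
num_paths = 4

# Every packet of a flow hashes to the same path, so a flow never moves.
assignment = {flow: ecmp_path(flow, num_paths) for flow in flows}

# Count flows per path: with more flows than paths, some path must be shared.
load = [0] * num_paths
for path in assignment.values():
    load[path] += 1

print("flows per path:", load)
```

Because the mapping is static, two long flows that hash to the same path stay there for their whole lifetime, which is the inflexibility the paragraph above describes.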
For example, LetFlow [15] and Presto [20] switch paths at the granularity of flowlets and flowcells, respectively. Although flows can then use more parallel paths without causing serious packet reordering, both schemes are still not flexible enough when rerouting, which leads to link under-utilization. Random Packet Spraying (RPS) [21], DRILL [22] and Hermes [23] split and reroute traffic at the packet level, significantly improving link utilization in symmetric network topologies. Nonetheless, production DCNs are subject to many uncertainties such as dynamic traffic and link/switch failures [23], which inevitably turn a symmetric topology into an asymmetric one. Consequently, these packet-level schemes suffer from serious packet reordering, leading to non-trivial degradation of network performance.
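As a rough illustration of flowlet-level switching, the sketch below keeps a per-flow table and picks a new random path only when the gap since the flow's previous packet exceeds a timeout. The class name `FlowletSwitch`, the timeout value and the time units are assumptions made for this example; it is not the implementation of LetFlow or of any other cited scheme.

```python
import random


class FlowletSwitch:
    """Toy flowlet switching: reroute a flow only after an idle gap."""

    def __init__(self, num_paths, timeout, seed=0):
        self.num_paths = num_paths
        self.timeout = timeout          # idle gap that starts a new flowlet
        self.rng = random.Random(seed)  # seeded for repeatable runs
        self.table = {}                 # flow_id -> (last_seen_time, path)

    def route(self, flow_id, now):
        entry = self.table.get(flow_id)
        if entry is None or now - entry[0] > self.timeout:
            # First packet, or the gap is large enough that earlier packets
            # have likely drained: a new path is unlikely to reorder them.
            path = self.rng.randrange(self.num_paths)
        else:
            # Same flowlet: stay on the current path to keep packets in order.
            path = entry[1]
        self.table[flow_id] = (now, path)
        return path


switch = FlowletSwitch(num_paths=4, timeout=0.5)
for t in (0.0, 0.1, 0.2, 5.0):
    print(t, switch.route("flow-A", t))
```

A long flow that sends continuously rarely shows such gaps, so it rarely gets a chance to move. Per-packet schemes would instead choose a path for every packet, trading reordering risk for flexibility.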
Moreover, none of the above solutions is aware of a key traffic feature: abundant delay-sensitive short flows and a few throughput-oriented long flows are mixed and transmitted over the same paths. They reroute these heterogeneous flows without regard to path conditions, leading to frequent flow collisions. As a result, both short and long flows suffer from large queuing delay, packet reordering, and low link utilization, which severely damages network performance.
In this paper, we propose a load balancing scheme, ILB, to address the above inefficiencies. ILB perceives flow collisions and dynamically assigns paths to long flows so that they avoid colliding with short flows. When short flows emerge, the long flows immediately change their transmission paths to free up valuable bandwidth for the short flows. When short flows disappear, the long flows quickly occupy all the available paths to achieve high link utilization. In this way, ILB greatly reduces the queuing delay for short flows while achieving high throughput for long flows.
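The isolation idea can be sketched in a few lines. This is only a hedged illustration of the behaviour described above, not the ILB algorithm itself: the function `long_flow_paths`, the per-path short-flow counters, and the fallback used when every path carries short flows are all assumptions introduced for this example.

```python
def long_flow_paths(all_paths, short_flow_count):
    """Return the paths that long flows may use.

    all_paths: list of path identifiers.
    short_flow_count: dict mapping a path to its number of active short flows
    (hypothetical bookkeeping; paths missing from the dict count as zero).
    """
    free = [p for p in all_paths if short_flow_count.get(p, 0) == 0]
    if free:
        # Long flows keep away from any path that currently carries short flows.
        return free
    # Assumed fallback (not from the paper): if every path has short flows,
    # use the paths with the fewest of them rather than stalling long flows.
    fewest = min(short_flow_count.get(p, 0) for p in all_paths)
    return [p for p in all_paths if short_flow_count.get(p, 0) == fewest]


paths = [0, 1, 2, 3]
print(long_flow_paths(paths, {}))      # no short flows: all paths are usable
print(long_flow_paths(paths, {1: 2}))  # short flows on path 1: long flows leave it
```

When the short flows on a path finish and its counter returns to zero, that path becomes available to long flows again, matching the "occupy all the available paths" behaviour in the text.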
The rest of the paper is organized as follows. We investigate the problems of load balancing under flow collisions and summarize our contributions in Section 2. We present the basic idea, overview, design details, algorithm and model analysis of ILB in Section 3. We evaluate the performance of ILB with numerous NS3 simulation tests and real experiments in Section 4, and discuss related work in Section 5. Finally, we offer concluding remarks in Section 6.
Section snippets
Design motivation
In this section, we first investigate the impact of flow collisions between mixed heterogeneous flows under representative datacenter load balancing schemes. Then, we summarize the causes of performance degradation and present our design objectives.
ILB design
In this section, we first present the basic insight and overview of ILB. Then, we elaborate on its design details and algorithm, and discuss why ILB is effective. Finally, we build a mathematical model to analyze how ILB benefits from avoiding flow collisions and how to determine its parameters.
Evaluation
In this section, we conduct numerous NS3 simulation tests to evaluate the performance of ILB. First, we rerun the micro-benchmark from Section 2.2 to observe whether ILB performs as expected. Then, we evaluate the performance of ILB in the asymmetric scenario. After that, we construct a large-scale simulated scenario and run several typical and realistic datacenter workloads to make a comprehensive evaluation [24], [36]. Finally, we investigate the implementation overhead of ILB based on
Related work
In recent years, although various transport control protocols [1], [3], [4], [7], [8], [9], [10], [11], [13], [29], [42], [43], [44], [45], [46], [47] have been proposed to reduce flow completion time, they fail to make full use of network bandwidth resources and inevitably degrade network performance. Therefore, researchers have designed various load balancing mechanisms for data center networks and wireless networks [48], [49], [50] to facilitate parallel data transmission across
Conclusion
This work presents the design of an isolation-based load balancing scheme, ILB, for avoiding collisions between mixed heterogeneous datacenter flows. Based on identifying flow types, ILB perceives flow collisions and dynamically assigns different paths to long and short flows. When tiny flows collide with long flows on the same path, ILB immediately forces the latter to reroute to other paths so that the former can complete quickly. When tiny flows disappear, the long flows quickly occupy
CRediT authorship contribution statement
Tao Zhang: Designs the whole system, Conducts the experiment while writing the research paper. Qianqiang Zhang: Designs and conducts the experiment. Yasi Lei: Designs and conducts the experiment. Shaojun Zou: Designs the algorithm, Writes the research paper. Juan Huang: Designs and conducts the experiment. Fangmin Li: Designs the whole system and experiment while writing the research paper.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work is supported in part by the National Natural Science Foundation of China under Grants 61872403, 61772088, and 62102047; in part by the Hunan Province Key Laboratory of Industrial Internet Technology and Security, China under Grant 2019TP1011; in part by the Natural Science Foundation of Hunan Province, China under Grant 2020JJ6064.
References (55)
- et al., Achieving high utilization of flowlet-based load balancing in data center networks, Future Gener. Comput. Syst. (2020)
- G. Kumar, N. Dukkipati, K. Jang, H.M.G. Wassel, X. Wu, B. Montazeri, Y. Wang, K. Springborn, C. Alfeld, M. Ryan, D....
- Y. Jiang, L. Sivalingam, S. Nath, R. Govindan, Webperf: evaluating what-if scenarios for cloud-hosted web applications,...
- M. Alizadeh, A. Greenberg, D.A. Maltz, J. Padhye, P. Patel, B. Prabhakar, S. Sengupta, M. Sridharan, Data Center TCP...
- H. Xu, B. Li, RepFlow: minimizing flow completion times with replicated flows in data centers, in: Proc. IEEE INFOCOM,...
- T. Benson, A. Akella, D. Maltz, Network traffic characteristics of data centers in the wild, in: Proc. ACM IMC, 2010,...
- et al., PIAS: Practical information-agnostic flow scheduling for commodity data centers, IEEE/ACM Trans. Netw. (2017)
- S. Hu, W. Bai, G. Zeng, Z. Wang, B. Qiao, K. Chen, K. Tan, Y. Wang, Aeolus: A building block for proactive transport in...
- A. Saeed, V. Gupta, P. Goyal, M. Sharif, R. Pan, M. Ammar, E. Zegura, K. Jang, M. Alizadeh, A. Kabbani, A. Vahdat,...
- G. Zeng, W. Bai, G. Chen, K. Chen, D. Han, Y. Zhu, L. Cui, Congestion control for cross-datacenter networks, in: Proc....
- DX: Latency-based congestion control for datacenters, IEEE/ACM Trans. Netw.
- Mitigating packet reordering for random packet spraying in data center networks, IEEE/ACM Trans. Netw.
- CAPS: Coding-based adaptive packet spraying to reduce flow completion time in data center, IEEE/ACM Trans. Netw.
Tao Zhang received his Ph.D. degree in the School of Computer Science and Engineering, Central South University, China. He is now an associate professor in Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, China. His research interests include congestion control, load balancing, performance modeling, analysis, and data center networking.
Qianqiang Zhang is currently working toward the B.Sc. degree in the School of Computer Engineering and Applied Mathematics, Changsha University, China. His current research interests include load balancing and data center networks.
Yasi Lei is currently working toward the B.Sc. degree in the School of Computer Engineering and Applied Mathematics, Changsha University, China. Her current research interests include network programming, load balancing, and data center networks.
Shaojun Zou received his Ph.D. degree in the School of Computer Science and Engineering, Central South University, China. His current research interests include congestion control, load balancing, and data center networks.
Juan Huang received the M.Sc. degree from Central South University, Changsha, China, majoring in computer science. She is currently a lecturer in the School of Computer Engineering and Applied Mathematics, Changsha University, China. Her research interests include programming, load balancing, and data center networks.
Fangmin Li received the B.Sc. degree from the Huazhong University of Science and Technology, Wuhan, China, in 1990, the M.Sc. degree from the National University of Defense Technology, Changsha, China, in 1997, and the Ph.D. degree from Zhejiang University, Hangzhou, China, in 2001, all in computer science. He is currently the chair of and a professor in Hunan Province Key Laboratory of Industrial Internet Technology and Security, Changsha University, China. His current research interests include congestion control, data center networking, wireless communications and network security, computer systems and architectures, and embedded systems.