RILNET: A Reinforcement Learning Based Load Balancing Approach for Datacenter Networks

Lin, Qinliang; Gong, Zhibo; Wang, Qiaoling; Li, Jinlong

doi:10.1007/978-3-030-19945-6_4

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 11407)

Included in the following conference series:

International Conference on Machine Learning for Networking

Abstract

Modern datacenter networks face several challenges, e.g., highly dynamic workloads, congestion, and topology asymmetry. ECMP, the traditional load balancing mechanism widely used in today’s datacenters, can balance load poorly and lead to congestion. A variety of load balancing schemes have been proposed to address the problems of ECMP. However, these traditional schemes usually make load balancing decisions based only on network knowledge from a snapshot or a short window of the recent past. In this paper, we propose a Reinforcement Learning (RL) based approach, called RILNET (ReInforcement Learning NETworking), for load balancing in datacenter networks. RILNET employs RL to learn a network and control it based on the learned experience. To achieve finer-grained control, RILNET routes flowlets rather than flows. Moreover, RILNET makes routing decisions for aggregation flows instead of single flows, where an aggregation flow is the set of all flows from the same source edge switch to the same destination edge switch. To test the performance of RILNET, we design a flow-level simulation and a packet-level simulation, and both sets of results show that RILNET balances traffic load much more effectively than ECMP and another load balancing solution, DRILL. RILNET also outperforms DRILL in data loss and maximal link delay: the maximal link data loss and the maximal link delay of RILNET are 44.4% and 25.4% smaller than those of DRILL, respectively.
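The two units of control named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Flow` record, the switch labels, the function names, and the idle-gap threshold are all hypothetical, chosen only for illustration. The first function groups flows into aggregation flows keyed by (source edge switch, destination edge switch); the second splits one flow's packet timestamps into flowlets, using the common convention that a new flowlet starts after an idle gap longer than a threshold.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple, Tuple


class Flow(NamedTuple):
    """Hypothetical flow record; field names are assumptions, not from the paper."""
    flow_id: str
    src_edge: str  # edge switch the sending host attaches to
    dst_edge: str  # edge switch the receiving host attaches to


def aggregate_flows(flows: List[Flow]) -> Dict[Tuple[str, str], List[str]]:
    """Group flow ids by (source edge switch, destination edge switch)."""
    groups: Dict[Tuple[str, str], List[str]] = defaultdict(list)
    for flow in flows:
        groups[(flow.src_edge, flow.dst_edge)].append(flow.flow_id)
    return dict(groups)


def split_flowlets(packet_times: List[float], gap: float) -> List[List[float]]:
    """Split packet timestamps into flowlets.

    A new flowlet begins whenever the time since the previous packet
    exceeds `gap` (an assumed, tunable threshold).
    """
    flowlets: List[List[float]] = []
    for t in sorted(packet_times):
        if flowlets and t - flowlets[-1][-1] <= gap:
            flowlets[-1].append(t)  # still within the current burst
        else:
            flowlets.append([t])    # idle gap exceeded: start a new flowlet
    return flowlets


# Example with made-up switches e1..e3 and timestamps in arbitrary time units.
flows = [Flow("f1", "e1", "e3"), Flow("f2", "e1", "e3"), Flow("f3", "e2", "e3")]
aggregated = aggregate_flows(flows)
flowlets = split_flowlets([0.0, 0.1, 0.2, 5.0, 5.1], gap=1.0)
```

In this toy input, `f1` and `f2` share one aggregation flow because they have the same edge-switch pair, while `f3` forms its own. A routing decision made per aggregation flow then covers every member flow at once, and it can be revisited at each flowlet boundary.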

Notes

1.
Note that RILNET can serve multiple purposes, including load balancing, reducing data loss, and reducing flow completion time. In this paper, we focus on load balancing and leave the other purposes to future work.

Author information

Authors and Affiliations

Network Technology Laboratory, 2012 Labs, Huawei Technologies Co., Ltd., Shenzhen, China
Qinliang Lin, Zhibo Gong, Qiaoling Wang & Jinlong Li

Corresponding author

Correspondence to Qinliang Lin.

Editor information

Editors and Affiliations

Télécom SudParis, Évry, France
Éric Renault
Inria, Paris, France
Paul Mühlethaler
CNAM/CEDRIC, Paris, France
Selma Boumerdassi

About this paper

Cite this paper

Lin, Q., Gong, Z., Wang, Q., Li, J. (2019). RILNET: A Reinforcement Learning Based Load Balancing Approach for Datacenter Networks. In: Renault, É., Mühlethaler, P., Boumerdassi, S. (eds) Machine Learning for Networking. MLN 2018. Lecture Notes in Computer Science, vol 11407. Springer, Cham. https://doi.org/10.1007/978-3-030-19945-6_4

DOI: https://doi.org/10.1007/978-3-030-19945-6_4
Published: 10 May 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-19944-9
Online ISBN: 978-3-030-19945-6
eBook Packages: Computer Science, Computer Science (R0)
