DOI: 10.1145/3098822.3098841

Resilient Datacenter Load Balancing in the Wild

Published: 07 August 2017

ABSTRACT

Production datacenters operate under various uncertainties such as traffic dynamics, topology asymmetry, and failures. Datacenter load balancing schemes must therefore be resilient to these uncertainties; i.e., they should sense path conditions accurately and react promptly to mitigate the fallout. Despite significant efforts, prior solutions have important drawbacks. On the one hand, solutions such as Presto and DRB are oblivious to path conditions and blindly reroute at a fixed granularity. On the other hand, solutions such as CONGA and CLOVE can sense congestion, but they can reroute only when flowlets emerge; thus, they cannot always react to uncertainties in time. To make things worse, these solutions fail to detect or handle failures such as blackholes and random packet drops, which greatly degrades their performance.
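
To illustrate why flowlet-based schemes can be slow to react, the following is a minimal sketch of flowlet switching, not the CONGA or CLOVE implementation. The class name `FlowletSwitch`, the `gap_timeout_us` parameter, and the per-path congestion dictionary are hypothetical names chosen for this example. A flow may change paths only after an idle gap longer than the timeout, so a steadily sending flow stays on its path even when that path becomes congested.

```python
class FlowletSwitch:
    """Toy flowlet-based path selector (illustrative sketch only)."""

    def __init__(self, paths, gap_timeout_us):
        self.paths = list(paths)
        # Hypothetical idle-gap threshold (microseconds) that starts a new flowlet.
        self.gap_timeout_us = gap_timeout_us
        # flow_id -> (arrival time of the last packet, path currently in use)
        self.state = {}

    def route(self, flow_id, now_us, congestion):
        """Pick a path for a packet of `flow_id` arriving at `now_us`.

        `congestion` maps each path to a load estimate (lower is better).
        """
        entry = self.state.get(flow_id)
        if entry is not None and now_us - entry[0] <= self.gap_timeout_us:
            # Still inside the current flowlet: keep the old path,
            # even if it is now the more congested one.
            path = entry[1]
        else:
            # New flow or a long enough idle gap: pick the least congested path.
            path = min(self.paths, key=lambda p: congestion[p])
        self.state[flow_id] = (now_us, path)
        return path


switch = FlowletSwitch(["A", "B"], gap_timeout_us=500)
first = switch.route("f1", 0, {"A": 0.1, "B": 0.9})        # new flow, A is lighter
stuck = switch.route("f1", 100, {"A": 0.9, "B": 0.1})      # A congested, but no gap yet
still = switch.route("f1", 300, {"A": 0.9, "B": 0.1})      # gap of 200 us, still no switch
moved = switch.route("f1", 1000, {"A": 0.9, "B": 0.1})     # gap of 700 us, flowlet boundary
print(first, stuck, still, moved)
```

In this sketch the flow remains on path A for the second and third packets despite the congestion, and moves to B only once a gap exceeds the timeout.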

In this paper, we introduce Hermes, a datacenter load balancer that is resilient to the aforementioned uncertainties. At its heart, Hermes leverages comprehensive sensing to detect path conditions, including failures that prior schemes leave unattended, and it reacts using timely yet cautious rerouting. Hermes is a practical edge-based solution that requires no switch modification. We have implemented Hermes with commodity switches and evaluated it through both testbed experiments and large-scale simulations. Our results show that Hermes achieves performance comparable to CONGA and Presto in normal cases and handles uncertainties well: under asymmetries, Hermes achieves up to 10% and 20% better flow completion time (FCT) than CONGA and CLOVE, respectively; under switch failures, it outperforms all other schemes by over 32%.


References

  1. DCTCP in Linux Kernel 3.18. "http://kernelnewbies.org/Linux3.18".
  2. In-band Network Telemetry (INT). "http://p4.org/wp-content/uploads/fixed/INT/INT-current-spec.pdf".
  3. Mohammad Al-Fares, Alexander Loukissas, and Amin Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM 2008.
  4. Mohammad Al-Fares, Sivasankar Radhakrishnan, Barath Raghavan, Nelson Huang, and Amin Vahdat. Hedera: Dynamic Flow Scheduling for Data Center Networks. In NSDI 2010.
  5. Mohammad Alizadeh, Tom Edsall, Sarang Dharmapurikar, Ramanan Vaidyanathan, Kevin Chu, Andy Fingerhut, Francis Matus, Rong Pan, Navindra Yadav, George Varghese, and others. CONGA: Distributed Congestion-Aware Load Balancing for Datacenters. In SIGCOMM 2014.
  6. Mohammad Alizadeh, Albert Greenberg, David A Maltz, Jitendra Padhye, Parveen Patel, Balaji Prabhakar, Sudipta Sengupta, and Murari Sridharan. Data Center TCP (DCTCP). In SIGCOMM 2010.
  7. Wei Bai, Li Chen, Kai Chen, Dongsu Han, Chen Tian, and Hao Wang. Information-Agnostic Flow Scheduling for Commodity Data Centers. In NSDI 2015.
  8. Wei Bai, Li Chen, Kai Chen, and Haitao Wu. Enabling ECN in Multi-Service Multi-Queue Data Centers. In NSDI 2016.
  9. Hitesh Ballani, Paolo Costa, Christos Gkantsidis, Matthew P Grosvenor, Thomas Karagiannis, Lazaros Koromilas, and Greg O'Shea. Enabling End-host Network Functions. In SIGCOMM 2015.
  10. Peter Bodík, Ishai Menache, Mosharaf Chowdhury, Pradeepkumar Mani, David A Maltz, and Ion Stoica. Surviving Failures in Bandwidth-Constrained Datacenters. In SIGCOMM 2012.
  11. Pat Bosshart, Dan Daly, Glen Gibb, Martin Izzard, Nick McKeown, Jennifer Rexford, Cole Schlesinger, Dan Talayco, Amin Vahdat, George Varghese, and others. 2014. P4: Programming Protocol-Independent Packet Processors. SIGCOMM CCR 44, 3 (2014), 87–95.
  12. Jiaxin Cao, Rui Xia, Pengkun Yang, Chuanxiong Guo, Guohan Lu, Lihua Yuan, Yixin Zheng, Haitao Wu, Yongqiang Xiong, and Dave Maltz. Per-packet Load-balanced, Low-latency Routing for Clos-based Data Center Networks. In CoNEXT 2013.
  13. Advait Dixit, Pawan Prakash, Y Charlie Hu, and Ramana Rao Kompella. On the Impact of Packet Spraying in Data Center Networks. In INFOCOM 2013.
  14. Erico Vanini, Rong Pan, Mohammad Alizadeh, Parvin Taheri, and Tom Edsall. Let It Flow: Resilient Asymmetric Load Balancing with Flowlet Switching. In NSDI 2017.
  15. Yilong Geng, Vimalkumar Jeyakumar, Abdul Kabbani, and Mohammad Alizadeh. JUGGLER: A Practical Reordering Resilient Network Stack for Datacenters. In EuroSys 2016.
  16. Soudeh Ghorbani, Brighten Godfrey, Yashar Ganjali, and Amin Firoozshahian. Micro Load Balancing in Data Centers with DRILL. In HotNets 2015.
  17. Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. In SIGCOMM 2011.
  18. Albert Greenberg, James R. Hamilton, Navendu Jain, Srikanth Kandula, Changhoon Kim, Parantap Lahiri, David A. Maltz, Parveen Patel, and Sudipta Sengupta. VL2: A Scalable and Flexible Data Center Network. In SIGCOMM 2009.
  19. Chuanxiong Guo, Lihua Yuan, Dong Xiang, Yingnong Dang, Ray Huang, Dave Maltz, Zhaoyi Liu, Vin Wang, Bin Pang, Hua Chen, Zhi-Wei Lin, and Varugis Kurien. Pingmesh: A Large-Scale System for Data Center Network Latency Measurement and Analysis. In SIGCOMM 2015.
  20. Keqiang He, Eric Rozner, Kanak Agarwal, Wes Felter, John Carter, and Aditya Akella. Presto: Edge-based Load Balancing for Fast Datacenter Networks. In SIGCOMM 2015.
  21. CE Hopps. Analysis of an Equal-Cost Multi-Path Algorithm. In RFC 2992.
  22. Shuihai Hu, Kai Chen, Haitao Wu, Wei Bai, Chang Lan, Hao Wang, Hongze Zhao, and Chuanxiong Guo. Explicit Path Control in Commodity Data Centers: Design and Applications. In NSDI 2015.
  23. Abdul Kabbani, Balajee Vamanan, Jahangir Hasan, and Fabien Duchene. FlowBender: Flow-level Adaptive Routing for Improved Latency and Throughput in Datacenter Networks. In CoNEXT 2014.
  24. Naga Katta, Mukesh Hira, Aditi Ghag, Changhoon Kim, Isaac Keslassy, and Jennifer Rexford. CLOVE: How I Learned to Stop Worrying about the Core and Love the Edge. In HotNets 2016.
  25. Naga Katta, Mukesh Hira, Changhoon Kim, Anirudh Sivaraman, and Jennifer Rexford. HULA: Scalable Load Balancing Using Programmable Data Planes. In SOSR 2016.
  26. Radhika Mittal, Nandita Dukkipati, Emily Blem, Hassan Wassel, Monia Ghobadi, Amin Vahdat, Yaogong Wang, David Wetherall, David Zats, and others. TIMELY: RTT-based Congestion Control for the Datacenter. In SIGCOMM 2015.
  27. Michael Mitzenmacher. 2000. How Useful Is Old Information? IEEE TPDS 11, 1 (2000), 6–20.
  28. Michael Mitzenmacher. 2001. The Power of Two Choices in Randomized Load Balancing. IEEE TPDS 12, 10 (2001), 1094–1104.
  29. Michael Mitzenmacher, Balaji Prabhakar, and Devavrat Shah. Load Balancing with Memory. In FOCS 2002.
  30. Jayakrishnan Nair, Adam Wierman, and Bert Zwart. The Fundamentals of Heavy-tails: Properties, Emergence, and Identification. In SIGMETRICS 2013.
  31. Costin Raiciu, Sebastien Barre, Christopher Pluntke, Adam Greenhalgh, Damon Wischik, and Mark Handley. Improving Datacenter Performance and Robustness with Multipath TCP. In SIGCOMM 2011.
  32. Arjun Roy, Hongyi Zeng, Jasmeet Bagga, and Alex C Snoeren. Passive Realtime Datacenter Fault Detection and Localization. In NSDI 2017.
  33. Shan Sinha, Srikanth Kandula, and Dina Katabi. Harnessing TCP's Burstiness with Flowlet Switching. In HotNets 2004.
  34. Balajee Vamanan, Jahangir Hasan, and TN Vijaykumar. Deadline-Aware Datacenter TCP (D2TCP). In SIGCOMM 2012.
  35. Peng Wang, Hong Xu, Zhixiong Niu, Dongsu Han, and Yongqiang Xiong. Expeditus: Congestion-aware Load Balancing in Clos Data Center Networks. In SoCC 2016.

            Published in

            SIGCOMM '17: Proceedings of the Conference of the ACM Special Interest Group on Data Communication
            August 2017
            515 pages
            ISBN: 9781450346535
            DOI: 10.1145/3098822

            Copyright © 2017 ACM

            Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

            Publisher

            Association for Computing Machinery

            New York, NY, United States



            Qualifiers

            • research-article
            • Research
            • Refereed limited

            Acceptance Rates

            Overall Acceptance Rate: 554 of 3,547 submissions, 16%
