ABSTRACT
This paper introduces sharable backup as a novel solution to failure recovery in data center networks. It allows the entire network to share a small pool of backup devices. The proposal is grounded in three key observations. First, traditional rerouting-based failure recovery is ineffective, because the bandwidth lost to failures degrades application performance drastically. Failed devices should therefore be replaced to restore bandwidth. Second, failures in data centers are rare but destructive [11], so cost-effective backup options are desirable. Third, the emergence of configurable data center network architectures makes it feasible to bring backup devices online dynamically. We design the ShareBackup prototype architecture to realize this idea. Compared to rerouting-based solutions, ShareBackup provides more bandwidth with short path lengths at low cost.
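The bandwidth argument in the abstract can be illustrated with a toy model. The sketch below is not the ShareBackup design. It assumes a group of equal-capacity switches with traffic spread evenly across them, and a hypothetical pool of spares in which each spare replaces one failed switch instantly. The function names and parameters are illustrative only.

```python
def capacity_with_rerouting(switches: int, failures: int) -> float:
    """Fraction of bandwidth left when traffic is rerouted around failed switches.

    Assumption: all switches have equal capacity, so each failure removes
    1/switches of the group's bandwidth.
    """
    return max(switches - failures, 0) / switches


def capacity_with_shared_backup(switches: int, failures: int, pool_size: int) -> float:
    """Fraction of bandwidth left when failed switches are swapped for pooled spares.

    Assumption: each spare in the (hypothetical) shared pool replaces exactly
    one failed switch; failures beyond the pool size still cost bandwidth.
    """
    replaced = min(failures, pool_size)
    return capacity_with_rerouting(switches, failures - replaced)


if __name__ == "__main__":
    # One failure in a group of four switches, with a single shared spare.
    print(capacity_with_rerouting(4, 1))
    print(capacity_with_shared_backup(4, 1, pool_size=1))
```

Under these assumptions, rerouting keeps connectivity but leaves the group short of bandwidth, while a single spare restores full capacity as long as failures do not outnumber the pool.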
References
[1] Coflow-Benchmark, https://github.com/coflow/coflow-benchmark/.
[2] FS.COM, http://www.fs.com/.
[3] Introducing data center fabric, the next-generation Facebook data center network, https://code.facebook.com/posts/360346274145943/introducing-data-center-fabric-the-next-generation-facebook-data-center-network/.
[4] J. H. Ahn, N. Binkert, A. Davis, M. McLaren, and R. S. Schreiber. HyperX: Topology, Routing, and Packaging of Efficient Large-scale Networks. In SC '09, pages 41:1--41:11, Portland, Oregon, USA, November 2009.
[5] M. Al-Fares, A. Loukissas, and A. Vahdat. A Scalable, Commodity Data Center Network Architecture. In SIGCOMM '08, pages 63--74, Seattle, Washington, USA, August 2008.
[6] K. Chen, A. Singla, A. Singh, K. Ramachandran, L. Xu, Y. Zhang, X. Wen, and Y. Chen. OSA: An Optical Switching Architecture for Data Center Networks with Unprecedented Flexibility. In NSDI '12, San Jose, CA, April 2012.
[7] K. Chen, X. Wen, X. Ma, Y. Chen, Y. Xia, C. Hu, and Q. Dong. Wave-Cube: A Scalable, Fault-tolerant, High-performance Optical Data Center Architecture. In 2015 IEEE Conference on Computer Communications (INFOCOM), pages 1903--1911, April 2015.
[8] M. Chowdhury and I. Stoica. Coflow: A Networking Abstraction for Cluster Applications. In HotNets-XI, pages 31--36, Redmond, WA, 2012.
[9] N. Farrington, G. Porter, S. Radhakrishnan, H. H. Bazzaz, V. Subramanya, Y. Fainman, G. Papen, and A. Vahdat. Helios: A Hybrid Electrical/Optical Switch Architecture for Modular Data Centers. In SIGCOMM '10, pages 339--350, New Delhi, India, August 2010.
[10] M. Ghobadi, R. Mahajan, A. Phanishayee, N. Devanur, J. Kulkarni, G. Ranade, P.-A. Blanche, H. Rastegarfar, M. Glick, and D. Kilper. ProjecToR: Agile Reconfigurable Data Center Interconnect. In Proceedings of the 2016 ACM SIGCOMM Conference, SIGCOMM '16, pages 216--229, Florianopolis, Brazil, August 2016.
[11] P. Gill, N. Jain, and N. Nagappan. Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, pages 350--361, New York, NY, USA, 2011. ACM.
[12] A. Greenberg, J. R. Hamilton, N. Jain, S. Kandula, C. Kim, P. Lahiri, D. A. Maltz, P. Patel, and S. Sengupta. VL2: A Scalable and Flexible Data Center Network. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM '09, pages 51--62, New York, NY, USA, 2009. ACM.
[13] C. Guo, G. Lu, D. Li, H. Wu, X. Zhang, Y. Shi, C. Tian, Y. Zhang, and S. Lu. BCube: A High Performance, Server-centric Network Architecture for Modular Data Centers. In SIGCOMM '09, pages 63--74, Barcelona, Spain, August 2009.
[14] C. Guo, H. Wu, K. Tan, L. Shi, Y. Zhang, and S. Lu. DCell: A Scalable and Fault-Tolerant Network Structure for Data Centers. In SIGCOMM '08, pages 75--86, Seattle, Washington, USA, August 2008.
[15] D. Halperin, S. Kandula, J. Padhye, P. Bahl, and D. Wetherall. Augmenting Data Center Networks with Multi-gigabit Wireless Links. In Proceedings of the ACM SIGCOMM 2011 Conference, SIGCOMM '11, pages 38--49, Toronto, Ontario, Canada, August 2011.
[16] N. Hamedazimi, Z. Qazi, H. Gupta, V. Sekar, S. R. Das, J. P. Longtin, H. Shah, and A. Tanwer. FireFly: A Reconfigurable Wireless Data Center Fabric Using Free-space Optics. In Proceedings of the 2014 ACM Conference on SIGCOMM, SIGCOMM '14, pages 319--330, Chicago, Illinois, USA, August 2014.
[17] K. He, J. Khalid, A. Gember-Jacobson, S. Das, C. Prakash, A. Akella, L. E. Li, and M. Thottan. Measuring Control Plane Latency in SDN-enabled Switches. In Proceedings of the 1st ACM SIGCOMM Symposium on Software Defined Networking Research, SOSR '15, pages 25:1--25:6, Santa Clara, California, 2015.
[18] S. Legtchenko, N. Chen, D. Cletheroe, A. Rowstron, H. Williams, and X. Zhao. XFabric: A Reconfigurable In-Rack Network for Rack-Scale Computers. In 13th USENIX Symposium on Networked Systems Design and Implementation (NSDI 16), pages 15--29, Santa Clara, CA, 2016. USENIX Association.
[19] V. Liu, D. Halperin, A. Krishnamurthy, and T. Anderson. F10: A Fault-Tolerant Engineered Network. In 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13), pages 399--412, Lombard, IL, 2013. USENIX.
[20] Y. J. Liu, P. X. Gao, B. Wong, and S. Keshav. Quartz: A New Design Element for Low-latency DCNs. In SIGCOMM '14, pages 283--294, Chicago, Illinois, USA, August 2014.
[21] R. Niranjan Mysore, A. Pamboris, N. Farrington, N. Huang, P. Miri, S. Radhakrishnan, V. Subramanya, and A. Vahdat. PortLand: A Scalable Fault-tolerant Layer 2 Data Center Network Fabric. In Proceedings of the ACM SIGCOMM 2009 Conference on Data Communication, SIGCOMM '09, pages 39--50, New York, NY, USA, 2009. ACM.
[22] G. Porter, R. Strong, N. Farrington, A. Forencich, P. Chen-Sun, T. Rosing, Y. Fainman, G. Papen, and A. Vahdat. Integrating Microsecond Circuit Switching into the Data Center. In SIGCOMM '13, pages 447--458, Hong Kong, China, August 2013.
[23] M. Schlansker, M. Tan, J. Tourrilhes, J. R. Santos, and S.-Y. Wang. Configurable Optical Interconnects for Scalable Datacenters. In Optical Fiber Communication Conference and Exposition and the National Fiber Optic Engineers Conference (OFC/NFOEC), 2013, pages 1--3. IEEE, 2013.
[24] A. Singh, J. Ong, A. Agarwal, G. Anderson, A. Armistead, R. Bannon, S. Boving, G. Desai, B. Felderman, P. Germano, A. Kanagala, J. Provost, J. Simmons, E. Tanda, J. Wanderer, U. Hölzle, S. Stuart, and A. Vahdat. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network. In SIGCOMM '15, pages 183--197, London, United Kingdom, August 2015. ACM.
[25] A. Singla, C.-Y. Hong, L. Popa, and P. B. Godfrey. Jellyfish: Networking Data Centers Randomly. In NSDI '12, pages 1--14, San Jose, California, USA, April 2012.
[26] M. Walraed-Sullivan, A. Vahdat, and K. Marzullo. Aspen Trees: Balancing Data Center Fault Tolerance, Scalability and Cost. In Proceedings of the Ninth ACM Conference on Emerging Networking Experiments and Technologies, CoNEXT '13, pages 85--96, New York, NY, USA, 2013. ACM.
[27] G. Wang, D. G. Andersen, M. Kaminsky, K. Papagiannaki, T. S. E. Ng, M. Kozuch, and M. Ryan. c-Through: Part-time Optics in Data Centers. In SIGCOMM '10, pages 327--338, New Delhi, India, August 2010.
[28] M. C. Wu, O. Solgaard, and J. E. Ford. Optical MEMS for Lightwave Communication. Journal of Lightwave Technology, 24(12):4433--4454, December 2006.
[29] Y. Xia and T. S. E. Ng. Flat-tree: A Convertible Data Center Network Architecture from Clos to Random Graph. In Proceedings of the 15th ACM Workshop on Hot Topics in Networks, HotNets '16, pages 71--77, Atlanta, GA, November 2016.
[30] Y. Xia, X. S. Sun, S. Dzinamarira, D. Wu, X. S. Huang, and T. S. E. Ng. A Tale of Two Topologies: Exploring Convertible Data Center Network Architectures with Flat-tree. In Proceedings of the Conference of the ACM Special Interest Group on Data Communication, SIGCOMM '17, pages 295--308, New York, NY, USA, 2017. ACM.
[31] X. Zhou, Z. Zhang, Y. Zhu, Y. Li, S. Kumar, A. Vahdat, B. Y. Zhao, and H. Zheng. Mirror Mirror on the Ceiling: Flexible Wireless Links for Data Centers. In Proceedings of the ACM SIGCOMM 2012 Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication, SIGCOMM '12, pages 443--454, Helsinki, Finland, August 2012.
Index Terms
- Stop Rerouting!: Enabling ShareBackup for Failure Recovery in Data Center Networks