ABSTRACT
The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study analyzing and correlating failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.
Index Terms
- When the network crumbles: an empirical study of cloud network failures and their impact on services