DOI: 10.1145/2523616.2523638
research-article

When the network crumbles: an empirical study of cloud network failures and their impact on services

Published: 01 October 2013

ABSTRACT

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside the datacenters and across the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study analyzing and correlating failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.


      • Published in

        SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
        October 2013
        427 pages
        ISBN: 9781450324281
        DOI: 10.1145/2523616

        Copyright © 2013 ACM

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Acceptance Rates

        SOCC '13 Paper Acceptance Rate: 23 of 114 submissions, 20%
        Overall Acceptance Rate: 169 of 722 submissions, 23%
