When the network crumbles: an empirical study of cloud network failures and their impact on services

Authors:
Rahul Potharaju, Purdue University
Navendu Jain, Microsoft Research

SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing, October 2013, Article No.: 15, Pages 1–17, https://doi.org/10.1145/2523616.2523638

Published: 01 October 2013

ABSTRACT

The growing demand for always-on and low-latency cloud services is driving the creation of globally distributed datacenters. A major factor affecting service availability is the reliability of the network, both inside datacenters and on the wide-area links connecting them. While several research efforts focus on building scale-out datacenter networks, little has been reported on real network failures and how they impact geo-distributed services. This paper makes one of the first attempts to characterize intra-datacenter and inter-datacenter network failures from a service perspective. We describe a large-scale study analyzing and correlating failure events over three years across multiple datacenters and thousands of network elements such as Access routers, Aggregation switches, Top-of-Rack switches, and long-haul links. Our study reveals several important findings on (a) the availability of network domains, (b) root causes, (c) service impact, (d) effectiveness of repairs, and (e) modeling failures. Finally, we outline steps based on existing network mechanisms to improve service availability.

Index Terms

1. Networks
  1. Network architectures
  2. Network services

Published in
SOCC '13: Proceedings of the 4th annual Symposium on Cloud Computing
October 2013
427 pages
ISBN: 9781450324281
DOI: 10.1145/2523616
General Chair: Guy Lohman
Copyright © 2013 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
cloud services
datacenters
inter-datacenter links
network reliability
Qualifiers
- research-article
Acceptance Rates
SOCC '13 Paper Acceptance Rate: 23 of 114 submissions, 20%
Overall Acceptance Rate: 169 of 722 submissions, 23%
Article Metrics
- Total Citations: 55
- Total Downloads: 522
- Downloads (Last 12 months): 58
- Downloads (Last 6 weeks): 8