ABSTRACT
Outages of cloud infrastructure have been widely publicized, and it would be easy to conclude that application developers need only be concerned with large-scale outages of cloud provider infrastructure. Unfortunately, this is not the case. In-cloud applications rely heavily on cloud infrastructure APIs (directly, or indirectly through scripts and consoles) for many sporadic activities such as deployment changes, scaling out/in, backup, recovery, and migration. Failures and other issues around API calls are a large source of faults that can lead to application failures, especially during sporadic activities. API-related issues can also greatly exacerbate infrastructure outages.
In this paper we present an empirical study of issues in Amazon EC2 APIs. Some of the major findings around API issues include: 1) A majority (60%) of the API failure cases are related to "stuck" or unresponsive API calls. 2) A significant portion (12%) of the cases concern slow-responding API calls. 3) 19% of the cases are related to the output of API calls, including failed calls with unclear error messages as well as missing, wrong, or unexpected output. 4) 9% of the cases report calls that were expected to perform an action and change state, but that stayed pending for some time and then returned to the original state without properly informing the caller, or that were first reported as successful but failed later. We also classify the causes of API issues and discuss the impact of API issues on application architectures.
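The four failure categories above can be made concrete with a small client-side check. The following is a minimal sketch, not the method used in the study: `classify_call`, `Outcome`, and all thresholds are hypothetical names and values chosen for illustration, and `call` and `get_state` stand in for whatever infrastructure API wrapper an application actually uses.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout
from enum import Enum


class Outcome(Enum):
    OK = "ok"
    STUCK = "stuck or unresponsive call"        # finding 1
    SLOW = "slow response"                      # finding 2
    BAD_OUTPUT = "failed or missing output"     # finding 3
    REVERTED = "state change did not hold"      # finding 4


def classify_call(call, get_state, expected_state, *,
                  timeout_s=1.0, slow_s=0.5, poll_s=0.05, settle_s=0.5):
    """Run call() and map what the caller observes onto the categories above.

    call and get_state are hypothetical callables supplied by the caller;
    the thresholds are illustrative assumptions, not values from the study.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(call)
    try:
        result = future.result(timeout=timeout_s)
    except CallTimeout:
        # No response within the deadline: treat the call as stuck.
        return Outcome.STUCK
    except Exception:
        # The call failed outright; the error may or may not be informative.
        return Outcome.BAD_OUTPUT
    finally:
        # Do not block on the worker; a stuck call keeps running in its thread.
        pool.shutdown(wait=False)
    elapsed = time.monotonic() - start

    if result is None:
        # The call returned, but without any usable output.
        return Outcome.BAD_OUTPUT

    # Do not trust the reported success: poll until the expected state is seen.
    deadline = time.monotonic() + settle_s
    while time.monotonic() < deadline:
        if get_state() == expected_state:
            return Outcome.SLOW if elapsed > slow_s else Outcome.OK
        time.sleep(poll_s)
    return Outcome.REVERTED
```

The design choice worth noting is the final polling loop: because some calls are reported as successful and fail later, the sketch confirms the state change independently instead of relying on the return value alone.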
Index Terms
- Cloud API issues: an empirical study and impact