DOI: 10.1145/2465478.2465481
research-article

Cloud API issues: an empirical study and impact

Published: 17 June 2013

ABSTRACT

Cloud infrastructure outages have been widely publicized, and it would be easy to conclude that application developers need only be concerned with large-scale outages of cloud provider infrastructure. Unfortunately, this is not the case. In-cloud applications rely heavily on cloud infrastructure APIs (directly, or indirectly through scripts and consoles) for many sporadic activities such as deployment changes, scaling out/in, backup, recovery, and migration. Failures and other issues around API calls are a large source of faults that can lead to application failures, especially during sporadic activities. API-related issues can also greatly exacerbate infrastructure outages.

In this paper we present an empirical study of issues in Amazon EC2 APIs. Some of the major findings include: 1) A majority (60%) of the API failure cases involve "stuck" or unresponsive API calls. 2) A significant portion (12%) of the cases involve slow-responding API calls. 3) 19% of the cases relate to the output of API calls, including failed calls with unclear error messages as well as missing, wrong, or unexpected output. 4) 9% of the cases report calls that were expected to perform an action and change state, but either stayed pending for some time and then reverted to the original state without properly informing the caller, or were first reported as successful and failed later. We also classify the causes of API issues and discuss the impact of API issues on application architectures.
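As an illustration of what these findings can mean for calling code, the following is a minimal sketch of two defensive patterns: bounding how long a caller waits on a possibly stuck call, and confirming a reported state change by polling instead of trusting the initial response. It is not taken from the paper. The helper names `call_with_timeout` and `wait_for_state` are hypothetical, and the callables passed in stand for whatever client code actually issues the API request.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout


def call_with_timeout(api_call, timeout_s):
    """Run api_call() in a worker thread; raise CallTimeout if no result in time.

    Hypothetical helper. The worker thread is not killed on timeout; the
    caller only stops waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    try:
        return pool.submit(api_call).result(timeout=timeout_s)
    finally:
        # Return immediately instead of blocking on a stuck worker.
        pool.shutdown(wait=False)


def wait_for_state(get_state, expected, deadline_s, poll_s=0.01):
    """Poll get_state() until it returns `expected` or the deadline passes.

    Returns True if the expected state was observed, False otherwise, so a
    call that reverts or never completes is reported to the caller.
    """
    end = time.monotonic() + deadline_s
    while time.monotonic() < end:
        if get_state() == expected:
            return True
        time.sleep(poll_s)
    return False
```

A caller might wrap the request in `call_with_timeout` and then use `wait_for_state` to verify the resource actually reached the target state. Suitable timeouts and polling intervals depend on the operation and are left as assumptions here.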


      Published in
        QoSA '13: Proceedings of the 9th international ACM Sigsoft conference on Quality of software architectures
        June 2013
        180 pages
        ISBN: 9781450321266
        DOI: 10.1145/2465478

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Acceptance Rates

        QoSA '13 paper acceptance rate: 17 of 42 submissions, 40%
        Overall acceptance rate: 46 of 131 submissions, 35%
