ABSTRACT
Cloud systems are increasingly managed by operation programs termed operators, which automate tedious, human-based operations. Operators of modern management platforms like Kubernetes, Twine, and ECS implement declarative interfaces based on the state-reconciliation principle: an operation declares a desired system state, and the operator automatically reconciles the system to that declared state.
Operator correctness is critical given its impact on system operations: bugs in operator code put systems in undesired or error states, with severe consequences. However, validating operator correctness is challenging due to the enormous system-state space and the complex operation interface. A correct operator must not only satisfy the correctness properties of its own code, but must also maintain managed systems in desired states. Unfortunately, end-to-end testing of operators falls significantly short.
We present Acto, the first automatic end-to-end testing technique for cloud system operators. Acto uses a state-centric approach to test an operator together with a managed system. Acto continuously instructs an operator to reconcile a system to different states and checks if the system successfully reaches those desired states. Acto models operations as state transitions and systematically realizes state-transition sequences to exercise supported operations in different scenarios. Acto's oracles automatically check whether a system's state is as desired. To date, Acto has helped find 56 serious new bugs (42 were confirmed and 30 have been fixed) in eleven Kubernetes operators with few false alarms.
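The state-centric idea described above can be illustrated with a small sketch. This is not Acto's implementation: the names `toy_reconcile`, `buggy_reconcile`, `state_oracle`, and `run_transitions` are hypothetical, and the "system" is just an in-memory dictionary standing in for a managed system. The sketch only shows the general shape of the technique: walk a sequence of declared states, let the operator reconcile after each declaration, and have an oracle check whether the system reached the declared state.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical model: a state is a flat mapping of property names to values.
State = Dict[str, int]


def toy_reconcile(desired: State, system: State) -> None:
    """Hypothetical operator: drive `system` toward `desired` in place."""
    for key, value in desired.items():
        system[key] = value
    # Remove properties that are no longer declared.
    for key in list(system):
        if key not in desired:
            del system[key]


def buggy_reconcile(desired: State, system: State) -> None:
    """Hypothetical buggy operator: never removes undeclared properties."""
    for key, value in desired.items():
        system[key] = value


def state_oracle(desired: State, system: State) -> bool:
    """Oracle: the system state must equal the declared state."""
    return system == desired


def run_transitions(
    reconcile: Callable[[State, State], None],
    desired_states: List[State],
) -> List[Tuple[int, State, State]]:
    """Apply each declared state in order and record oracle violations."""
    system: State = {}
    failures: List[Tuple[int, State, State]] = []
    for step, desired in enumerate(desired_states):
        reconcile(desired, system)
        if not state_oracle(desired, system):
            failures.append((step, dict(desired), dict(system)))
    return failures


# A state-transition sequence: scale up and enable a feature, then
# scale down and disable the feature again.
SEQUENCE: List[State] = [
    {"replicas": 3},
    {"replicas": 5, "backup": 1},
    {"replicas": 2},
]

correct_failures = run_transitions(toy_reconcile, SEQUENCE)
buggy_failures = run_transitions(buggy_reconcile, SEQUENCE)
print("correct operator failures:", correct_failures)
print("buggy operator failures:", buggy_failures)
```

In this toy setting the buggy operator passes the first two transitions and fails only on the third, where a previously enabled property has to be removed. That is why the sketch exercises sequences of transitions rather than single declarations.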
Index Terms
- Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management