Fault-tolerance analyzer: A middle layer for pre-provision testing in OpenStack

https://doi.org/10.1016/j.compeleceng.2017.11.019

Abstract

OpenStack is arguably the most popular open-source cloud orchestration software available today. Its user base spans from large corporations and large-scale service providers to SMEs. OpenStack manages infrastructure resources, e.g., compute, storage, and networking. While customers get a fully functional cloud, a built-in fault-tolerance analysis and suggestion system is absent from this otherwise robust cloud management system. In this work, we demonstrate a system for automated fault-tolerance (FT) analysis of network applications hosted on OpenStack. Faults are modeled by random server and/or service shutdowns. The aim of the FT Analyzer is to equip cloud tenants with the information needed to determine an optimum configuration for their applications before firing up the service. The FT Analyzer helps tenants maintain maximum up-time for services and data, and preemptively tests virtual-network-based applications for any issues that could degrade performance after deployment.

Introduction

Virtualization of networking resources has introduced a new and automated way for service providers to provision networks that can scale seamlessly and across geographical boundaries. At the same time, the decoupling of the data and control planes within conventional networks has offered immense potential for exploration and growth within the networking industry. This shift marks the beginning of the end of vendor-locked networks and opens an avenue for software-based tools that perform similar functions while being more robust, manageable, and cost-effective. As a result, networking innovation is no longer dependent on vendor-shipped boxes that perform a fixed set of networking functions and become obsolete every few years, forcing service providers to upgrade at a hefty cost. These boxes are replaced by Virtual Network Functions (VNFs) that provide similar functionality in software, which can be upgraded at much lower cost. As corporations move toward utilizing cloud resources for their compute and storage, cloud service providers are able to provision production-grade, scalable networks using automated systems and platforms such as OpenStack [1]. This enables service providers to orchestrate resources and manage commercial clouds with great ease. A large number of cloud service providers use OpenStack for providing networking, storage, and compute to their customers [2]. When a customer requests a specific resource, the cloud OS gathers the resources required to fulfill the request and creates an isolated container within its infrastructure that is accessible only by the requesting customer. Since containers are isolated, they function as independent environments that have all the resources they require, i.e., a virtual network, interfaces, security groups, storage, and virtual machines. All these components form a functional virtual network.
In a typical business-to-business (B2B) model, a customer application spans multiple virtual machines, but only an application interface is presented to customers for interacting with the application. Customers are not always aware of the intricate deployment details of their applications (here an application means a cloud resource), and most non-commercial users do not need this information anyway. B2B customers, however, require some level of transparency in the choices made for their deployments. In this context, this paper proposes a middle layer, as shown in Fig. 1, that enables customers to analyze their virtual networks for their ability to handle faults before they are deployed. Fault-tolerance tests provide a basis for better decision making and more efficient services. We argue that, prior to deploying an application for large-scale usage, the service provider must analyze the fault tolerance of the application on its platform. Fault-tolerance analysis involves testing the virtual network for a number of metrics in order to gain insight into the deployment. Testing is essential since, for instance, an improperly load-balanced virtual network may cause service degradation and loss of revenue for both the service provider and the customer. A number of instances of malfunctioning virtual networking elements that brought systems down for hours at a stretch have been observed [3], [4]. In essence, it is imperative that network applications are tested thoroughly for their fault tolerance before being handed over to the customer.

As part of the testing, application response time is measured when a few randomly chosen servers in the designated load-balancer pool fail. The upper bound of the failure threshold can be determined through random shutdowns. We assign a numerical value to the fault tolerance of each network application; we call this number the Fault Tolerance Index (FTI).
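The paper does not state the FTI formula at this point, so the sketch below is only illustrative: it simulates random shutdowns of servers in a load-balancer pool using a toy response-time model, and scores the fraction of trials that still meet an assumed SLA. The pool size, SLA threshold, and scoring rule are all assumptions, not the authors' definitions.

```python
import random

def response_time_ms(active: int, pool: int, base_ms: float = 50.0) -> float:
    """Toy model: response time grows as load concentrates on fewer servers."""
    if active == 0:
        return float("inf")  # total outage
    return base_ms * pool / active

def fault_tolerance_index(pool: int, sla_ms: float, trials: int = 1000) -> float:
    """Hypothetical FTI: fraction of random-shutdown trials still meeting the SLA."""
    rng = random.Random(42)  # seeded so repeated runs give the same score
    ok = 0
    for _ in range(trials):
        failed = rng.randint(0, pool)  # random number of servers shut down
        if response_time_ms(pool - failed, pool) <= sla_ms:
            ok += 1
    return ok / trials

print(fault_tolerance_index(pool=10, sla_ms=100.0))
```

With these toy numbers, an SLA of 100 ms against a 50 ms baseline tolerates losing up to half the pool, so the index lands near 6/11 ≈ 0.55.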

The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 details fault classification and Section 4 describes fault-tolerance in the context of virtual networks. Section 5 presents the system architecture and implementation of our Fault-tolerance analyzer. Section 6 shows the test setup while in Section 7 we discuss the results. Section 8 presents model accuracy and Section 9 concludes the paper.

Section snippets

Related work

The quantification of the amount, types, and causes of network failures in wide-area networks has been thoroughly covered in the literature. An empirical analysis of network failures in a traditional data center is presented in [5]. According to the study, load balancers are the most failure-prone network component; commodity switches are highly reliable as well as low-cost; and, finally, increasing redundancy is 40% effective in eliminating the impact of failure. Shaikh et al. [6] studied OSPF performance

Fault classification

Customer Faced Bugs (CFB) is a term commonly used by service providers; it refers to faults that customers experienced during service provisioning and subsequently reported. We gathered data from a well-known cloud service provider working in the OpenStack domain and classified the bugs. The analysis involved examining each bug in detail. For the purpose of this classification, CFBs for a period of eight months (Mar 2015–Oct 2015) were examined. Each CFB is categorized based on its fault parameters inside

Fault-tolerance in virtual networks

Although compute and storage have traditionally been provided as service commodities, Network-as-a-Service (NaaS), or virtual networks, is a relatively new offering. A virtual network is a logical entity typically created on top of physical infrastructure. It is the virtual equivalent of a physical network, providing connectivity to tenant resources. Tenants can create a customized virtual network by arranging the resources at will. Applications can then be deployed as desired. In

Fault-tolerance analyzer

Neutron provides networking services in OpenStack. Just like a misconfigured application, a virtual network that is not properly stress-tested before provisioning may cause the service to crash, rendering it unreachable for clients. Networking-related failures are all too common in production environments. These failures are more pronounced when the servers are located in multiple geographical areas, as discussed earlier.

Let f_i, i ∈ {0, 1, 2, 3}, represent faults in the system. For the purpose of this work we consider
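The snippet is truncated before the four faults are enumerated, so the labels in the sketch below are purely hypothetical placeholders; only the first two are grounded in the abstract (faults are modeled by random server and/or service shutdowns).

```python
from enum import Enum

# Hypothetical fault labels f_0..f_3; the paper's actual enumeration is not
# visible in this snippet. F0 and F1 follow the abstract (random server and/or
# service shutdowns); F2 and F3 are placeholders for illustration only.
class Fault(Enum):
    F0_SERVER_SHUTDOWN = 0     # an entire VM in the pool goes down
    F1_SERVICE_SHUTDOWN = 1    # the hosted service process stops responding
    F2_VIRTUAL_LINK_DOWN = 2   # placeholder: virtual network link failure
    F3_LOAD_BALANCER_FAULT = 3 # placeholder: load-balancer misbehavior

def inject(fault: Fault, target: str) -> str:
    """Record a fault injection against a named target (sketch only)."""
    return f"inject {fault.name} -> {target}"

print(inject(Fault.F0_SERVER_SHUTDOWN, "vm-3"))
```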

Test setup

A multi-node OpenStack environment was set up in the data center at NUST-SEECS to evaluate the performance of the proposed scheme. Multiple Core i7 HT Quad-Core (HQ) CPUs with 32 GB of RAM and a terabyte of hard-drive space were used as the base machines. Average scores were recorded across multiple tests, and the testing process is automated. Keeping in mind that most services require a number of servers on the cloud, for testing, servers are connected through a virtual network within the tenant

Results and discussion

Clients can request many different network configurations, and it is not possible to test all of them. Here we take one test configuration and present its results. Based on our tests on a variety of network configurations, the proposed system is compatible with any OpenStack-supported network configuration.

Model accuracy

In order to gauge the accuracy of the proposed model, we ran tests in a production environment and compared them with our test environment. The idea is to ensure that the system output does not fluctuate to a large degree when used in production. The testing system was set up on the Microsoft Azure cloud: OpenStack was installed, and the fault-tolerance analyzer was configured to work with newly created networking requests. Ubuntu Server 14.04 LTS instances were used with multi-core CPUs and 32 GB of RAM. We
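The comparison described above reduces to a simple check: run the analyzer in both environments and verify the scores agree within a tolerance. A minimal sketch follows; the 10% threshold is an assumed acceptance bound, not a figure from the paper.

```python
def relative_deviation(test_score: float, prod_score: float) -> float:
    """Relative difference between the lab FTI and the production FTI."""
    if test_score == 0:
        raise ValueError("test score must be non-zero")
    return abs(test_score - prod_score) / test_score

def scores_agree(test_score: float, prod_score: float, tol: float = 0.10) -> bool:
    """True when production output stays within an assumed tolerance of lab output."""
    return relative_deviation(test_score, prod_score) <= tol

# e.g. a lab FTI of 0.80 against a production FTI of 0.76 deviates by 5%
print(scores_agree(0.80, 0.76))  # -> True
```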

Conclusion

Virtual networking forms the basis for cloud applications at scale. With the increasing adoption of cloud resources, it is imperative that we make provisions to ensure the fault tolerance of cloud-based applications and virtual networks. We propose a mechanism to gauge the fault tolerance of a given virtual-network-based application within the OpenStack environment. The FT Analyzer helps tenants maintain maximum up-time for services and data by preemptively testing applications for issues that can

Fizza Hussain received her M.S. (2016) in Electrical Engineering from National University of Sciences and Technology, SEECS, Islamabad, Pakistan. She completed her B.S.(2011) from National University-FAST, Karachi, Pakistan. She specialized in Computer Networking in her M.S. degree. Her research interests include Software Defined Networking, Internet of Things and Cloud Computing.

References (27)

  • O. Sefraoui et al., OpenStack: toward an open-source solution for cloud computing (2012)
  • J. Carapinha et al., Network virtualization: a view from the bottom, Proceedings of the 1st ACM Workshop on Virtualized Infrastructure Systems and Architectures (2009)
  • Ribeiro J., Salesforce outage continues in some parts of the US. URL: http://www.pcworld.com/article/3068699/. Accessed:...
  • Team T.A., Amazon ELB service event in the US-East Region. https://aws.amazon.com/message/680587/. Accessed:...
  • P. Gill et al., Understanding network failures in data centers: measurement, analysis, and implications, ACM SIGCOMM Computer Communication Review (2011)
  • A. Shaikh et al., A case study of OSPF behavior in a large enterprise network, Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement (2002)
  • D. Turner et al., A comparison of Syslog and IS-IS for network failure analysis, Proceedings of IMC (2013)
  • D. Turner et al., California fault lines: understanding the causes and impact of network failures, ACM SIGCOMM Computer Communication Review (2011)
  • CENIC: Corporation for Education Network Initiatives in California. URL: http://cenic.org/network/network-maps....
  • C. Labovitz et al., Experimental study of Internet stability and backbone failures, Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Digest of Papers (1999)
  • W. Xu et al., Detecting large-scale system problems by mining console logs, Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (2009)
  • E.S. Myakotnykh et al., Analyzing causes of failures in the global research network using active measurements, 2010 International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT) (2010)
  • M. Manzano et al., Robustness analysis of real network topologies under multiple failure scenarios, 17th European Conference on Networks and Optical Communications (NOC) (2012)

    Syed Ali Haider is with department of Computing and Information Technology at University of Jeddah. Prior to this, he served at National University of Sciences & Technology, SEECS, Islamabad, Pakistan. He completed his Ph.D. (2012) from UNC Charlotte, USA and his Masters (2006) from University of Strathclyde, Glasgow, UK. His area of research is SDN and Software Defined Storage.

    Abdullah AlAmri holds a Bachelor degree in Computer Science from King Khalid University (2007) and a Master degree of Information Technology from School of Engineering & Mathematical Sciences, La Trobe University, Australia (2009). His Ph.D. is in Computer Science from RMIT University (2014). He is currently Assistant Professor in the Faculty of Computing and Information Technology at the University of Jeddah.

    Mohammed A. AlQarni is currently the vice dean of Faculty of Computing and Information Technology at University of Jeddah. He has graduated recently with a Ph.D. from McMaster University, Canada. His main research interest is in modelling of concurrent systems. He is also working on various areas of networking including SDN.

    Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Z. Arnavut.
