Fault-tolerance analyzer: A middle layer for pre-provision testing in OpenStack☆
Introduction
Virtualization of networking resources has introduced a new, automated way for service providers to provision networks that scale seamlessly across geographical boundaries. At the same time, the decoupling of the data and control planes within conventional networks has opened immense potential for exploration and growth in the networking industry. This shift marks the beginning of the end of vendor-locked networks and opens an avenue for software-based tools that perform similar functions while being more robust, manageable and cost-effective. As a result, networking innovation is no longer dependent on vendor-shipped boxes that perform a fixed set of networking functions and become obsolete every few years, forcing service providers to upgrade at significant cost. These boxes are being replaced by Virtual Network Functions (VNFs) that provide similar functionality in software, which can be upgraded at much lower cost. As corporations move toward cloud resources for their compute and storage, cloud service providers can provision production-grade, scalable networks using automated systems and platforms such as OpenStack [1]. This enables service providers to orchestrate resources and manage commercial clouds with great ease. A large number of cloud service providers use OpenStack to provide networking, storage and compute to their customers [2]. When a customer requests a specific resource, the cloud OS gathers the resources required to fulfill the request and creates an isolated container within its infrastructure that is accessible only to the requesting customer. Since containers are isolated, they function as independent environments that contain all the resources they require, i.e., a virtual network, interfaces, security groups, storage and virtual machines. Together, these components form a functional virtual network.
In a typical business-to-business (B2B) model, a customer application spans multiple virtual machines, while a single application interface is presented to customers for interacting with the application. Customers are not always aware of the intricate deployment details of their applications (here, an application means a cloud resource), and most non-commercial users do not need this information anyway. B2B customers, however, require some level of transparency in the choices made for their deployments. In this context, this paper proposes a middle layer, shown in Fig. 1, that enables customers to analyze their virtual networks for the ability to handle faults before deployment. Fault-tolerance tests provide a basis for better decision making and more efficient services. We argue that, prior to deploying an application for large-scale usage, the service provider must analyze the fault tolerance of the application on its platform. Fault-tolerance analysis involves testing the virtual network against a number of metrics to gain insight into the deployment. Testing is essential because, for instance, an improperly load-balanced virtual network may cause service degradation and a loss of revenue for both the service provider and the customer. A number of instances of malfunctioning virtual networking elements that brought systems down for hours at a stretch have been observed [3], [4]. In essence, it is imperative that network applications are tested thoroughly for fault tolerance before being handed over to the customer.
As part of the testing, application response time is measured when a few randomly chosen servers in the designated load-balancer pool fail. The upper bound of the failure threshold can be determined through random shutdowns. We assign a numerical value to the fault tolerance of each network application; we call this number the Fault Tolerance Index (FTI).
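The random-shutdown scoring loop can be sketched as follows. The probe function, the pool names, and the ratio-based scoring formula are illustrative assumptions, not the paper's exact FTI definition:

```python
import random
import statistics

def measure_response_time(active_servers):
    """Stand-in for a real HTTP probe against the load-balanced
    application; here latency grows as the server pool shrinks."""
    base_latency_ms = 20.0
    return base_latency_ms * (10 / max(len(active_servers), 1))

def fault_tolerance_index(servers, failures, trials=100, seed=42):
    """Average ratio of baseline to degraded response time over
    random shutdowns of `failures` servers (1.0 = no degradation).
    This scoring formula is a hypothetical illustration."""
    rng = random.Random(seed)
    baseline = measure_response_time(servers)
    ratios = []
    for _ in range(trials):
        failed = set(rng.sample(servers, failures))
        remaining = [s for s in servers if s not in failed]
        ratios.append(baseline / measure_response_time(remaining))
    return statistics.mean(ratios)

pool = [f"vm-{i}" for i in range(10)]
print(round(fault_tolerance_index(pool, failures=2), 2))  # → 0.8
```

In a real deployment the probe would hit the load balancer's virtual IP; sweeping `failures` upward until the ratio collapses locates the failure threshold mentioned above.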
The remainder of this paper is organized as follows. Section 2 discusses related work. Section 3 details fault classification and Section 4 describes fault-tolerance in the context of virtual networks. Section 5 presents the system architecture and implementation of our Fault-tolerance analyzer. Section 6 shows the test setup while in Section 7 we discuss the results. Section 8 presents model accuracy and Section 9 concludes the paper.
Related work
The quantification of the amount, types and causes of network failures in wide-area networks has been thoroughly covered in the literature. An empirical analysis of network failures in a traditional data center is presented in [5]. According to the study, load balancers are the most failure-prone network component, commodity switches are highly reliable as well as low-cost, and increasing redundancy is 40% effective in eliminating the impact of failure. Shaikh et al. [6] studied OSPF performance
Fault classification
Customer Faced Bugs (CFBs) is a term commonly used by service providers for faults that customers experienced during service provision and subsequently reported. We gathered data from a well-known cloud service provider working in the OpenStack domain and classified the bugs. The analysis involved examining each bug in detail. For the purpose of this classification, CFBs from a period of eight months (Mar-15 to Oct-15) were examined. Each CFB is categorized based on its fault parameters inside
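The per-component tally behind such a classification can be sketched as below; the CFB records, field names and counts are made up for illustration and do not reproduce the paper's dataset:

```python
from collections import Counter

# Hypothetical CFB records; in the study each reported bug was
# examined manually and tagged with the faulty component.
cfbs = [
    {"id": 101, "component": "load-balancer", "month": "Mar-15"},
    {"id": 102, "component": "virtual-router", "month": "Apr-15"},
    {"id": 103, "component": "load-balancer", "month": "Jun-15"},
    {"id": 104, "component": "security-group", "month": "Oct-15"},
]

# Count how many CFBs fall into each fault category.
by_component = Counter(bug["component"] for bug in cfbs)
print(by_component.most_common(1))  # → [('load-balancer', 2)]
```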
Fault-tolerance in virtual networks
Although compute and storage have traditionally been provided as service commodities, Network-as-a-Service (NaaS), or virtual networks, is a relatively new offering. A virtual network is a logical entity typically created on top of physical infrastructure. It is the virtual equivalent of a physical network, providing connectivity to tenant resources. Tenants can create a customized virtual network by arranging the resources at will. Applications can then be deployed as desired. In
Fault-tolerance analyzer
Neutron provides networking services in OpenStack. Just as with a misconfigured application, if the virtual network is not properly stress-tested before provisioning, the service may crash, rendering it unreachable to clients. Networking-related failures are all too common in production environments, and they are more pronounced when the servers are located in multiple geographical areas, as discussed earlier.
Let F represent the faults in the system. For the purpose of this work we consider
Test setup
A multi-node OpenStack environment was set up in the data center at NUST-SEECS to evaluate the performance of the proposed scheme. Multiple Core i7 HQ (hyper-threaded quad-core) CPUs with 32 GB of RAM and a terabyte of hard drive were used as the base machines. Average scores across multiple tests were recorded, and the testing process is automated. Keeping in mind that most services require a number of servers on the cloud, for testing, servers are connected through a virtual network within the tenant
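The automated, averaged scoring step can be sketched as follows; `run_test` is a stand-in for a real end-to-end probe against the tenant's virtual network, and its scores are dummy values chosen only to make the example deterministic:

```python
import statistics

def run_test(config, trial):
    """Stand-in for one automated end-to-end test against the
    tenant's virtual network; returns a score for this trial."""
    return 90.0 + (trial % 3)  # deterministic dummy scores

def averaged_score(config, trials=9):
    """Average score over repeated automated runs, mirroring the
    setup in which average scores across multiple tests are recorded."""
    return statistics.mean(run_test(config, t) for t in range(trials))

print(averaged_score({"servers": 4}))  # → 91.0
```

Averaging over repeated runs smooths out transient effects (scheduler jitter, cold caches) that would otherwise skew a single measurement.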
Results and discussion
Clients can request many different network configurations, and it is not possible to test all of them. Here we take one test configuration and show its results. Based on our tests on a variety of network configurations, the proposed system is compatible with any OpenStack-supported network configuration.
Model accuracy
To gauge the accuracy of the proposed model, we ran tests in a production environment and compared them against our test environment. The idea is to ensure that the system output does not fluctuate significantly when used in production. The testing system was set up on the Microsoft Azure cloud: OpenStack was installed, and the fault-tolerance analyzer was configured to work with newly created networking requests. Ubuntu Server 14.04 LTS instances were used with multi-core CPUs and 32 GB of RAM. We
Conclusion
Virtual networking forms the basis for cloud applications at scale. With the increasing adoption of cloud resources, it is imperative to make provisions that ensure the fault tolerance of cloud-based applications and virtual networks. We propose a mechanism to gauge the fault tolerance of a given virtual-network-based application within the OpenStack environment. The FT Analyzer helps tenants maintain maximum up-time for services and data by preemptively testing applications for issues that can
Fizza Hussain received her M.S. (2016) in Electrical Engineering from the National University of Sciences and Technology, SEECS, Islamabad, Pakistan. She completed her B.S. (2011) at National University-FAST, Karachi, Pakistan, specializing in computer networking during her M.S. degree. Her research interests include Software Defined Networking, the Internet of Things and Cloud Computing.
References (27)
- et al. OpenStack: toward an open-source solution for cloud computing (2012)
- et al. Network virtualization: a view from the bottom. Proceedings of the 1st ACM Workshop on Virtualized Infrastructure Systems and Architectures (2009)
- Ribeiro J. Salesforce outage continues in some parts of the US. URL: http://www.pcworld.com/article/3068699/. Accessed:...
- Team T.A. Amazon ELB service event in the US-East Region. URL: https://aws.amazon.com/message/680587/. Accessed:...
- et al. Understanding network failures in data centers: measurement, analysis, and implications. ACM SIGCOMM Computer Communication Review (2011)
- et al. A case study of OSPF behavior in a large enterprise network. Proceedings of the 2nd ACM SIGCOMM Workshop on Internet Measurement (2002)
- et al. A comparison of Syslog and IS-IS for network failure analysis. Proceedings of IMC (2013)
- et al. California fault lines: understanding the causes and impact of network failures. ACM SIGCOMM Computer Communication Review (2011)
- CENIC: Corporation for Education Network Initiatives in California. URL: http://cenic.org/network/network-maps....
- et al. Experimental study of Internet stability and backbone failures. Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Digest of Papers (1999)
- Detecting large-scale system problems by mining console logs. Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles
- Analyzing causes of failures in the global research network using active measurements. International Congress on Ultra Modern Telecommunications and Control Systems and Workshops (ICUMT), 2010
- Robustness analysis of real network topologies under multiple failure scenarios. 17th European Conference on Networks and Optical Communications (NOC), 2012
Syed Ali Haider is with the Department of Computing and Information Technology at the University of Jeddah. Prior to this, he served at the National University of Sciences & Technology, SEECS, Islamabad, Pakistan. He completed his Ph.D. (2012) at UNC Charlotte, USA, and his Master's (2006) at the University of Strathclyde, Glasgow, UK. His areas of research are SDN and software-defined storage.
Abdullah AlAmri holds a Bachelor's degree in Computer Science from King Khalid University (2007) and a Master's degree in Information Technology from the School of Engineering & Mathematical Sciences, La Trobe University, Australia (2009). He received his Ph.D. in Computer Science from RMIT University (2014). He is currently an Assistant Professor in the Faculty of Computing and Information Technology at the University of Jeddah.
Mohammed A. AlQarni is currently the vice dean of the Faculty of Computing and Information Technology at the University of Jeddah. He recently graduated with a Ph.D. from McMaster University, Canada. His main research interest is the modelling of concurrent systems. He also works on various areas of networking, including SDN.
☆ Reviews processed and recommended for publication to the Editor-in-Chief by Associate Editor Dr. Z. Arnavut.