1 Introduction

Service-Oriented Computing (SOC) has emerged as a way to develop extensible computing systems, evolving from component-based software engineering. In SOC, a service is a black box to users; it is either an atomic Web service or a complex Web service composed of several smaller, loosely coupled, reusable Web services via the Business Process Execution Language (BPEL) [5]. Reliability is a key Quality of Service (QoS) attribute for selecting and composing Web services [9], especially in mission-critical domains such as the military or finance. In these domains, systems are complex and built from many component services with different reliabilities, which makes reliability analysis a challenging yet crucial task. To analyze the reliability of composite Web services, two main issues must be resolved:

Modeling the composition structure. An appropriate representation of the composition structure is the foundation of reliability analysis. Most existing reliability analysis methods assume that the structure of the composite Web service is already captured by a model such as a service graph [7] or a semi-Markov process (SMP) [9]. However, a clear explanation of how the structure model is built from the service composition is either missing or insufficient. In practice, the composition structure may vary during the integration stage, and some composite Web services may be black boxes to users. The transition from a composite Web service to a composition structure model therefore requires explicit discussion. Since the BPEL process describes the service composition, the problem of modeling the composition structure can be reduced to transforming BPEL into a composition structure model [4]. Moreover, Web services operate over an unstable Internet, and fault tolerance is an effective way to achieve high reliability. Although some existing reliability analysis methods consider fault-tolerant mechanisms in the reliability calculation, they do not represent the fault-tolerant strategies in their composition structure models [4, 9].

Calculating the composite reliability. The composite reliability integrates the component reliabilities with the transition probabilities between component services. A transition probability can be obtained by statistical analysis of service invocations or by empirical study of similar service compositions. All transition probabilities in a composition together constitute the service operation profile, which describes the pattern of external service requests in probabilistic form. Many composite reliability calculation methods use mathematical equations to integrate the component reliabilities with a high-level composition structure model [1, 7, 9]. These methods obtain the composite reliability directly and are widely applied in QoS-based service composition. However, such equations have limitations: they can become very cumbersome, and the sensitivity of the composite reliability to individual components cannot be obtained easily. Moreover, the composite reliability depends on the operation profile [2]. For a composite Web service, the operation profile may vary across time intervals according to users' requests. Although the dynamic operation profile is very important in reliability analysis, few composite reliability calculation methods consider it.

Based on the above discussion, this paper proposes a tree-based reliability analysis approach. We represent the composition structure as a Fault-tolerant Composite Web Services Tree (FCWS-T). There are two types of nodes in an FCWS-T: the control node and the service node. A service node is a leaf of the FCWS-T and represents a component service. A control node is an internal node that represents the composition activity of its children. By separating the node types in this way, the various structures of a composite Web service can be represented explicitly. Moreover, the FCWS-T can be derived directly from the BPEL process or from the composition designer's description. Given the limitations of mathematical equations, the discrete-event simulation method [3] is used here for its flexibility in describing component reliability functions. By integrating multiple operation profiles in the simulation, a varying operation profile can also be considered in the composite reliability analysis.

The remainder of the paper is organized as follows: Sect. 2 presents the FCWS-T model and the method for transforming BPEL into an FCWS-T; Sect. 3 describes the reliability analysis simulation algorithms; Sect. 4 reports experiments on a financial management service; Sect. 5 concludes.

2 The FCWS-T Model

2.1 The Definition of a FCWS-T Model

The FCWS-T is defined as a tree in this work. There are two main types of elements in a composition structure: component services and composition activities. Correspondingly, we define two types of nodes, ServiceNodes and ControlNodes, to represent them. ControlNodes represent the four basic composition activities: Sequence, If, While/Repeat and Flow [5]. A ServiceNode describes a component service's reliability and execution time. In practice, the round trip of invoking a component service is more vulnerable than the service execution itself, so the link reliability and link time of a component service are also recorded in the ServiceNode. Moreover, only a few key component services in a composite Web service will be made fault-tolerant, because applying fault-tolerant strategies is costly in time and resources. Thus, in FCWS-T, fault-tolerant strategies are defined only for ServiceNodes. According to the classification in [8], there are three main fault-tolerant strategies: Retry, Active replication and Passive replication.

Following the recursive structure of a tree, the FCWS-T model is defined node by node. Every tree node is a tuple \(\textit{TreeNode} = \left\langle \textit{type}, \textit{parent}, \textit{childList}, \textit{weight} \right\rangle \), where:

(1) type: either ServiceNode or ControlNode. A ServiceNode is \( \left\{ \textit{ServiceReli}(), \textit{ServiceTime}(), \textit{LinkReli}(), \textit{LinkTime}(), \textit{FT} \right\} \), with \(\textit{FT} \in \left\{ \textit{None}, \textit{Retry}, \textit{Passive}, \textit{Active}\right\} \). A ControlNode is one of \(\{\textit{Sequence}, \textit{If}, \textit{Flow}, \textit{While}/\textit{Repeat}\}\).

(2) parent: an FCWS-T TreeNode, the parent of the tree node.

(3) childList: \(\{\textit{child}_{1}, \textit{child}_{2}, \cdots , \textit{child}_{n}\}\), where each \(\textit{child}_{i}\) is an FCWS-T TreeNode.

(4) weight: the execution probability \(p_i\) relative to the parent node. In the If activity, \(\{p_i\}\) are the branch execution probabilities. In the While/Repeat activity, \(p_i\) is the probability of executing i times. In both of these activities, the probabilities \(p_i\) sum to 1. In the Sequence or Flow activity, all children execute in sequence or in parallel, respectively, and \(p_i\) is 1 for every child.
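The tuple definition above can be sketched as a small data structure. This is only an illustration under stated assumptions: the field names are hypothetical Python renderings of the tuple elements, constant numbers stand in for the functions ServiceReli(), ServiceTime(), LinkReli() and LinkTime(), and the weight check covers only If, Sequence and Flow, since the text does not specify how the iteration-count probabilities of While/Repeat are attached to children.

```python
from dataclasses import dataclass, field
from typing import List, Optional

CONTROL_TYPES = {"Sequence", "If", "Flow", "While"}
FT_STRATEGIES = {"None", "Retry", "Passive", "Active"}


@dataclass(eq=False)  # identity comparison avoids recursing through parent links
class TreeNode:
    node_type: str                      # "Service" or one of CONTROL_TYPES
    weight: float = 1.0                 # execution probability relative to parent
    parent: Optional["TreeNode"] = None
    child_list: List["TreeNode"] = field(default_factory=list)
    # ServiceNode-only attributes (constants used as stand-ins for functions)
    service_reli: float = 1.0
    service_time: float = 0.0
    link_reli: float = 1.0
    link_time: float = 0.0
    ft: str = "None"                    # one of FT_STRATEGIES

    def add_child(self, child: "TreeNode") -> "TreeNode":
        """Attach child to this node and return the child."""
        child.parent = self
        self.child_list.append(child)
        return child


def check_weights(node: TreeNode, tol: float = 1e-9) -> bool:
    """Check the weight rules for If (sum to 1) and Sequence/Flow (all 1)."""
    weights = [c.weight for c in node.child_list]
    if node.node_type == "If":
        ok = abs(sum(weights) - 1.0) <= tol
    elif node.node_type in ("Sequence", "Flow"):
        ok = all(abs(w - 1.0) <= tol for w in weights)
    else:
        ok = True  # Service leaves and While nodes are not checked here
    return ok and all(check_weights(c, tol) for c in node.child_list)
```

Keeping the service attributes on the same class mirrors the single TreeNode tuple in the definition; separate subclasses for ServiceNode and ControlNode would be an equally reasonable design.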

2.2 The Transition from BPEL to FCWS-T

The BPEL process of a composite Web service describes structured activities (i.e., a series of basic composition activities) through nesting and iteration [5]. We build the FCWS-T model directly by parsing the BPEL process in two steps.

(1) Extracting the WS-token String from BPEL

We define the WS-token string to represent the result of the lexical analysis of BPEL. A WS-token string consists of tuples. Each tuple represents a service sub-composition and consists of four elements: the left bracket "(", the basic composition activity, the Web service numbers, and the right bracket ")". The left bracket "(" and the right bracket ")" denote the start and the end of a sub-composition activity. The basic composition activity can be Sequence, Flow, While/Repeat or If, denoted S, F, W and I, respectively. The Web service numbers are the identifiers of the component services invoked in the sub-composition of the tuple.

The extraction process has three steps. First, the BPEL source file is split into strings by lexical analysis. Then, the strings are read in sequence and the corresponding element of a tuple is generated for each. For example, a sequence sub-composition with two component services, Service 1 and Service 2, yields the tuple (S12). Finally, once all BPEL strings have been parsed, the WS-token string is created by nesting the tuples.
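The extraction step can be approximated with a short recursive walk over the BPEL XML tree. This is a simplified sketch rather than the lexical analysis described above: it assumes WS-BPEL 2.0 element names (`sequence`, `flow`, `while`, `repeatUntil`, `if`, `invoke`), assumes that each service is identified by the `partnerLink` attribute of `invoke`, and relies on a hypothetical `service_ids` mapping from those identifiers to Web service numbers. All other elements are treated as transparent wrappers.

```python
import xml.etree.ElementTree as ET

# WS-BPEL 2.0 structured activities mapped to the token letters used in the text.
ACTIVITY_LETTERS = {
    "sequence": "S",
    "flow": "F",
    "while": "W",
    "repeatUntil": "W",
    "if": "I",
}


def local_name(tag: str) -> str:
    """Drop an XML namespace, e.g. '{uri}sequence' -> 'sequence'."""
    return tag.rsplit("}", 1)[-1]


def to_ws_tokens(element: ET.Element, service_ids: dict) -> str:
    """Return the WS-token string for the subtree rooted at element."""
    name = local_name(element.tag)
    if name == "invoke":
        # Assumption: services are identified by the partnerLink attribute.
        return str(service_ids[element.get("partnerLink")])
    inner = "".join(to_ws_tokens(child, service_ids) for child in element)
    if name in ACTIVITY_LETTERS:
        return "(" + ACTIVITY_LETTERS[name] + inner + ")"
    # Any other element (process, else, condition, ...) is transparent.
    return inner


def extract_ws_tokens(bpel_source: str, service_ids: dict) -> str:
    """Parse a BPEL document given as a string and return its WS-token string."""
    return to_ws_tokens(ET.fromstring(bpel_source), service_ids)
```

For a process containing one `sequence` with two `invoke` elements mapped to services 1 and 2, this sketch produces (S12), matching the example in the text.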

(2) Mapping the WS-token String to FCWS-T

As an intermediate representation, the WS-token string is used to transform BPEL into an FCWS-T. Every tuple of the WS-token string represents a Web services sub-composition. Algorithm 1 shows the mapping from the WS-token string to FCWS-T. The WS-token string is scanned from left to right. A sub-composition starts with the left bracket "(" and ends with the right bracket ")". New ControlNodes and ServiceNodes are created according to the basic composition activity and the Web service numbers of a tuple. When a sub-composition activity finishes, the corresponding subtree is complete and is inserted into the FCWS-T as if it were a single component service.

[Algorithm 1: mapping the WS-token string to FCWS-T]
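The left-to-right scan described for Algorithm 1 can be sketched as follows. This is not the paper's listing but a minimal illustration under two assumptions: every Web service number is a single digit, as in the examples given in the text, and a simple hypothetical `Node` class with a text label stands in for the full TreeNode.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ACTIVITIES = {"S": "Sequence", "F": "Flow", "W": "While", "I": "If"}


@dataclass(eq=False)
class Node:
    label: str
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def parse_ws_tokens(tokens: str) -> Node:
    """Scan the WS-token string left to right and build the tree."""
    root: Optional[Node] = None
    current: Optional[Node] = None  # innermost open sub-composition
    i = 0
    while i < len(tokens):
        ch = tokens[i]
        if ch == "(":
            i += 1  # the activity letter follows the left bracket
            if i >= len(tokens) or tokens[i] not in ACTIVITIES:
                raise ValueError("expected an activity letter after '('")
            node = Node(ACTIVITIES[tokens[i]], parent=current)
            if current is None:
                if root is not None:
                    raise ValueError("more than one top-level tuple")
                root = node
            else:
                current.children.append(node)
            current = node
        elif ch == ")":
            if current is None:
                raise ValueError("unbalanced ')'")
            current = current.parent  # the finished subtree stays attached
        elif ch.isdigit():
            if current is None:
                raise ValueError("service number outside a tuple")
            current.children.append(Node(f"Service {ch}", parent=current))
        else:
            raise ValueError(f"unexpected character {ch!r}")
        i += 1
    if root is None or current is not None:
        raise ValueError("unbalanced '('")
    return root


def show(node: Node) -> str:
    """Render the tree as nested text for inspection."""
    if not node.children:
        return node.label
    return node.label + "[" + ", ".join(show(c) for c in node.children) + "]"
```

For example, `show(parse_ws_tokens("(S12)"))` gives `Sequence[Service 1, Service 2]`.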

3 The Reliability Analysis Simulation Methodology

To calculate the reliability of a service composition, we need a mechanism that integrates the composition structure model with the component reliabilities. Simulation is an effective way to do this. Moreover, it can be used to explore "what-if" questions and to obtain more reliability details at the design stage [3]. Here, discrete-event simulation is adopted to study the failure behavior of each component service in the composition. A simulation algorithm for the whole composite Web service is then proposed based on the FCWS-T.

3.1 The Discrete-Event Simulation of Component Reliability

The discrete-event simulation technique [3] has been used to study the failure behavior of Web services described by a non-homogeneous continuous-time Markov chain (NHCTMC). The failures of a Web service are treated as the discrete events of the simulation. The main idea is to compare a random number x, drawn uniformly from (0, 1), with the probability that a failure occurs (i.e., an event happens) in the infinitesimal interval (t, t + dt). This failure probability is given by lambda() \(\times \) dt, where lambda() is the failure rate function, which can be provided by service developers or by an evaluating third party. If x < lambda() \(\times \) dt, a failure has happened in (t, t + dt) and the simulation returns 1; otherwise the service executes successfully and the simulation returns 0. The failure probability of the Web service over the period (0, t) is estimated as the number of failures divided by the total number of simulation runs, and the reliability is one minus this ratio.
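The comparison of a random number with lambda() \(\times \) dt can be illustrated with a short sketch. It assumes a fixed step `dt` instead of an infinitesimal one, and the function names and default parameters are hypothetical; the failure rate function `lam` is supplied by the caller.

```python
import random
from typing import Callable


def simulate_failure(lam: Callable[[float], float], t_end: float,
                     dt: float, rng: random.Random) -> int:
    """Return 1 if a failure occurs in (0, t_end), otherwise 0."""
    steps = round(t_end / dt)
    for k in range(steps):
        t = k * dt
        # A failure happens in (t, t + dt) with probability lam(t) * dt.
        if rng.random() < lam(t) * dt:
            return 1
    return 0


def estimate_reliability(lam: Callable[[float], float], t_end: float,
                         dt: float = 0.01, runs: int = 5000,
                         seed: int = 0) -> float:
    """Estimate reliability as 1 - (number of failures / number of runs)."""
    rng = random.Random(seed)
    failures = sum(simulate_failure(lam, t_end, dt, rng) for _ in range(runs))
    return 1.0 - failures / runs
```

With a constant failure rate, the estimate should approach the exponential reliability exp(-rate × t) as `dt` shrinks and the number of runs grows, which gives a simple sanity check for the sketch.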

It is costly and often infeasible to explore every fault-tolerant strategy via testing. Simulation can help developers determine how fault-tolerant Web services will perform when they are deployed. In our previous work [6], we applied the discrete-event simulation method to investigate the reliability of fault-tolerant Web services and proposed reliability simulation algorithms for the retry, active replication and passive replication strategies. Due to space constraints, the details of these simulation algorithms are not discussed here.

3.2 The Simulation Algorithm of the Composite Reliability

Because the composition structure and the component services are distinguished by ControlNodes and ServiceNodes, the composite reliability simulation only needs to traverse the FCWS-T according to the type of each tree node. Algorithm 2 shows the simulation process for composite services. The basic idea is to traverse all subtrees in preorder. Starting from the root, each subtree is simulated recursively according to the composition structure of its parent node. When the tree node is a ServiceNode, the component reliability simulation is executed: the link reliability and the service reliability are simulated sequentially. If a service or link fails, the simulation stops, and the number of failures and the execution time are recorded. Otherwise, the simulation traverses all nodes in the FCWS-T and returns the execution time.
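The traversal can be sketched as a recursive function over a simplified tree. This is an illustration, not the paper's Algorithm 2, and it rests on several assumptions: constant reliabilities stand in for the discrete-event component simulation; a Flow succeeds only if all branches succeed and takes the time of its slowest branch; and for While, `probs[k]` is taken to be the probability of executing the body k + 1 times. The class and field names are hypothetical.

```python
import random
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class Service:
    reli: float
    time: float
    link_reli: float = 1.0
    link_time: float = 0.0


@dataclass
class Control:
    kind: str  # "Sequence", "Flow", "If" or "While"
    children: List[Union["Control", Service]]
    probs: List[float] = field(default_factory=list)


def simulate_once(node, rng: random.Random) -> Tuple[bool, float]:
    """Simulate one execution; return (succeeded, elapsed time)."""
    if isinstance(node, Service):
        # The link is simulated first, then the service itself.
        if rng.random() >= node.link_reli:
            return False, node.link_time
        return rng.random() < node.reli, node.link_time + node.time
    if node.kind == "If":
        branch = rng.choices(node.children, weights=node.probs)[0]
        return simulate_once(branch, rng)
    if node.kind == "Flow":
        results = [simulate_once(c, rng) for c in node.children]
        return all(ok for ok, _ in results), max(t for _, t in results)
    if node.kind == "While":
        counts = list(range(1, len(node.probs) + 1))
        body = node.children * rng.choices(counts, weights=node.probs)[0]
    elif node.kind == "Sequence":
        body = node.children
    else:
        raise ValueError(f"unknown control node {node.kind!r}")
    total = 0.0
    for child in body:
        ok, t = simulate_once(child, rng)
        total += t
        if not ok:
            return False, total  # stop at the first failure
    return True, total


def estimate(root, runs: int = 20000, seed: int = 0) -> Tuple[float, float]:
    """Return (estimated reliability, mean execution time) over many runs."""
    rng = random.Random(seed)
    results = [simulate_once(root, rng) for _ in range(runs)]
    reliability = sum(ok for ok, _ in results) / runs
    mean_time = sum(t for _, t in results) / runs
    return reliability, mean_time
```

As a sanity check, a Sequence of two services with reliabilities 0.9 and 0.8 and perfect links should yield an estimate close to 0.72.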

[Algorithm 2: simulation of the composite reliability]

Table 1. The reliability of component Web services

Fig. 1. The operation profiles and the FCWS-T model of the financial management composite service

4 Experimental Studies

4.1 The Experiment Setup

A financial management composite service is used to demonstrate the effectiveness of our reliability analysis approach. This composite service provides a deposit and withdrawal service, an investment service and a loan service. The investment service is composed of four component services: the risk assessment service, the primary approval service, the intermediate approval service and the advanced approval service. Moreover, the passive fault-tolerant strategy is applied to the loan service to ensure its reliability. There are three loan service versions, named loanversion1, loanversion2 and loanversion3. The reliability of each component service is shown in Table 1. Because the loan service is not available during non-working hours, the working-hours operation profile differs considerably from the non-working-hours profile. A total of 14,925 test cases were executed over a period of one month, of which 10,031 fell in working hours and 4,894 in non-working hours. These two groups of test cases constitute the working-hours and non-working-hours operation profiles, which are shown in Fig. 1(a).

4.2 The Simulation Reliability Analysis Results

This section reports the results of the simulation approach in two parts. First, we show how the simulation results can be used with multiple operation profiles. Second, we demonstrate how the simulation approach identifies the reliability bottleneck, and we explore the effectiveness of different fault-tolerant strategies.

(1) The Reliability Simulation Results

The FCWS-T model is generated by transforming the BPEL of the financial management service. First, the WS-token string is extracted from the BPEL; it is (I(W1)(I(S1)(S4(F326)))). Then, the FCWS-T is generated. Figure 1(b) shows the FCWS-T of the financial management service. Based on Table 1 and Fig. 1(a), the parameters of the ServiceNodes and ControlNodes can be specified. In our examples, the LinkTime() of each service is a random value between 0 ms and 200 ms, and the LinkReli() of each service is set to 0.99, since the financial management service operates in a small local area network.

Since the working-hours and non-working-hours test cases number 10,031 and 4,894, the ratio of executions under the two operation profiles can be taken as approximately 2:1. We specify that every 1,000 simulations of the working-hours profile are followed by 500 simulations of the non-working-hours profile, so the two operation profiles are simulated alternately. With 100,000 simulation runs, the average reliability and execution time are 0.7383 and 254.84 ms. The simulated reliabilities for working hours and non-working hours are 0.7541 and 0.6762. Because the working-hours profile is executed twice as often as the non-working-hours profile, the overall result is closer to the working-hours result. Moreover, the reliability in non-working hours is much lower than in working hours. The simulation results suggest that developers need to pay more attention to the reliability of the financial management service in non-working hours.
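The alternating schedule of operation profiles can be sketched as a repeating pattern of blocks. The function name and profile labels below are hypothetical, and the sketch only decides which profile each simulation run uses; it does not perform the simulation itself.

```python
from collections import Counter
from typing import List, Tuple


def profile_schedule(blocks: List[Tuple[str, int]], total_runs: int) -> List[str]:
    """Return the profile used by each run, repeating the blocks in order."""
    one_cycle = [name for name, count in blocks for _ in range(count)]
    return [one_cycle[i % len(one_cycle)] for i in range(total_runs)]


# 1,000 working-hours runs followed by 500 non-working-hours runs, repeated.
schedule = profile_schedule([("working", 1000), ("non-working", 500)], 100_000)
counts = Counter(schedule)
print(counts["working"], counts["non-working"])
```

Under this particular schedule, 100,000 runs do not divide evenly into 1,500-run cycles, so the final cycle is truncated and the realized ratio is only approximately 2:1.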

(2) The Fault-tolerant Strategy of Web Services

Finding the component service to which the composite reliability is most sensitive is essential when applying fault-tolerant strategies. The sensitivity of every component service can be investigated by changing the component reliabilities. When the reliability of each component is increased by 10% in turn and the composite reliability is simulated again, Service 1 is found to be the most sensitive component service, i.e., the one that yields the greatest improvement in composite reliability. Applying fault-tolerant strategies to Service 1 is therefore an effective way to improve the reliability of the whole composition.
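The one-at-a-time sensitivity study can be sketched as follows. Two assumptions are made here: the 10% increase is interpreted as a relative increase capped at 1.0, and a toy series-system product stands in for the composite reliability simulation. The component reliabilities in the usage lines are invented for illustration and are not the values of Table 1.

```python
from math import prod
from typing import Callable, Dict


def sensitivity(composite: Callable[[Dict[int, float]], float],
                relis: Dict[int, float],
                increase: float = 0.10) -> Dict[int, float]:
    """Return the gain in composite reliability per perturbed component."""
    base = composite(relis)
    gains = {}
    for service_id, reli in relis.items():
        bumped = dict(relis)
        # Raise one component's reliability at a time, capped at 1.0.
        bumped[service_id] = min(1.0, reli * (1.0 + increase))
        gains[service_id] = composite(bumped) - base
    return gains


def toy_composite(relis: Dict[int, float]) -> float:
    """Stand-in for the simulation: all services executed in sequence."""
    return prod(relis.values())


gains = sensitivity(toy_composite, {1: 0.80, 2: 0.95, 3: 0.99})
most_sensitive = max(gains, key=gains.get)
```

In practice, `composite` would be replaced by a call to the FCWS-T simulation, with enough runs that the differences between components exceed the sampling noise.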

With the simulation approach, we can further explore how effectively each fault-tolerant strategy improves the reliability of Service 1 and of the whole composition. Under the Retry strategy, Service 1 is invoked up to three times until it succeeds. Under the Passive strategy, three replicas of Service 1 are executed in order, each one only if the previous one has failed. Under the Active strategy, three replicas of Service 1 are executed in parallel, and the execution result is the first one returned by the three versions. Each replica is configured with a different reliability and execution time. Table 2 shows the simulation results for Service 1 and for the whole composition under the different fault-tolerant strategies. The reliability of Service 1 is significantly improved by applying fault-tolerant strategies; however, resource usage and execution time also increase. The reliability of the whole composition is improved by 14.7%, 16.3% and 16.1%, compared with applying no fault-tolerant strategy to Service 1. The composition designer can choose a suitable strategy for improving the reliability of the whole composite Web service based on the simulation results.
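The three strategies can be sketched as small functions. These are not the algorithms of [6], whose details are not given here, but a minimal illustration under stated assumptions: each replica is a hypothetical (reliability, execution time) pair with constant reliability, and under the Active strategy the result is the fastest successful replica, or a failure after the slowest replica if all fail.

```python
import random
from typing import List, Tuple

Replica = Tuple[float, float]  # (reliability, execution time in ms)


def invoke(replica: Replica, rng: random.Random) -> Tuple[bool, float]:
    """Simulate one invocation; return (succeeded, elapsed time)."""
    reli, time_ms = replica
    return rng.random() < reli, time_ms


def retry(replica: Replica, attempts: int, rng: random.Random) -> Tuple[bool, float]:
    """Invoke the same service up to `attempts` times until it succeeds."""
    total = 0.0
    for _ in range(attempts):
        ok, t = invoke(replica, rng)
        total += t
        if ok:
            return True, total
    return False, total


def passive(replicas: List[Replica], rng: random.Random) -> Tuple[bool, float]:
    """Try replicas in order; the next runs only if the previous one failed."""
    total = 0.0
    for replica in replicas:
        ok, t = invoke(replica, rng)
        total += t
        if ok:
            return True, total
    return False, total


def active(replicas: List[Replica], rng: random.Random) -> Tuple[bool, float]:
    """Run all replicas in parallel and take the first successful return."""
    results = [invoke(replica, rng) for replica in replicas]
    ok_times = [t for ok, t in results if ok]
    if ok_times:
        return True, min(ok_times)
    return False, max(t for _, t in results)
```

The sketch makes the trade-off in the text visible: Retry and Passive add time only when earlier attempts fail, whereas Active always consumes the resources of all replicas but returns as soon as the fastest successful one completes.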

Table 2. The Reliability Results of Service 1 with Different Fault-Tolerant Strategies

5 Conclusion

This paper proposes a tree-based reliability analysis approach for fault-tolerant Web service composition. The composition structure is represented by the FCWS-T model, which is a tree. Based on the FCWS-T model and the discrete-event simulation method, the composition structure, the component reliabilities and the fault-tolerant strategies can be integrated in the composite reliability analysis. Developers can obtain not only the reliability of the whole composite Web service under multiple operation profiles, but also the sensitivity of the composite reliability to each component Web service and the effectiveness of different fault-tolerant strategies.