
1 Introduction

The quest for efficient methods for automated web service composition [15, 21, 22, 27] that can meet business needs and deliver guaranteed performance has long drawn significant research attention. A significant body of early work in this direction considered composition over static service repositories and proposed a number of composition solutions [14, 16, 25]. The underlying theory assumes that the functional parameters of a web service remain unchanged [9, 19] or change rarely. In reality, however, this assumption often does not hold. Service repositories today exhibit significant dynamic characteristics. A large number of new services are added to the service repository, while unpopular services are removed from it. Due to continuous change in user requirements, the interface of a web service is often modified, and thereby the functional and non-functional parameters of a service may change. The dynamic characteristics of non-functional parameters have been studied in [3, 23, 26]. The authors in [12] conducted an 11-week survey of Web services on the Internet, collected from seekda.com, webservicelist.com, and xmethods.net, to show the dynamic nature of web services. It is evident from their experiments that the number of web services, as well as their inputs and outputs, fluctuates considerably. Classical web service composition approaches that deal with static web services may not fit well in this scenario; in most cases, these methods may end up producing a solution that is no longer available at the time of execution. This is the main motivation behind our work.

In [10, 13], the authors proposed dynamic service binding during composition. Uncertainty events in service composition are dealt with in [2, 18]. In [12, 17], the authors considered dynamic service composition and proposed an algorithm based on a variation of Dijkstra’s shortest path algorithm [7]. The authors in [26] identified a set of backup services and used them to improve the reliability of the composition solution. That work, however, did not consider the dynamism of the functional parameters of a service (except for one characteristic, i.e., that a service may be unavailable), which is the main focus of our paper. In [11], an adaptation approach for web service composition has been proposed. This paper measured the changed information values that are potentially introduced when a service is updated in a business process, and demonstrated how this adaptation takes place for different workflow patterns. However, most of the above proposals can handle only the situation in which the service workflow (a specific order of execution) is known. In reality, a query is sometimes specified only in terms of input and output parameters. In such cases, the workflow is not known beforehand and, therefore, these methods fail to handle it.

In this work, we propose automated web service composition approaches that can capture the dynamic behavior of a service from the functional perspective in the situation when the workflow is unknown. To make composition more realistic in dynamic environments, we assign a probability value to each service and its functional parameters and demonstrate how composition can be done in this situation. The inherent scalability limitation of optimization formulations renders them infeasible for large problem dimensions in real time. Therefore, in our final proposal, we propose a heuristic method that can generate a solution within a reasonable time limit. In summary, this paper makes the following contributions:

  • We model the dynamic behavior of a service.

  • We propose an optimal approach and a heuristic approach to handle this model.

  • We perform extensive experiments with our proposed methods on synthetically generated datasets and benchmark datasets to show the effectiveness of our proposal.

2 Background and Problem Formulation

A web service is a software component that takes a set of inputs, performs a specific task and produces a set of outputs. In classical web service composition, we are given:

  • A set of web services \(S = \{\mathcal{{S}}_1, \mathcal{{S}}_2, \ldots , \mathcal{{S}}_n\}\).

  • For each \(\mathcal{{S}}_i \in S\), a set of inputs \(\mathcal{{S}}_i^{(ip)}\) and outputs \(\mathcal{{S}}_i^{(op)}\).

  • A query \(\mathcal{{Q}}\), expressed in terms of a set of provided inputs \(\mathcal{{Q}}^{(ip)}\) and a set of desired outputs \(\mathcal{{Q}}^{(op)}\).

The objective of the classical web service composition problem is to serve a query by providing a solution, in terms of a set of services with a specific execution order, so that the functional dependencies [4] are preserved. However, the dynamic characteristics of the services, as discussed in Sect. 1, are mostly missing in the classical setting. In this paper, we augment the above composition problem with the following features to incorporate a dynamic setting.

  • Feature-1: A service in the service repository may or may not be available for execution, when the composition is done at query time.

  • Feature-2: The interface of a service may change, and thereby, the functional parameters of a service (i.e., input-output parameters) may change at query time.

The above features capture the dynamic aspects in functional behavior of a service. In other words, these allow us to capture the possible differences in functional parameters that may occur at query time, with respect to the original service description provided in the repository. We begin by describing the details of the model for dynamics.

3 Modeling Architecture

We now present the modeling of dynamic behavior of functional parameters of a service.

3.1 Modeling of Functional Characteristics

In order to model the dynamic characteristics of the functional behavior of a service, we assign a probability value to each service and its functional parameters, as shown below.

  • Each service \(\mathcal{{S}}_i \in S\) is available in the repository at query time with probability \(p_i\), i.e.,

    $$\begin{aligned} P(\mathcal{{S}}_i \text { is available}) = p_i \end{aligned}$$
    (1)
  • An output \(io_j \in \mathcal{{S}}_i^{(op)}\) is produced by \(\mathcal{{S}}_i\) at query time with probability \(\beta _{i,j}\), i.e.,

    $$\begin{aligned} P(io_j \text { is produced by }\mathcal{{S}}_i~|~\mathcal{{S}}_i \text { is executed}) = \beta _{i,j} \end{aligned}$$
    (2)

It may be noted that Eq. 1 models Feature-1, i.e., the probability of a service being available in the repository. Equation 2 models Feature-2 as a conditional probability, namely the probability that \(\mathcal{{S}}_i\) produces an output \(io_j\), given that \(\mathcal{{S}}_i\) is executed (denoted by the classical \(|\) operator) [20].
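As a minimal sketch of this model, the record below stores \(p_i\) and the \(\beta _{i,j}\) values of one service. The class name `Service`, its field names, and the numeric values are illustrative assumptions and are not taken from the paper. The helper multiplies Eq. 1 and Eq. 2 under the additional assumption that all inputs of the service are already present, so that the service is executed whenever it is available.

```python
from dataclasses import dataclass, field


@dataclass
class Service:
    """One repository entry carrying the probabilities of Eqs. 1 and 2."""
    name: str
    inputs: frozenset
    availability: float                                # p_i (Eq. 1)
    output_probs: dict = field(default_factory=dict)   # io_j -> beta_{i,j} (Eq. 2)


def output_probability(service: Service, output: str) -> float:
    """P(output is produced at query time), assuming every input of the
    service is present, so the service runs whenever it is available."""
    return service.availability * service.output_probs.get(output, 0.0)


# Hypothetical service: available with probability 0.9; once executed,
# it produces io_3 with probability 0.8.
s1 = Service("S1", frozenset({"io_1", "io_2"}), 0.9, {"io_3": 0.8})
print(output_probability(s1, "io_3"))
```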

As already discussed earlier, once a query comes into a system, a dependency graph [4] or planning graph [5] is constructed based on the input-output dependency relationship. However, the conventional dependency graph or planning graph does not capture the dynamic characteristics of the services. Therefore, instead of constructing the dependency graph or planning graph, we propose to build a dependency network. The dependency network, which is also constructed based on service input-output dependency, is a variant of the classical dependency graph and also a variant of the planning graph. In the next subsection, we demonstrate the modeling of the dynamic behavior through our proposed dependency network.

3.2 The Dependency Network for Service Composition

Our proposed dependency network \(\mathcal{{G}} = (V_S \cup V_{io}, E)\) is a variant of the classical AND-OR graph [8] consisting of two types of nodes: AND nodes (\(V_S\)) corresponding to the services and OR nodes (\(V_{io}\)) corresponding to the inputs/outputs of services. Since a node is associated with either a service or an input/output, the node is required to be activated in the network. Initially, the nodes corresponding to the query inputs are activated in the dependency network. Once a node is activated, its corresponding output links are available in the network. As the network is traversed, more nodes are activated depending on the availability of the links. Depending on the activation criteria, we now formally define AND and OR nodes.

Definition 1

[OR Node]: A node \(u_i \in V_{io}\) is called an OR node, if it is activated on availability of one of its input links. \(\blacksquare \)

Definition 2

[AND Node]: A node \(v_i \in V_S\) is called an AND node, if the node is activated on availability of all its input links. \(\blacksquare \)

Throughout this paper, we represent an AND node using \(v_i\) and OR node using \(u_i\). We now discuss the properties of \(\mathcal{{G}}\).

  • \(\mathcal{{G}}\) is a layered network.

  • Each layer of \(\mathcal{{G}}\) consists of either a set of OR nodes or a set of AND nodes.

  • The OR layer (i.e., the layer consisting of only OR nodes) and the AND layer (i.e., the layer consisting of only AND nodes) alternate in \(\mathcal{{G}}\).

  • No links exist between nodes in the same layer. Links exist only from an AND layer to an OR layer or from an OR layer to an AND layer.

To capture the dynamic characteristics of services, the dependency network has some additional features, as below.

  • A probability value is assigned to each node and link of the dependency network.

  • Each AND node is activated with probability \(p_i\) on availability of all its input links.

  • Once an AND node \(v_i\) is activated, an output link \((v_i, u_j)\) is available with probability \(\beta _{i,j}\), where \(u_j\) is an OR node.
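The activation rules above can be illustrated by drawing one random realization of the network and propagating activation from the query inputs. This is only a sketch: the function name, the tuple layout `(inputs, {output: beta}, p)`, and the use of `random.Random` are assumptions made for illustration, and dummy nodes are ignored.

```python
import random


def sample_activation(services, query_inputs, rng):
    """Draw one realization of the dependency network.

    services: name -> (inputs, {output: beta}, p)
    Returns the activated services (AND nodes) and inputs/outputs (OR nodes).
    """
    # Each service is available with probability p (Eq. 1).
    available = {name for name, (_, _, p) in services.items()
                 if rng.random() < p}
    active_io = set(query_inputs)      # first OR layer is active by default
    active_services = set()
    changed = True
    while changed:
        changed = False
        for name in sorted(available - active_services):
            inputs, outputs, _ = services[name]
            # AND node: every input link must be active.
            if set(inputs) <= active_io:
                active_services.add(name)
                changed = True
                for out, beta in outputs.items():
                    # Output link exists with probability beta (Eq. 2);
                    # OR node: a single active input link is enough.
                    if rng.random() < beta:
                        active_io.add(out)
    return active_services, active_io
```

Repeating the draw many times and counting how often the query outputs are activated would give a Monte Carlo estimate of the solution probability; that estimate is a different technique from the exact computation developed in this paper.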

Example 1

Consider the service repository shown in Table 1. The first column of Table 1 shows the service name. Columns 2 and 3 represent the set of inputs and the outputs of the service. Column 4 represents the probability of generating an output \(io_j\) by a service \(\mathcal{{S}}_i\) given that \(\mathcal{{S}}_i\) is executed (as shown in Eq. 2). Finally, Column 5 represents the probability of availability of the service in the repository (as shown in Eq. 1).

Table 1. Definition of services

Fig. 1. Dependency Network corresponding to \(\mathcal{{Q}}\)

Consider a query \(\mathcal{{Q}}\) with inputs \(\{io_1, io_2, io_3\}\) and outputs \(\{io_{12}, io_{13}\}\). The dependency network \(\mathcal{{G}}\) constructed based on the query is shown in Fig. 1. Each AND node (represented by a circle in the figure) from \(V_S = \{v_1, v_2, \ldots , v_9\}\) represents a service, whereas each OR node (represented by a square box in the figure) from \(V_{io} = \{u_1, u_2, \ldots , u_{13}\}\) represents an input/output. Nodes in the first OR layer do not have any input link and, therefore, these nodes are activated by default. It may be noted that each AND node is annotated with a probability value, representing the probability of the service corresponding to the AND node being available in the service repository. Further, each link from an AND node \(v_i\) to an OR node \(u_j\) is annotated with a probability value \(\beta _{i,j}\), representing the probability that the service \(\mathcal{{S}}_i\) corresponding to \(v_i\) generates the output \(io_j\) corresponding to \(u_j\), given that \(\mathcal{{S}}_i\) is activated or executed. In order to make the dependency network uniform, we annotate each OR node and each link from an OR node to an AND node with probability value 1. It may be noted that, unlike a service, an input/output is not controlled externally by the service provider. Therefore, the availability of an input/output is a certain event, though its activation depends on the query inputs or on a service that produces it as its output. Hence, assigning the probability value 1 to each OR node and each link from an OR node to an AND node does not have any impact on constructing the solution.\(\blacksquare \)

We now discuss some properties of the dependency network.

  • Each AND node \(v_i\) corresponds to a service \(\mathcal{{S}}_i\). A probability \(p_i\) (i.e., \(P(\mathcal{{S}}_i \text { is available})\)) is assigned to \(v_i\), denoting \(v_i\) is available with probability \(p_i\). \(Av(v_i)\) denotes that the service corresponding to \(v_i\) is available.

    $$\begin{aligned} P (Av(v_i)) = p_i \end{aligned}$$
    (3)
  • Each OR node \(u_i\) corresponds to an input / output. A probability 1 is assigned to \(u_i\). Since a probability value is assigned to each AND node, to maintain uniformity, we assign a probability 1 to each OR node.

    $$\begin{aligned} \forall u_i \in V_{io}, P (Av(u_i)) = 1 \end{aligned}$$
    (4)
  • If an AND node \(v_i\) is activated, this implies the corresponding service \(\mathcal{{S}}_i\) is also activated. \(\mathcal{{S}}_i\) produces an output \(io_j\) with probability \(\beta _{i,j}\). Let \(u_j\) be the OR node corresponding to \(io_j\). Hence, \(\beta _{i,j}\) is assigned as the probability of having a link \((v_i, u_j)\) available in \(\mathcal{{G}}\), given that \(v_i\) is activated.

    $$\begin{aligned} P (Av(v_i, u_j) ~|~ Ac(v_i)) = \beta _{i,j} \end{aligned}$$
    (5)

    \(Ac(v_i)\) denotes \(v_i\) is activated.

  • To maintain uniformity, we assign a probability 1 to each link from an OR node to an AND node in \(\mathcal{{G}}\).

    $$\begin{aligned} \forall (u_i, v_j) \in E, P (Av(u_i, v_j) ~|~ Ac(u_i)) = 1 \end{aligned}$$
    (6)
  • A link in \(\mathcal{{G}}\) is activated, if the source node of the link is activated and the source node generates the link.

    $$\begin{aligned} \begin{aligned} P (Ac(u_i, v_j))&= P (Ac(u_i) \cap Av(u_i, v_j)) \\&= P (Ac(u_i)) . P (Av(u_i, v_j) | Ac(u_i)) \end{aligned} \end{aligned}$$
    (7)
    $$\begin{aligned} \begin{aligned} P (Ac(v_i, u_j))&= P (Ac(v_i) \cap Av(v_i, u_j))\\&= P (Ac(v_i)) . P (Av(v_i, u_j) | Ac(v_i)) \end{aligned} \end{aligned}$$
    (8)

    This follows from the definition of conditional probability \(P(A|B) = \frac{P(A \cap B)}{P(B)}\), where \(\cap \) denotes the intersection operator.

  • An AND node \(v_i\) is activated, if all its input links are activated and \(v_i\) is available.

    $$ \begin{aligned} P (Ac(v_i)) = P (\bigcap \limits _{\begin{array}{c} u_j \in V_{io}\\ \& (u_j, v_i) \in E \end{array}} Ac(u_j, v_i)) \text { } . \text { } P (Av(v_i)) \end{aligned}$$
    (9)
  • An OR node \(u_i\) is activated, if any of its input links is activated and \(u_i\) is available, represented as a union over the corresponding inputs.

    $$ \begin{aligned} P (Ac(u_i)) = P (\bigcup \limits _{\begin{array}{c} v_j \in V_{S}\\ \& (v_j, u_i) \in E \end{array}} Ac(v_j, u_i)) \text { } . \text { } P (Av(u_i)) \end{aligned}$$
    (10)

    where \(\bigcup \) denotes the union operator.
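As a worked illustration of Eqs. 7 to 10, consider a hypothetical fragment (not taken from Table 1) in which a service \(\mathcal{{S}}_1\) with \(p_1 = 0.9\) consumes two query inputs \(io_1, io_2\) and produces \(io_3\) with \(\beta _{1,3} = 0.8\), and no other service produces \(io_3\). Since the query inputs are activated with probability 1, the activation probabilities follow directly:

$$\begin{aligned} P (Ac(u_1, v_1))&= P (Ac(u_1)) . P (Av(u_1, v_1) | Ac(u_1)) = 1 \times 1 = 1 \quad \text {(Eq. 7, likewise for } u_2\text {)}\\ P (Ac(v_1))&= P (Ac(u_1, v_1) \cap Ac(u_2, v_1)) . P (Av(v_1)) = 1 \times 0.9 = 0.9 \quad \text {(Eq. 9)}\\ P (Ac(v_1, u_3))&= P (Ac(v_1)) . P (Av(v_1, u_3) | Ac(v_1)) = 0.9 \times 0.8 = 0.72 \quad \text {(Eq. 8)}\\ P (Ac(u_3))&= P (Ac(v_1, u_3)) . P (Av(u_3)) = 0.72 \times 1 = 0.72 \quad \text {(Eq. 10)} \end{aligned}$$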

3.3 Dependency Network Construction

We now present the details of the procedure of dependency network construction. We start with the query inputs and identify the set of services that can be directly or eventually (i.e., with the outputs of the services that are directly or eventually activated by the query inputs) activated by the query inputs. We construct an AND node corresponding to each service and an OR node corresponding to each input/output. The network links are constructed depending on the set of inputs and outputs of a service.

Once the network is constructed, we add two dummy nodes \(v_s\) and \(v_e\) to the network to represent the start and the end nodes of the network. \(v_s\) does not have any input link; a set of links is created from \(v_s\) to the nodes corresponding to the query inputs. Similarly, \(v_e\) does not have any output link; a set of links is created from the nodes corresponding to the query outputs to \(v_e\). While constructing the dependency network, we start with the query inputs. Therefore, each node is connected to \(v_s\) through some path. However, it is not necessary that each path of the network starting from \(v_s\) ends at \(v_e\). Therefore, we traverse the dependency network backward and identify the set of nodes belonging to the paths from \(v_e\) to \(v_s\). Finally, we remove the remaining nodes, which do not belong to any path from \(v_s\) to \(v_e\), since these nodes do not take any part in resolving the query. It may be noted that \(v_s\) in the network is available and activated with probability 1, whereas \(v_e\) has to be available with probability 1 for the query to be fulfilled.

$$\begin{aligned} P (Av(v_s)) = P (Ac(v_s)) = 1; P (Av(v_e)) = 1 \end{aligned}$$
(11)

It may also be noted further that if \(v_e\) is activated with probability \(p_e\), this implies we obtain the query outputs with probability \(p_e\).
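The construction just described (forward expansion from the query inputs, followed by backward pruning from the query outputs) can be sketched as follows. The function name and the dictionary layout `name -> (inputs, outputs)` are assumptions of this sketch; probabilities and the dummy nodes \(v_s\), \(v_e\) are left out to keep it short.

```python
def build_dependency_network(services, query_inputs, query_outputs):
    """Return (services kept, inputs/outputs kept), or None if some query
    output can never be produced.

    services: name -> (inputs, outputs)
    """
    # Forward pass: services that are directly or eventually activated
    # by the query inputs.
    known = set(query_inputs)
    reachable = set()
    changed = True
    while changed:
        changed = False
        for name, (ins, outs) in services.items():
            if name not in reachable and set(ins) <= known:
                reachable.add(name)
                known |= set(outs)
                changed = True
    if not set(query_outputs) <= known:
        return None

    # Backward pass: keep only nodes lying on some path to the query outputs.
    kept_io = set(query_outputs)
    kept_services = set()
    frontier = list(query_outputs)
    while frontier:
        io = frontier.pop()
        for name in reachable:
            ins, outs = services[name]
            if io in outs and name not in kept_services:
                kept_services.add(name)
                for i in ins:
                    if i not in kept_io:
                        kept_io.add(i)
                        frontier.append(i)
    return kept_services, kept_io
```

In this sketch a service whose outputs never contribute to a query output is discarded by the backward pass, and a service whose inputs can never be obtained is never reached by the forward pass.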

Layering the Dependency Network. Once the dependency network is constructed, the network is divided into multiple layers. This step is required for constructing a solution to a query. A node \(v_i \in V_S\) in \(\mathcal{{G}}\) belongs to a layer L, if the following condition is satisfied: for all \(u_j \in V_{io}\), such that \((u_j, v_i)\) is a link in \(\mathcal{{G}}\), \(u_j\) belongs to a layer \(L'\), where \(L' < L\). Similarly, a node \(u_i \in V_{io}\) in \(\mathcal{{G}}\) belongs to a layer L, if for all \(v_j \in V_{S}\), such that \((v_j, u_i)\) is a link in \(\mathcal{{G}}\), \(v_j\) belongs to a layer \(L'\), where \(L' < L\). The first layer of the network contains only \(v_s\). A layer of the network is mathematically defined as:

$$\begin{aligned} \begin{aligned} V_L = {\left\{ \begin{array}{ll} \{v_s\},&{} L = 0 \\ \{u_i \in V_{io} | \forall v_j \in V_{S}, (v_j, u_i) \in E, v_j \in V_{L'}, L'< L\}, &{} L \text { is odd} \\ \{v_i \in V_{S} | \forall u_j \in V_{io}, (u_j, v_i) \in E, u_j \in V_{L'}, L' < L\},&{} L \text { is even}\\ \end{array}\right. } \end{aligned} \end{aligned}$$
(12)
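Equation 12 constrains the layer of a node but does not fix it uniquely. One possible reading, sketched below, places every node in the earliest layer that lies after all of its predecessors. The function name and data layout are assumptions of this sketch; it further assumes that the network is acyclic, already pruned, that every service has at least one input, and that service names and input/output names do not collide.

```python
def assign_layers(services, query_inputs):
    """Earliest-layer assignment consistent with Eq. 12.

    services: name -> (inputs, outputs)
    Returns node -> layer, with v_s in layer 0, OR nodes in odd layers
    and AND nodes in even layers.
    """
    producers = {}
    for name, (_, outs) in services.items():
        for io in outs:
            producers.setdefault(io, []).append(name)

    layer = {"v_s": 0}

    def io_layer(io):
        if io not in layer:
            preds = [service_layer(s) for s in producers.get(io, [])]
            if io in query_inputs:
                preds.append(0)            # link coming from v_s
            layer[io] = max(preds) + 1
        return layer[io]

    def service_layer(name):
        if name not in layer:
            layer[name] = max(io_layer(i) for i in services[name][0]) + 1
        return layer[name]

    for name, (_, outs) in services.items():
        service_layer(name)
        for io in outs:
            io_layer(io)
    return layer
```

With this assignment a link may span more than one layer, for example when a service consumes both a query input and an output produced much later; such links are the ones that require dummy nodes.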

In order to ensure that the input-output edge relationship does not span across multiple layers, we add some dummy nodes in the network. In other words, if a link does not connect two nodes between two consecutive layers, we add a set of dummy nodes in the network. Consider a link \((u_i, v_j) \in E\), such that \(u_i\) belongs to a layer L and \(v_j\) belongs to a layer \(L'\) and \((L' - L) > 1\). In this case, we do the following:

  • \(\forall L_k\), such that \(L< L_k < L'\), we add a dummy node \(w_{L_k}\) in layer \(L_k\). Following the layer parity of Eq. 12, if \(L_k\) is odd, \(w_{L_k}\) is an OR node. Otherwise, \(w_{L_k}\) is an AND node.

  • We add the following set of links in the network: \(\{(u_i, w_{(L + 1)}), (w_{(L + 1)}, w_{(L + 2)}),\) \( \ldots ,\) \( (w_{(L' - 1)}, v_j)\}\).

  • \(\forall w_{L_k} \in \{w_{(L + 1)}, w_{(L + 2)}, \ldots , w_{(L' - 1)}\}\), we assign \(P (Av(w_{L_k})) = 1\). This essentially means, the dummy nodes are always available with probability 1. This is required to ensure that the probability of generating a solution from the dependency network is not affected due to the insertion of the dummy nodes.

  • \(\forall (x, y) \in \{(u_i, w_{(L + 1)}), (w_{(L + 1)}, w_{(L + 2)}), \ldots , (w_{(L' - 1)}, \) \(v_j)\}\), we assign \(P (Av((x, y)) | Ac(x)) = 1\). This condition implies, once a dummy node is activated, it generates all its outputs with probability 1.
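The dummy-node insertion can be sketched as below. The function name, the naming scheme for dummy nodes, and the dictionary fields are hypothetical. The node type is chosen from the parity convention of Eq. 12 (odd layers hold OR nodes, even layers hold AND nodes), and every dummy node receives availability 1 so that the solution probability is unchanged.

```python
def insert_dummy_nodes(edges, layer):
    """Split every link that skips layers into a chain of unit-length links.

    edges: list of (source, target) pairs
    layer: node -> layer index
    Returns the new edge list and a description of the dummy nodes.
    """
    new_edges, dummies = [], {}
    for src, dst in edges:
        prev = src
        # One dummy node for each intermediate layer L < L_k < L'.
        for k in range(layer[src] + 1, layer[dst]):
            w = f"w{k}[{src}->{dst}]"
            dummies[w] = {
                "layer": k,
                "kind": "OR" if k % 2 else "AND",   # parity as in Eq. 12
                "availability": 1.0,                # P(Av(w)) = 1
            }
            new_edges.append((prev, w))
            prev = w
        new_edges.append((prev, dst))
    return new_edges, dummies
```

Links between consecutive layers pass through unchanged, because the inner loop is empty for them.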

4 Dynamic Service Composition

It may be noted that the dependency network of a query contains all possible solutions to the query. The probability of generating a solution to a query from the dependency network is maximized, if the entire dependency network is returned as a solution to the query. However, in this case, the solution becomes inefficient in terms of its cost and quality, since the solution contains a set of redundant services (i.e., services without which the solution construction is possible). Therefore, here our objective is to minimize the number of services. In addition to the classical web service composition problem, we have the following constraint: the desired solution has to be obtained with a probability greater than or equal to \(\alpha _{solution}\), \((0 \le \alpha _{solution} \le 1)\), where \(\alpha _{solution}\) is provided by the user.

5 Solution Generation

We now discuss the procedure for generating a solution to a query from the dependency network. It may be noted that each solution to a query is a subnetwork of the dependency network. However, the converse is not true, i.e., each subnetwork of the dependency network is not a solution to the query. This is mainly because of the following reason. Each AND node of the dependency network is associated with a service and a service is activated only when all its inputs are available. Therefore, an AND node requires all its input links to be available. However, each OR node is associated with an input/output, which is available if a service produces it or it is a part of the query inputs. Therefore, only one link is sufficient to activate an OR node. Hence, a solution subnetwork requires (a) each AND node, belonging to it, to have the same number of incoming links as in the dependency network and (b) each OR node, belonging to it, to have at least one incoming link. Therefore, for ease of analysis, to find the optimal solution satisfying all the constraints, we transform the dependency network to a hyper dependency network, where each path of the hyper dependency network provides a solution to the query. We now define a few terms related to the hyper dependency network.

Definition 3

[Hyper Node]: A node containing multiple homogeneous nodes (i.e., either AND nodes or OR nodes) of a dependency network is called a hyper node. \(\blacksquare \)

A hyper node consisting of multiple OR nodes of a dependency network is an OR hyper node. Similarly, a hyper node consisting of multiple AND nodes of a dependency network is an AND hyper node.

Definition 4

[Hyper Edge]: An edge containing multiple edges of a dependency network is called a hyper edge. \(\blacksquare \)

Definition 5

[Hyper Dependency Network \(\mathcal{{HG}} = (HV_S \cup HV_{io}, HE)\)]: A dependency network consisting of hyper nodes and hyper edges is called a hyper dependency network. \(\blacksquare \)

We now present the construction of the hyper dependency network \(\mathcal{{HG}} = (HV_S \cup HV_{io}, HE)\) from a dependency network \(\mathcal{{G}}\). We start with \(v_e\) of \(\mathcal{{G}}\). We then identify the set of nodes that are responsible for activating \(v_e\) and construct a hyper node consisting of this set. Once we have an OR hyper node \(hu \in HV_{io}\), we construct all possible combinations of AND nodes that are responsible for activating each OR node corresponding to hu and for each combination, we construct an AND hyper node. In the process of constructing an AND hyper node hv, we keep track of all the edges that have been considered during this construction and create a hyper edge (hv, hu) consisting of the set of edges.

Example 2

Consider the dependency network \(\mathcal{{G}}\) as shown in Fig. 2(a). Figure 2(b) shows the hyper dependency network \(\mathcal{{HG}}\) corresponding to \(\mathcal{{G}}\). Consider an OR hyper node \(hu_9\) consisting of \(u_{12}\) and \(u_{13}\). The set of edges that can activate \(u_{12}\) is \(\{(v_6, u_{12}), (v_8, u_{12})\}\), while the set of edges that can activate \(u_{13}\) is \(\{(v_7, u_{13}), (v_9, u_{13})\}\). Therefore, four combinations are possible to activate \(hu_9\), i.e., \(\{(v_6, u_{12}), (v_7, u_{13})\}\), \(\{(v_6, u_{12}), (v_9, u_{13})\}\), \(\{(v_8, u_{12}), (v_7, u_{13})\}\) and \(\{(v_8, u_{12})\) \(, (v_9, u_{13})\}\). Considering these combinations, we have four AND hyper nodes \(hv_6: \{v_6, v_7\}\), \(hv_7: \{v_6, v_9\}\), \(hv_8: \{v_8, v_7\}\) and \(hv_9: \{v_8, v_9\}\) and four hyper edges \(he_{21}: \{(v_6, u_{12}), (v_7, u_{13})\}\), \(he_{22}: \{(v_6, u_{12}), (v_9, u_{13})\}\), \(he_{23}: \{(v_8, u_{12}), (v_7, u_{13})\}\) and \(he_{24}: \{(v_8, u_{12}), (v_9, u_{13})\}\).\(\blacksquare \)
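The enumeration in Example 2 is a Cartesian product over the producers of each OR node. A minimal sketch, assuming a hypothetical `producers` mapping from each OR node to the AND nodes that can activate it:

```python
from itertools import product


def and_hyper_nodes(or_hyper_node, producers):
    """Enumerate the AND hyper nodes (and hyper edges) that can activate
    an OR hyper node.

    or_hyper_node: list of OR nodes
    producers: OR node -> list of AND nodes with a link to it
    """
    # One list of candidate links per OR node of the hyper node.
    choices = [[(v, u) for v in producers[u]] for u in or_hyper_node]
    result = []
    for combo in product(*choices):          # pick one link per OR node
        hyper_edge = frozenset(combo)
        hyper_node = frozenset(v for v, _ in combo)
        result.append((hyper_node, hyper_edge))
    return result


# The OR hyper node hu_9 = {u12, u13} of Example 2.
for node, edge in and_hyper_nodes(
        ["u12", "u13"],
        {"u12": ["v6", "v8"], "u13": ["v7", "v9"]}):
    print(sorted(node), sorted(edge))
```

The number of combinations is the product of the numbers of producers of the individual OR nodes, which is why this step grows quickly with the size of the network.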

Once we have an AND hyper node \(hv \in HV_S\), we construct the set of OR nodes that can activate each AND node corresponding to hv and construct an OR hyper node hu consisting of this set. We then construct a hyper edge (hu, hv) in a similar manner as discussed above.

Example 3

Consider an AND hyper node \(hv_6\) consisting of \(v_6\) and \(v_7\). The set of edges that can activate \(v_6\) is \(\{(u_8, v_6)\}\), while the set of edges that can activate \(v_7\) is \(\{(u_9, v_7)\}\). Therefore, the hyper dependency network contains an OR hyper node \(hu_5: \{u_8, u_9\}\) and a hyper edge \(he_{17}: \{(u_8, v_6), (u_9, v_7)\}\). \(\blacksquare \)

Fig. 2. (a) Dependency network (b) Hyper dependency network (c) An USDN (Color figure online)

Note that an AND hyper node always has inDegree (i.e., the number of incoming links) 1, whereas an OR hyper node may have inDegree greater than 1. We now define two more terms.

Definition 6

[Solution Dependency Network (SDN)]: A subnetwork SDN \(\mathcal{{G}}' = (V'_S \cup V'_{io}, E')\) of the dependency network \(\mathcal{{G}} = (V_S \cup V_{io}, E)\) is a connected network, such that the following conditions hold:

  • \(V'_S \subseteq V_S\), \(V'_{io} \subseteq V_{io}\), \(E' \subseteq E\) and \(v_s, v_e \in V'_S\).

  • Each node and link in \(\mathcal{{G}}'\) belongs to at least one path from \(v_s\) to \(v_e\) in \(\mathcal{{G}}'\).

  • \(\forall u_i \in V'_{io}\), \(1 \le inDegree(\mathcal{{G}}', u_i) \le inDegree(\mathcal{{G}}, u_i)\).

  • \(\forall v_i \in V'_S\), \(inDegree(\mathcal{{G}}', v_i) = inDegree(\mathcal{{G}}, v_i)\).

  • The probability assignments for each node and link in \(\mathcal{{G}}'\) are same as in \(\mathcal{{G}}\).

where \(inDegree(\mathcal{{G}}, v_i)\) denotes the inDegree [7] of \(v_i\) in \(\mathcal {G}\). \(\blacksquare \)

It may be noted that the SDN is a subnetwork of the dependency network. Therefore, while the dependency network constructed based on a query consists of all solutions to the query, the SDN of the dependency network consists of only a subset of solutions. We now define the notion of a unique solution dependency network representing a unique solution to a query, where a unique solution refers to a solution, in which each service is dependent only on one service for a specific input.

Definition 7

[Unique Solution Dependency Network (USDN)]: An USDN \(\mathcal{{G}}' = (V'_S \cup V'_{io}, E')\) of a dependency network \(\mathcal{{G}} = (V_S \cup V_{io}, E)\) is a SDN of \(\mathcal{{G}}\), such that, \(\forall u_i \in V'_{io}\), \(inDegree(\mathcal{{G}}', u_i) = 1\). \(\blacksquare \)

It may be noted that each path of the hyper dependency network is an USDN.
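The inDegree conditions of Definitions 6 and 7 can be checked mechanically. The sketch below is illustrative only: the function names and the edge-list representation are assumptions, and the check covers the inDegree conditions but not the requirement that every node lies on a path from \(v_s\) to \(v_e\).

```python
from collections import Counter


def in_degrees(edges):
    """Number of incoming links per node."""
    return Counter(dst for _, dst in edges)


def satisfies_usdn_degrees(sub_edges, full_edges, or_nodes):
    """Check the inDegree conditions of an USDN.

    sub_edges: links of the candidate subnetwork
    full_edges: links of the dependency network
    or_nodes: set of OR nodes; every other node is treated as an AND node
    """
    if not set(sub_edges) <= set(full_edges):
        return False
    sub_in, full_in = in_degrees(sub_edges), in_degrees(full_edges)
    nodes = {n for edge in sub_edges for n in edge}
    for n in nodes:
        if n in or_nodes:
            if sub_in[n] != 1:              # Definition 7
                return False
        elif sub_in[n] != full_in[n]:       # AND node keeps all its inputs
            return False
    return True
```

A subnetwork in which some OR node keeps two incoming links can still be a SDN, but it fails this check because it merges more than one USDN.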

Example 4

Figure 2(c) shows an USDN corresponding to the dependency network shown in Fig. 2(a), which is generated from a path of the hyper dependency network as shown in Fig. 2(b). In the figure, the path is shown by a red dashed line. \(\blacksquare \)

It may be noted that being a SDN, an USDN constitutes a solution to a query. We now prove the following lemma.

Lemma 1

An USDN contains a single solution to a query. \(\blacksquare \)

Proof

In order to prove the above lemma, it is sufficient to prove that if a node is removed from an USDN, no solution is produced from the USDN. We prove this by contradiction. Assume that a solution can be produced after removing a node from the USDN. We first consider the case where the removed node is an OR node \(u_i \in V'_{io}\). Since \(u_i\) belongs to at least one path from \(v_s\) to \(v_e\), the path from \(u_i\) to \(v_e\) becomes invalid after the removal of \(u_i\). We now consider the OR node \(u_j \in V'_{io}\) belonging to the path from \(u_i\) to \(v_e\), such that \((u_j, v_e) \in E'\). Being part of the USDN, \(u_j\) has a single input link and, therefore, \(u_j\) can no longer be activated, as the entire path from \(u_i\) to \(v_e\) is invalid. Therefore, the link \((u_j, v_e)\) cannot be activated either and, hence, neither can \(v_e\). As a result, no solution to the query is produced, which contradicts our assumption. Using a similar argument, it can be proved that no solution can be produced by removing an AND node from the USDN.

It is evident from the above lemma that a solution to a query consists of at least one USDN. Although a single USDN is enough to generate a solution with a certain probability, multiple USDNs may be included in a solution to increase the probability of generating it. The following lemma helps to compute the probability of the solution characterized by a USDN.

Lemma 2

In an USDN \(\mathcal{{G}}'\) of a dependency network \(\mathcal{{G}}\),

$$\begin{aligned} P(Ac(v_e)) = \prod \limits _{v_i \in V'_S} {P(Av(v_i))} . \prod \limits _{(v_i, u_j) \in E'} {P(Av(v_i, u_j) | Ac(v_i))} \end{aligned}$$

i.e., \(P(Ac(v_e))\) is equal to the product of the probabilities of all nodes and links belonging to any path from \(v_s\) to \(v_e\). \(\blacksquare \)
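Lemma 2 translates into a single product. In the sketch below the function name and the two dictionaries are hypothetical, and the numbers are made up for illustration. OR nodes and links leaving OR nodes are omitted because their probability is 1.

```python
from math import prod


def usdn_activation_probability(node_availability, link_probability):
    """P(Ac(v_e)) of an USDN as stated in Lemma 2.

    node_availability: AND node v_i -> P(Av(v_i))
    link_probability: link (v_i, u_j) -> P(Av(v_i, u_j) | Ac(v_i))
    """
    return prod(node_availability.values()) * prod(link_probability.values())


# Hypothetical USDN with two services and one output link each.
p = usdn_activation_probability(
    {"v_1": 0.9, "v_2": 0.8},
    {("v_1", "u_3"): 0.5, ("v_2", "u_4"): 1.0},
)
print(round(p, 2))
```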

The proof of this lemma is omitted due to space limitations. If a SDN \(\mathcal{{G}}'\) of a dependency network \(\mathcal{{G}}\) consists of m USDNs \(\mathcal{{G}}_1, \mathcal{{G}}_2, \ldots , \mathcal{{G}}_m\), the activation probability of \(v_e\) of the SDN can be expressed as:

$$\begin{aligned} P(Ac^{\mathcal{{G}}'}(v_e)) = P(\bigcup \limits _{i = 1}^{m} Ac^{\mathcal{{G}}_i}(v_e)) \end{aligned}$$
(13)

It may be noted that each \(\mathcal{{G}}_i\) is an USDN of \(\mathcal{{G}}'\), for \(i = 1, 2, \ldots , m\) and \(Ac^{\mathcal{{G}}_i}(v_e)\) denotes the activation event of \(v_e\) of \(\mathcal{{G}}_i\). \(P(\bigcup \limits _{i = 1}^{m} Ac^{\mathcal{{G}}_i}(v_e))\) can be computed using the principle of inclusion and exclusion [6]. The end node \(v_e\) of a SDN \(\mathcal{{G}}'\) can be activated through any of the USDNs belonging to the SDN. Therefore, the activation event of \(v_e\) in a SDN can be expressed as the union of the activation events of \(v_e\) for each USDN corresponding to \(\mathcal{{G}}'\).

The principle of inclusion and exclusion is stated below.

\(P(A_1 \cup A_2 \cup \ldots \cup A_n) = \sum _{i} P(A_i) - \sum _{i< j} P(A_i \cap A_j) + \sum _{i< j< k} P(A_i \cap A_j \cap A_k) - \ldots + (-1)^{n+1} P(A_1 \cap A_2 \cap \ldots \cap A_n)\)

where each \(A_i\), for \(i = 1,2,\ldots ,n\), represents an event of a random experiment.

In case of two USDN networks, the above expression can be applied as:

$$\begin{aligned} P(Ac^{\mathcal{{G}}_1}(v_e) \cup Ac^{\mathcal{{G}}_2}(v_e)) = P (Ac^{\mathcal{{G}}_1}(v_e)) + P(Ac^{\mathcal{{G}}_2}(v_e)) - P (Ac^{\mathcal{{G}}_1}(v_e) \cap Ac^{\mathcal{{G}}_2}(v_e)) \end{aligned}$$
$$\begin{aligned} P (Ac^{\mathcal{{G}}_1}(v_e) \cap Ac^{\mathcal{{G}}_2}(v_e)) = P (Ac^{\mathcal{{G}}_1}(v_e)). ~P(Ac^{\mathcal{{G}}_2}(v_e)|Ac^{\mathcal{{G}}_1}(v_e)) \end{aligned}$$

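Assuming, as the product form of Lemma 2 suggests, that the availability events of distinct nodes and links are independent, \(P (Ac^{\mathcal{{G}}_1}(v_e) \cap Ac^{\mathcal{{G}}_2}(v_e))\) is computed by multiplying all the link and node probabilities of the SDN containing \(\mathcal{{G}}_1\) and \(\mathcal{{G}}_2\). Correspondingly, \(P(Ac^{\mathcal{{G}}_2}(v_e)|Ac^{\mathcal{{G}}_1}(v_e))\) is the product of the probabilities of those nodes and links of \(\mathcal{{G}}_2\) that do not belong to \(\mathcal{{G}}_1\).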
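The inclusion and exclusion computation can be sketched for any number of USDNs. The sketch rests on an assumption that is made explicit here: the availability events of distinct nodes and links are independent, so that the joint activation probability of several USDNs is the product over the union of their nodes and links, with shared elements counted once. The function names and the dictionary representation (element to probability) are hypothetical.

```python
from itertools import combinations
from math import prod


def joint_probability(usdns):
    """P(all given USDNs activate v_e), under the independence assumption.

    Each USDN is a dict mapping a node or link to its probability;
    elements shared by several USDNs are counted only once.
    """
    merged = {}
    for g in usdns:
        merged.update(g)
    return prod(merged.values())


def union_probability(usdns):
    """P(at least one USDN activates v_e), by inclusion and exclusion."""
    total = 0.0
    for r in range(1, len(usdns) + 1):
        sign = 1 if r % 2 else -1
        for subset in combinations(usdns, r):
            total += sign * joint_probability(subset)
    return total


# Two hypothetical USDNs that share the service S0.
g1 = {"S0": 0.5, "S1": 0.9}
g2 = {"S0": 0.5, "S2": 0.8}
print(round(union_probability([g1, g2]), 2))
```

The number of terms doubles with every additional USDN, so this direct evaluation is practical only for a small number of USDNs.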

We now demonstrate the algorithm for generating a solution to a query from the dependency network. The first step of our algorithm is to generate the hyper dependency network \(\mathcal{{HG}}\) from a given dependency network \(\mathcal{{G}}\). Once we have a hyper dependency network, we construct an USDN from each path of the hyper dependency network. To construct an USDN from a path of the hyper dependency network, we split each hyper node and hyper edge into the set of nodes and edges of the dependency network. Let \(G^*\) be the set of all possible USDNs constructed from \(\mathcal{{G}}\). We compute the activation probability of \(v_e\) for each USDN as stated in Lemma 2. We then construct the power set \(\wp ^{G^*}\) of \(G^*\). It may be noted that each non-empty set belonging to \(\wp ^{G^*}\), which is a subset of \(G^*\), forms a SDN of \(\mathcal{{G}}\). Since our objective is to minimize the number of services in a solution, we next construct a list of SDNs sorted in ascending order of the number of AND nodes in the network, without counting the dummy nodes. Finally, we consider the SDNs one by one from the sorted list and compute the activation probability of \(v_e\) of each SDN as stated in Eq. 13. As soon as we find a SDN for which the activation probability of \(v_e\) is greater than or equal to \(\alpha _{solution}\), we return that SDN. If no SDN \(\mathcal{{G}}'\) of \(\mathcal{{G}}\) exists for which \(P(Ac^{\mathcal{{G}}'}(v_e)) \ge \alpha _{solution}\), the query cannot be answered with probability \(\alpha _{solution}\).
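The final selection step can be sketched as follows, given the set of USDNs. As before, independence of node and link events is an assumption of the sketch, and all names are hypothetical. Services are represented by string keys and links by tuple keys, and no dummy nodes are present.

```python
from itertools import combinations
from math import prod


def union_probability(usdns):
    """Inclusion and exclusion over USDNs, assuming independent elements."""
    total = 0.0
    for r in range(1, len(usdns) + 1):
        for subset in combinations(usdns, r):
            merged = {}
            for g in subset:
                merged.update(g)
            total += (1 if r % 2 else -1) * prod(merged.values())
    return total


def count_services(subset):
    """Number of distinct services (string keys) in a set of USDNs."""
    return len({k for g in subset for k in g if isinstance(k, str)})


def smallest_sdn(usdns, alpha):
    """Return the SDN (a tuple of USDNs) with the fewest services whose
    activation probability reaches alpha, or None if there is none."""
    candidates = [c for r in range(1, len(usdns) + 1)
                  for c in combinations(usdns, r)]   # non-empty power set
    candidates.sort(key=count_services)
    for candidate in candidates:
        if union_probability(candidate) >= alpha:
            return candidate
    return None


# Two hypothetical single-service USDNs with probabilities 0.72 and 0.5.
g1 = {"S1": 0.9, ("S1", "b"): 0.8}
g2 = {"S2": 0.5, ("S2", "b"): 1.0}
print(len(smallest_sdn([g1, g2], 0.8)))
```

With a threshold of 0.8 neither USDN suffices on its own, so the sketch returns the SDN containing both.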

Here, we have presented an optimal algorithm for service composition in a dynamic environment. Although the optimal algorithm is able to generate the optimal solution satisfying all the constraints, it suffers from scalability issues for large problem dimensions in real time. This is mainly because of the following two expensive operations.

  1. Generation of the hyper dependency network from a dependency network: while constructing the AND hyper layer, we construct all possible AND hyper nodes for each OR hyper node, and the number of such hyper nodes explodes combinatorially.

  2. Computation of the power set of the set of USDNs.

In the next section, we, therefore, propose a suboptimal algorithm that can produce a solution faster than the optimal one. However, it compromises on solution quality.

6 A Heuristic Algorithm

In this section, we propose a suboptimal solution using a variant of the classical memory bounded A* algorithm [24]. Here, our objective is to find an SDN of a dependency network constructed in response to a query without constructing the entire hyper dependency network. Given the dependency network \(\mathcal{{G}}\) constructed in response to a query, we aim to generate a solution path of the hyper dependency network \(\mathcal{{HG}}\) corresponding to \(\mathcal{{G}}\) such that all constraints are satisfied as far as possible. To do so, we first associate a level with the nodes of the dependency network as follows:

$$\begin{aligned} \begin{aligned} Level (v_s) =&\,0;\\ Level (v_i) =&\,Level ~(\text { predecessor of } v_i~) + 1; \end{aligned} \end{aligned}$$
(14)
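A level assignment in the spirit of Eq. 14 can be sketched with a breadth-first traversal from \(v_s\). Eq. 14 does not say which predecessor is used when a node has several; the sketch below assumes the predecessor closest to \(v_s\), and the adjacency-map representation and node names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Mapping


def assign_levels(successors: Mapping[str, Iterable[str]], start: str) -> Dict[str, int]:
    """Assign Level(start) = 0 and Level(v) = Level(predecessor) + 1.

    Assumption: when a node has several predecessors, the one reached
    first by breadth-first search (the closest to start) defines its level.
    """
    levels = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in successors.get(node, ()):
            if nxt not in levels:
                levels[nxt] = levels[node] + 1
                queue.append(nxt)
    return levels


# Hypothetical network: v_s -> u1 -> {v1, v2} -> u2 -> v_e
toy_network = {
    "v_s": ["u1"],
    "u1": ["v1", "v2"],
    "v1": ["u2"],
    "v2": ["u2"],
    "u2": ["v_e"],
}
toy_levels = assign_levels(toy_network, "v_s")
```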

We now present the detailed algorithm. The state space of the A* algorithm is \(2^{|V_S|}\), where \(V_S\) is the set of AND nodes in the dependency network. A state is a collection of one or multiple nodes. The initial state of our algorithm is the state consisting of only \(v_e\) and the goal state is the state consisting of \(v_s\). Depending on the node type, a state is classified into two categories: AND state and OR state. The state consisting of a set of AND nodes is called an AND state and the state consisting of a set of OR nodes is called an OR state.

We now discuss the operator of our algorithm, that is, how we generate one state from another. Once we encounter a state, we construct its set of neighboring states. Here, we construct at most n neighbor states, where n is a given parameter that bounds the number of nodes explored by the algorithm. If we encounter an AND state, only one neighboring OR state is possible. However, if we encounter an OR state, the number of possible neighbor states is equal to \(\prod \limits _{i = 1}^{k} (2^{n_i} - 1)\), where the OR state consists of k OR nodes and the i-th OR node has \(n_i\) incoming links, for \(i = 1, 2, \ldots , k\). Although one incoming link is sufficient to activate an OR node belonging to an OR state, we may need multiple incoming links for an OR node to increase its activation probability. The number of ways to choose one or more incoming links from the set of incoming links of an OR node is \(2^{n_i} - 1\), which is the cardinality of the power set of the set of predecessor AND nodes of the OR node, excluding the empty set. Therefore, the number of ways to choose one or more links for every OR node is \(\prod \limits _{i = 1}^{k} (2^{n_i} - 1)\), which is the cardinality of the Cartesian product of these power sets, each taken without the empty set.

Example 5

We now illustrate the neighboring state construction methodology on the dependency network shown in Fig. 2(a). The initial state \(s_1\) is an AND state consisting of \(v_e\). The only possible neighbor of \(s_1\) is an OR state \(s_2\) consisting of \(\{u_{12}, u_{13}\}\). The incoming links for \(u_{12}\) are \((v_6, u_{12}), (v_8, u_{12})\) and the incoming links for \(u_{13}\) are \((v_7, u_{13}), (v_9, u_{13})\). The number of possible neighboring AND states is therefore \((2^2 - 1) \times (2^2 - 1) = 9\). The set of neighboring AND states is computed as follows. The power set of \(\{v_6, v_8\}\) excluding the empty set is \(\wp _1 = \{\{v_6\}, \{v_8\}, \{v_6, v_8\}\}\). The power set of \(\{v_7, v_9\}\) excluding the empty set is \(\wp _2 = \{\{v_7\}, \{v_9\}, \{v_7, v_9\}\}\). Taking the Cartesian product \(\wp _1 \times \wp _2\) and merging each pair into a single set of AND nodes yields \(\{\{v_6, v_7\}, \{v_7, v_8\}, \{v_6, v_7, v_8\}, \{v_6, v_9\}, \{v_8, v_9\}, \{v_6, v_8, v_9\}, \{v_6, v_7, v_9\}, \{v_7, v_8, v_9\}, \{v_6, v_7, v_8, v_9\}\}\). \(\blacksquare \)
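The construction in Example 5 can be reproduced with a few lines of Python. This is an illustrative sketch only: the function names and the dictionary representation of an OR state (OR node mapped to its predecessor AND nodes) are assumptions made for the example, and the bound n on the number of neighbors is not applied here.

```python
from itertools import combinations, product
from typing import Dict, FrozenSet, Iterable, List


def nonempty_subsets(nodes: Iterable[str]) -> List[FrozenSet[str]]:
    """All subsets of `nodes` except the empty set (2^n - 1 of them)."""
    ordered = sorted(nodes)
    return [
        frozenset(combo)
        for r in range(1, len(ordered) + 1)
        for combo in combinations(ordered, r)
    ]


def neighbor_and_states(or_state: Dict[str, Iterable[str]]) -> List[FrozenSet[str]]:
    """Pick a non-empty predecessor subset per OR node and merge the picks."""
    per_or_node = [nonempty_subsets(preds) for preds in or_state.values()]
    return [frozenset().union(*choice) for choice in product(*per_or_node)]


# OR state s_2 of Example 5: u12 has predecessors v6, v8; u13 has v7, v9.
s2 = {"u12": ["v6", "v8"], "u13": ["v7", "v9"]}
and_states = neighbor_and_states(s2)
```

If two OR nodes share a predecessor AND node, different choices can merge into the same AND state, so the product formula is then an upper bound on the number of distinct neighbors.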

We now discuss the cost function for each state. The cost f(s) of a state s is calculated as \(f(s) = g(s) + h(s)\), where g(s) is the number of AND nodes, excluding the dummy nodes, and h(s) is the heuristic function, which estimates how far the current AND state is from the goal state and is defined as:

$$\begin{aligned} \begin{aligned} h(s) =&\quad Level(v_i);\quad v_i \in s \text { and } s \text { is an AND state }\\ =&\quad 0;\qquad \qquad \quad s \text { is an OR state } \end{aligned} \end{aligned}$$
(15)
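The cost computation can be sketched as below. Two points are assumptions of this sketch rather than statements from the paper: g(s) is counted over the AND nodes accumulated from the initial state to s, and when an AND state contains nodes at different levels the largest level is used for h(s). All names and values are hypothetical.

```python
from typing import Iterable, Mapping


def heuristic(state_nodes: Iterable[str], is_and_state: bool, level: Mapping[str, int]) -> int:
    """h(s) from Eq. 15: a node level for AND states, 0 for OR states."""
    if not is_and_state:
        return 0
    # Assumption: take the largest level if the nodes of the state differ.
    return max(level[v] for v in state_nodes)


def cost(
    cumulative_and_nodes: Iterable[str],
    dummy_nodes: Iterable[str],
    state_nodes: Iterable[str],
    is_and_state: bool,
    level: Mapping[str, int],
) -> int:
    """f(s) = g(s) + h(s), with g(s) counting non-dummy AND nodes."""
    g = len(set(cumulative_and_nodes) - set(dummy_nodes))
    return g + heuristic(state_nodes, is_and_state, level)


# Hypothetical levels; v_e is treated as a dummy node in this example.
toy_level = {"v6": 2, "v8": 2, "v_e": 4}
f_and = cost({"v_e", "v6", "v8"}, {"v_e"}, {"v6", "v8"}, True, toy_level)
f_or = cost({"v_e"}, {"v_e"}, {"u12"}, False, toy_level)
```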

Finally, we discuss constraint validation. We validate the constraint on the cumulative set, which consists of the nodes from the initial state to the current state. If multiple incoming links of an OR node are considered in a solution, we compute the activation probability of the OR node using the maximum of the probability values assigned to those incoming links. However, the actual activation probability of the OR node is higher than the computed value, for the following reason. Consider an OR node \(u_i\) belonging to the cumulative set and having two incoming links \(e_1\) and \(e_2\). Consider further two events: (i) \(u_i\) is activated through \(e_1\) and (ii) \(u_i\) is activated through \(e_2\). We need to compute the probability of the event that \(u_i\) is activated through either \(e_1\) or \(e_2\). Since these two events are not independent, we cannot simply multiply the probability values of the individual events. On the other hand, we cannot compute the actual probability value either, since in this approach we traverse the dependency network in the backward direction. Therefore, at the moment of computation, we cannot compute the probability of the availability of \(e_1\) and \(e_2\). Hence, the actual probability of generating a solution is higher than the probability value computed by the heuristic algorithm. We use the following equation to represent the constraint.

$$\begin{aligned} \prod _{\begin{array}{c} v_i \in \\ \text { cumulative set } \end{array}} P(Av (v_i)) \times \prod _{\begin{array}{c} (v_i, u_j) \in \\ \text { cumulative set } \end{array}} Max_{v_i} P(Av (v_i, u_j)~|~Ac(v_i)) \ge \alpha _{solution} \end{aligned}$$
(16)
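A check of this constraint can be sketched as follows, taking the maximum link probability per OR node as described in the text. The data layout (a map of node availabilities and a map from each OR node to the probabilities of its chosen incoming links) and all numeric values are hypothetical.

```python
from math import prod
from typing import Mapping


def estimated_probability(
    node_availability: Mapping[str, float],
    chosen_links: Mapping[str, Mapping[str, float]],
) -> float:
    """Left-hand side of the constraint for a cumulative set.

    node_availability -- P(Av(v_i)) for each AND node in the cumulative set
    chosen_links      -- for each OR node u_j, P(Av(v_i, u_j) | Ac(v_i))
                         for every chosen incoming link (v_i, u_j)
    """
    node_term = prod(node_availability.values())
    # One factor per OR node: the best of its chosen incoming links.
    link_term = prod(max(links.values()) for links in chosen_links.values())
    return node_term * link_term


def satisfies_constraint(node_availability, chosen_links, alpha: float) -> bool:
    return estimated_probability(node_availability, chosen_links) >= alpha


# Hypothetical cumulative set.
toy_nodes = {"v6": 0.9, "v7": 0.95, "v8": 0.9}
toy_links = {"u12": {"v6": 0.9, "v8": 0.8}, "u13": {"v7": 0.85}}
toy_estimate = estimated_probability(toy_nodes, toy_links)
```

Because only the best link per OR node is counted, the returned value underestimates the true activation probability, which matches the discussion preceding the constraint.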

If a state violates the above constraint, we ignore the path through which the state is reached and, in that case, we do not update its cost value. If no state is found due to constraint violation, we choose a state that has the minimal violation. The final solution is obtained from the set of states from the initial state to the goal state having minimal cost. Since fewer hyper nodes are explored in this method, it is expected to be faster than the optimal algorithm. However, it compromises on solution quality, since the entire solution space is not explored due to the memory bound parameter.

7 Experimental Results

We implemented our proposed algorithms in Java (version \(1.7.0\_95\), 32 bit). All experiments were performed on an Ubuntu (version 14.04 LTS, 32 bit, kernel 3.13.0-77-generic) Linux system on a 2.53 GHz machine with 4 GB DDR3 RAM. The algorithms were evaluated against a synthetically generated dataset and the 19 public repositories of the 2005 ICEBE Web Service Challenge (WSC) [1]. To the best of our knowledge, this work is the first of its kind. Hence, we provide comparative experimental results between the two approaches proposed in this paper, the optimal one and the heuristic one.

Fig. 3. Service description

7.1 Dataset Description

We now present a brief description of our in-house dataset as shown in Fig. 3. We considered 19 different service categories. Each service category performs a specific operation/task. Under each category, there are 2 or 3 different sub categories. Each sub category is selected based on input-output parameters. The services under a specific sub category have an identical set of inputs and outputs. We used an in-house web crawler and the Open Travel Alliance (Footnote 1) dataset to get the number of services for some service categories (e.g., searchFlight, bookFlight, searchHotel, bookHotel, forecastWeather, bookAirportTransport, bookLocalTransport, searchRestaurant etc.). Consider a query \(\mathcal{{Q}}\) with inputs {FromAirport, ToAirport, DepartureDate, ReturnDate, No.ofPersons, Class, FlightPreferenceCriteria, Credential, ArrivalDate, CheckOutDate, No.ofRooms, City, Budget, HotelPreferenceCriteria, VisitingPreferenceCriteria, Cuisine} and desired output {FlightTicket, HotelBookingConfirmation, AirportCabBookingConfirmation, CityCabBookingConfirmations, WeatherForecastReport, RestaurantName, PhoneNumber}. The total number of services involved in resolving the query was 73. Later, we randomly generated the number of services corresponding to each service category to analyze the performance of our algorithms. We used a normal distribution with mean 0 and standard deviation 1 to generate the probability of the services being available in the repository and to generate the probability of producing each output by each service.

Table 2. Comparison of composition time for Case 1

We now compare the performance of the optimal algorithm with respect to the heuristic algorithm. We set \(\alpha _{solution} = 0.7\). We experimented with a few different values of n for the heuristic method. We divided our experiment into two categories.

Case 1: Comparison on Our Dataset: We compare the optimal algorithm with the heuristic algorithm on our synthetically generated dataset. Table 2 shows a comparison between the performances of the optimal and the heuristic algorithms. It is evident from the table that the heuristic algorithm is significantly faster than the optimal algorithm. Columns 3 and 4 of Table 2 show the average composition time for the optimal and the heuristic algorithms, respectively. Column 5 of Table 2 shows the performance degradation of the heuristic algorithm, measured as the percentage of constraint violation. The first 7 rows of Table 2, above the horizontal line, show how the computation time and the constraint violation vary as the number of services increases. The remaining rows (marked in a different shade) in Table 2 show how the computation time and the constraint violation vary as the value of n increases. In 9 cases, the heuristic algorithm violates the constraint, as evident from the table. However, in 7 cases, the heuristic algorithm is able to produce a solution while the optimal algorithm is unable to produce any result. Furthermore, it is also evident from the table that as the value of n increases, the constraint violation decreases and the computation time increases.

Table 3. Comparison of composition time for Case 2

Case 2: Comparison on the ICEBE-2005 Dataset: We compare the optimal algorithm with the heuristic algorithm on the ICEBE-2005 dataset. The dataset contains 19 service repositories. Corresponding to the first service repository (Out Composition), there are 4 queries. For each of the remaining service repositories, there are 11 queries. Table 3 shows a comparison between the performances of the optimal and the heuristic algorithms when executed with n set to 3. Column 2 of Table 3 shows the number of services in the ICEBE-2005 dataset. Columns 3 and 4 of Table 3 show the average composition time for the optimal and the heuristic algorithms, respectively. It is evident from the table that the heuristic algorithm is significantly faster than the optimal algorithm. Column 5 of Table 3 shows the performance degradation of the heuristic algorithm. Although the heuristic algorithm violates the constraints, in most of the cases the optimal algorithm fails to generate a solution, whereas the heuristic algorithm is able to produce one. It is evident from our experiments that the heuristic algorithm is more efficient in terms of computation time.

8 Conclusion and Future Directions

This paper presents a dynamic variation aware service composition algorithm from the functional perspective. As future work, we are extending our proposal to develop more sophisticated techniques for complex dependency service networks. We are also looking at the scenario in which new services are added to the repository or new outputs become available in the system. We believe that our work will open up new research directions in the general paradigm of composition for dynamic environments.