1 Introduction

In service computing, web service composition is the most effective technology for implementing a Service-Oriented Architecture (SOA) [10]. In recent years, a large number of enterprises have distributed and released their products through web services that can be accessed by others, which has led to rapid growth in the number of web services. A single service usually cannot meet complex user requirements, so it is necessary to combine multiple services into a service composition. Given that the number of services with the same functional attributes may be quite large, Quality of Service (QoS) has become an important factor in differentiating competing services. QoS-aware service composition has therefore become a key research direction in the service computing community [1, 6].

In practical applications, provided that user requirements are met, the criteria for evaluating whether a service composition solution has practical value are the quality, adaptability, and efficiency of the composition [11]. Web services inherently depend on the network environment, so network fluctuations lead to changes in QoS performance, such as increased delay. A good web service composition solution therefore needs to adapt to this dynamic environment. In addition, the growing number of services with similar functionality but different QoS significantly expands the candidate service space. More specifically, if a composition workflow contains m abstract services and each has n candidate services, there exist \(n^m\) possible composition solutions, which leads to a "combinatorial explosion problem" [2, 9]. Existing works mainly focus on using reinforcement learning (RL) to adapt to the dynamic environment; however, existing RL methods show poor efficiency on large-scale problems [11].

In this paper, we develop an adaptive service composition method based on deep reinforcement learning (DRL), which integrates reinforcement learning and deep learning: RL provides adaptivity in service composition, while deep learning enhances the ability of representation and generalization.

2 Related Work

In this section, we review related works that deal with large-scale and adaptive problems in service composition, including planning-based solutions, reinforcement learning (RL), and deep reinforcement learning (DRL).

In recent years, many studies have addressed the adaptability issue using techniques such as integer programming, graph planning, and artificial intelligence. In [13], the authors develop a method using AI planning to build the service composition workflow, and a repair approach is used to deal with changes during the composition process. However, building a service composition workflow requires a priori knowledge about the environment. Reinforcement learning provides an effective way to achieve adaptive service composition: RL is better suited to scenarios with incomplete information, using trial-and-error exploration to discover the optimal policy [4]. Wang et al. [12] propose a service composition method based on the Markov Decision Process (MDP). This method only utilizes RL, so it cannot handle large-scale service composition problems.

To address high-dimensional inputs in RL, deep learning, which can extract features from raw data, can be employed. In [5], a multi-layer perceptron is adopted to approximate the Q-value, leading to the Neural Fitted Q Iteration (NFQ) algorithm. Mnih et al. [7] apply DRL to Atari 2600 games, successfully learning control policies from high-dimensional sensory input with expert-level performance.

3 Preliminaries

3.1 Reinforcement Learning

In a standard RL framework, the agent interacts with the environment by executing actions, receiving feedback, and adjusting its behavior accordingly. Q-learning [3] is a widely used RL method. It approximates the value function of a state-action pair by reducing the difference between the estimated Q-values of successive states at every learning step. The update rule of the Q-function is defined as:

$$\begin{aligned} Q\left( s,a \right) \leftarrow (1-\alpha ) Q\left( s,a \right) +\alpha \left[ r+\gamma \max _{a'} Q\left( s',a' \right) \right] \end{aligned}$$
(1)

where \(\alpha \) is the learning rate, \(\gamma \) is the discount factor, and \(Q(s,a)\) is the state-action value of executing action a in state s. In RL, the discounted cumulative reward is used to evaluate the result, and it is defined as:

$$\begin{aligned} V = \sum _{i=0}^\infty \gamma ^i r_i \end{aligned}$$
(2)

where \(r_i\) is the immediate reward at the \(i\)-th step.
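For concreteness, the following is a minimal Python sketch of the tabular Q-learning update in (1) and the discounted return in (2); the environment interface (reset, step, available_actions) is a hypothetical assumption used only for illustration.

```python
import random
from collections import defaultdict

def q_learning(env, episodes=1000, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Illustrative tabular Q-learning; env interface is assumed."""
    Q = defaultdict(float)                       # Q[(state, action)] -> value
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.available_actions(s)
            if random.random() < epsilon:        # epsilon-greedy exploration
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda x: Q[(s, x)])
            s_next, r, done = env.step(a)
            # update rule (1)
            best_next = 0.0 if done else max(
                Q[(s_next, a2)] for a2 in env.available_actions(s_next))
            Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * (r + gamma * best_next)
            s = s_next
    return Q

def discounted_return(rewards, gamma=0.9):
    """Discounted cumulative reward V in (2)."""
    return sum(gamma ** i * r for i, r in enumerate(rewards))
```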

Fig. 1. A simple LSTM block

3.2 Deep Learning

LSTM is a recurrent neural network (RNN) extended with memory. Compared with the original RNN, three gating units are added to the hidden memory block: the input gate, the output gate, and the forget gate. As shown in Fig. 1, an LSTM step can be divided into three parts: (1) the forget gate decides what information will be discarded from the cell state \(C_{t-1}\); (2) the block determines what information can be put into the cell, which consists of two parts: one part is controlled by the input gate and the other is a new candidate vector created by a tanh layer; (3) the old cell state is updated with this information.
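As a concrete reference for the three parts above, here is a minimal NumPy sketch of a single LSTM step; the weight layout (one matrix per gate acting on the concatenation of the previous hidden state and the input) is an illustrative assumption.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W and b hold one weight matrix / bias per gate (illustrative)."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate: what to discard
    i = sigmoid(W["i"] @ z + b["i"])          # input gate: what to write
    c_tilde = np.tanh(W["c"] @ z + b["c"])    # candidate vector from the tanh layer
    c_t = f * c_prev + i * c_tilde            # update the old cell state
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    h_t = o * np.tanh(c_t)
    return h_t, c_t
```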

3.3 Deep Reinforcement Learning

The Google DeepMind team combines the perception ability of deep learning with the decision-making ability of RL to develop Deep Reinforcement Learning (DRL). The learning process is divided into three steps: (1) through interaction with the environment (handled by RL), the agent obtains an observation and delivers the high-dimensional result to a neural network that learns abstract representations; (2) the agent evaluates actions based on their expected reward and maps the current state to a corresponding action according to its policy; (3) the environment responds to the action and produces the next observation.

This paper adopts an RNN structure to remember the continuous state information along the history timeline and uses an Adaptive Deep Q-learning and RNN Composition Network (ADQRCN), which is suitable for service composition.

4 Problem Formulation

Consider someone who wants to arrange a trip schedule after determining the departure and return times. He may consume services such as weather forecast, flight information search, and hotel reservation. The process of the whole trip can be modelled as the transition graph in Fig. 2, which consists of two kinds of nodes. A hollow node represents a state node (i.e., an abstract service), such as \(S_{0}\). A solid node represents a concrete service. An abstract service refers to a class of services with the same functional attributes but different QoS; every abstract service has multiple concrete services.

Fig. 2. The MDP-WSC model for vacation planning

Based on the flow chart of vacation planning, we construct a model to solve the problem. We model service composition as a Markov Decision Process (MDP) and further explore how to generate an effective policy.

Definition 1

(MDP-based web service composition (MDP-WSC)). An MDP-WSC is a 6-tuple MDP-WSC=\(<S,S_{0},S_{\tau },A(\cdot ),P,R>\), where

  • S is a finite set of the world states;

  • \(S_{0}\in S\) is the initial state from which an execution of the service composition starts;

  • \(S_{\tau }\subset S\) is the set of terminal states, indicating the end of a composition execution when a state \(S_{\tau }^{i} \in S_{\tau }\) is reached;

  • A(s) represents the set of services that can be executed in state \(s \in S \);

  • P is a probability distribution function. When a web service \(\alpha \) is invoked, the world makes a transition from its current state s to a succeeding state \(s'\). The probability for this transition is labeled as \(P(s'\left| {s,\alpha } \right. )\);

  • R is the immediate reward function. When the current state is s and a service \(\alpha \) is selected, we get an immediate reward \(r= R(s,\alpha )\) from the environment after executing the action.

The immediate reward from the environment can be calculated from the aggregated QoS value [12], where Att represents an attribute of a service and \(w\) is the weighting factor of that attribute:

$$\begin{aligned} R(s)=\sum w_i \times \frac{Att^{s}_{i} - Att^{min}_{i}}{Att^{max}_{i} - Att^{min}_{i}} \end{aligned}$$
(3)
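A minimal sketch of the reward aggregation in formula (3); the attribute names, weights, and min/max bounds below are illustrative assumptions, and in practice cost-type attributes such as ResponseTime would use an inverted normalization so that a better value yields a higher reward.

```python
def qos_reward(service_qos, weights, bounds):
    """service_qos: {attr: value}, weights: {attr: w_i},
    bounds: {attr: (min, max)} over the candidate set (all illustrative)."""
    r = 0.0
    for attr, value in service_qos.items():
        lo, hi = bounds[attr]
        norm = 0.0 if hi == lo else (value - lo) / (hi - lo)
        # For cost-type attributes (e.g. ResponseTime) one would typically
        # use (hi - value) / (hi - lo) so that "better" maps to 1.
        r += weights[attr] * norm
    return r

# Example usage with the QoS attributes considered in the experiments
# (all numeric values below are made up for illustration):
reward = qos_reward(
    {"ResponseTime": 120.0, "Throughput": 9.5, "Availability": 0.98, "Reliability": 0.95},
    {"ResponseTime": 0.25, "Throughput": 0.25, "Availability": 0.25, "Reliability": 0.25},
    {"ResponseTime": (50.0, 500.0), "Throughput": (1.0, 20.0),
     "Availability": (0.8, 1.0), "Reliability": (0.7, 1.0)},
)
```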

5 Service Composition Based on DRL

5.1 RNN in Deep Reinforcement Learning

The neural network mainly serves to generalize over state-action pairs and their corresponding Q-values. Figure 3 depicts the basic RNN structure in ADQRCN, where the input layer consists of the state and action information. The input is passed through a hidden layer composed of 30 Long Short-Term Memory (LSTM) units and a fully connected layer. Finally, the Q-value is generated by the output layer.
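As an illustration of this structure, the following is a minimal PyTorch sketch of a Q-network with the same shape (state-action features, a hidden layer of 30 LSTM units, a fully connected layer, and a scalar Q-value output); the use of PyTorch and the input feature dimension are assumptions, not details given in the paper.

```python
import torch
import torch.nn as nn

class ADQRCNNet(nn.Module):
    """Illustrative Q-network: state-action sequence -> LSTM(30) -> FC -> Q-value."""
    def __init__(self, feature_dim, hidden_size=30):
        super().__init__()
        self.lstm = nn.LSTM(feature_dim, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, seq_len, feature_dim) -- a history of state-action features
        out, _ = self.lstm(x)
        return self.fc(out[:, -1, :]).squeeze(-1)   # one Q-value per sequence
```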

Fig. 3. The structure of ADQRCN

5.2 Learning Strategies

With regard to the training of ADQRCN, we adopt a method similar to that in [7, 8]. The neural network of ADQRCN approximates the Q-function, as given by formula (4): the neural network \(f(s,a;\theta )\) is used to predict the Q-value, where \(\theta \) denotes the network parameters. The Bellman equation (5) is used to compute the squared error loss (6). Then, gradient descent (7) is used to update the network parameters.

$$\begin{aligned} f(s,a;\theta )&\approx Q(s,a;\theta ) \end{aligned}$$
(4)
$$\begin{aligned} Q(s,a)&=r+\gamma \max _{a'} Q(s',a';\theta ) \end{aligned}$$
(5)
$$\begin{aligned} L&=E\left[ \left( r+\gamma \max _{a'} Q(s',a';\theta )-Q(s,a;\theta ) \right) ^{2} \right] \end{aligned}$$
(6)
$$\begin{aligned} \frac{\partial L(\theta )}{\partial \theta }&=E\left[ \left( r+\gamma \max _{a'} Q(s',a';\theta )-Q(s,a;\theta ) \right) \frac{\partial Q(s,a;\theta )}{\partial \theta } \right] \end{aligned}$$
(7)
Algorithm 1. The training process of ADQRCN

5.3 Algorithm

Algorithm 1 describes the detailed training process of ADQRCN. First, an empty replay memory D with capacity N is initialized. The action-value function Q and the target action-value function \(\hat{Q}\) are both implemented by the recurrent neural network, initialized with random weights. During training, the agent selects an action according to the Q-value function and executes it. After obtaining the reward \(r_t\) and the next state \(s_{t+1}\), the transition \(\left( {{s}_{t}},{{a}_{t}},{{r}_{t}},{{s}_{t+1}} \right) \) is stored in the replay memory D. Then the adjustment process described in Sect. 5.2 begins, which improves the prediction accuracy of the Q-value function. The algorithm repeats the above process until convergence (the service composition result remains the same over two consecutive iterations) and outputs the final service composition result.
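A condensed sketch of the training loop in Algorithm 1 (replay memory D with capacity N, epsilon-greedy action selection, and a periodically refreshed target network \(\hat{Q}\)); the environment interface, the encode function that turns a state-action pair into a feature sequence, and all hyperparameter values are assumptions for illustration.

```python
import random
from collections import deque

import torch
import torch.nn.functional as F

def train_adqrcn(env, q_net, target_net, optimizer, encode,
                 episodes=3000, capacity=10000, batch_size=32,
                 gamma=0.9, epsilon=0.1, sync_every=200):
    """encode(state, action) -> (1, seq_len, feature_dim) tensor; all interfaces illustrative."""
    D = deque(maxlen=capacity)                          # replay memory D with capacity N
    target_net.load_state_dict(q_net.state_dict())      # \hat{Q} starts as a copy of Q
    steps = 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            actions = env.available_actions(s)
            if random.random() < epsilon:                # epsilon-greedy action selection
                a = random.choice(actions)
            else:
                with torch.no_grad():
                    a = max(actions, key=lambda x: float(q_net(encode(s, x))))
            s_next, r, done = env.step(a)
            D.append((s, a, r, s_next, done))            # store the transition in D
            s = s_next
            if len(D) >= batch_size:                     # adjust the network (Sect. 5.2)
                for s0, a0, r0, s1, d in random.sample(D, batch_size):
                    q_pred = q_net(encode(s0, a0))
                    q_next = 0.0 if d else max(
                        float(target_net(encode(s1, a1)))
                        for a1 in env.available_actions(s1))
                    target = torch.full_like(q_pred, r0 + gamma * q_next)
                    loss = F.mse_loss(q_pred, target)
                    optimizer.zero_grad()
                    loss.backward()
                    optimizer.step()
            steps += 1
            if steps % sync_every == 0:                  # periodically refresh \hat{Q}
                target_net.load_state_dict(q_net.state_dict())
    return q_net
```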

6 Experiments and Analysis

We conduct experiments to assess the proposed ADQRCN algorithm on three aspects: effectiveness, adaptability, and scalability. The traditional Q-Learning Service Composition Network (QCN) [12], which uses Q-learning and an MDP to obtain the optimal service composition, is implemented as a baseline for comparison.

6.1 Experiment Setting

In the experiments, we mainly consider four QoS attributes: ResponseTime, Throughput, Availability, and Reliability. The experimental data come from the QWS Dataset. Considering that the scale of the QWS Dataset is small, we randomly expand it to simulate a large-scale scenario, which allows us to verify the advantage of our method. In the evaluation, we use the discounted cumulative reward defined in formula (2) to measure the performance of a composition scheme.

The experimental environment is a Windows 7 (64-bit) system running on an Intel i7-6700K 4.00 GHz CPU with 16 GB RAM.

6.2 Result Analysis

6.2.1 Validation of Effectiveness

The experiments are conducted with 100 state nodes (abstract services), and each state node corresponds to 500 candidate services. Therefore, the total number of possible service composition schemes is \({{500}^{100}}\), which qualifies as a large-scale scenario.

Fig. 4. (a) Validation of effectiveness (b) Validation of scalability (c) Validation of adaptability

As shown in Fig. 4(a), ADQRCN outperforms QCN and converges more rapidly, so this experiment also demonstrates the efficiency of ADQRCN. Because QCN is based on tabular storage with random exploration, its performance is worse than that of ADQRCN, which benefits from the generalization ability of its network.

6.2.2 Validation of Scalability

In this series of experiments, the number of state nodes is fixed at 100, and the number of candidate services is set to 700 in Fig. 4(b) for comparison with the experiment with 500 candidate services in Fig. 4(a). The figure shows that the convergence rate of ADQRCN significantly outperforms that of QCN. Because ADQRCN adopts a neural network as the value-function approximator, it maintains a strong generalization ability and converges quickly.

6.2.3 Validation of Adaptability

In this experiment, to simulate a changing environment, we change the QoS values of 1%, 5%, and 10% of the services during a fixed period (between the 2000th and 2500th episodes). The results of the three groups of experiments are shown in Fig. 4(c). The fluctuation of services has some influence on learning performance, but the effects are temporary. Overall, ADQRCN shows stronger adaptability when facing such fluctuations, which may be attributed to its prediction model.

7 Conclusion

This paper proposes an adaptive deep reinforcement learning framework to ensure adaptability and efficiency in large-scale service composition. The framework uses a recurrent neural network to approximate the reinforcement learning value function and stores information effectively, improving the ability to scale to a large and dynamic service environment. The main innovations of this paper include the following:

  • We propose the MDP-WSC model, which is closer to real-world service composition problems and suitable for large-scale scenarios.

  • In view of the limitations of reinforcement learning, we integrate the perception ability of deep learning with reinforcement learning to solve large-scale service composition problems.