1 Introduction

Designing optimal controllers for robotic systems is one of the major tasks in robotics research. It is therefore desirable to have a controller that can control the robot for different tasks or contexts in real time; for example, a soccer robot should be able to kick the ball for any desired kick distance chosen from a continuous range. We describe a task by its context: a vector of variables that do not change during a task's execution, but may change from task to task. In this paper, for example, the context is the distance the ball should travel after being kicked, and it can be chosen by the agent.

The kick task is one of the most important skills in robotic soccer [1]. Typically, kick controllers are only applicable to a discretized set of desired distances; for example, three sets of parameters for the kick controller are obtained, applicable to long, mid and short distance kicks. Such a controller limits the robot's ability to pass the ball properly to its teammates. Controlling the robot to kick the ball (near-)optimally for arbitrary distances gives the agents far more control and options regarding their next decision, which could affect the game's outcome. Our goal is to find a parametric function that, given a desired kick distance, outputs the (near-)optimal controller parameters. In other words, we would like to obtain a policy \(\pi (\theta |s)\) that sets the parameters \(\theta \) of a robot kick controller given a context s, which is the desired kick distance.

Many algorithms have been proposed by the scientific community for optimizing robot controller parameters given an objective function [2,3,4,5,6,7,8,9]. However, many of these algorithms optimize a parameter set for a single context, such as optimizing a kick for the longest distance or the highest accuracy [10]. In other words, they fail to generalize the movement optimized for one context to different contexts. In order to generalize the kick motion to, for example, different kick distances, the parameters are typically optimized for several target contexts independently. Afterwards, to generalize the movements to new, unseen contexts and to obtain a continuous policy \(\pi (\theta |s)\), regression methods are commonly used [11, 12]. Although such approaches have been used successfully, they are time consuming and inefficient regarding the number of needed training samples: data points obtained from optimizing the kick controller for context s cannot be re-used to improve and accelerate the optimisation for context \(s'\). This is because optimizing the controller parameters and generalizing them are two independent processes, and the correlation between different contexts is ignored during the optimisation.

Therefore, in this paper we propose to use the contextual relative entropy policy search (CREPS) algorithm, which searches for the optimal parameters of the policy \(\pi (\theta |s)\) in a single optimisation run. In other words, in CREPS, optimizing the controller parameters and generalizing them happen simultaneously, and therefore the correlation between different contexts can be exploited to accelerate the optimisation. CREPS, however, has a major drawback related to its search distribution update: the distribution might collapse prematurely to a point estimate, resulting in premature convergence.
On the other hand, the CMA-ES algorithm [2], which is not a contextual algorithm, has been shown to avoid premature convergence. Therefore, we combine the update rules of CREPS and CMA-ES, resulting in contextual relative entropy policy search with covariance matrix adaptation (CREPS-CMA). We will show that CREPS-CMA avoids premature convergence, and hence we use CREPS-CMA for optimising the kick controller. We will also show that a non-linear policy over the desired kick distance clearly outperforms a linear one; this effect has also been observed for the humanoid walking task [13]. As a result, our robot is able to kick the ball for a continuous range of desired kick distances. This is in contrast with our previous approach, where we had three sets of parameters for short, mid and long distance kicks.

2 The Approach

We used a simulated Nao robot, shown in Fig. 1, for our experiments. Our movement pipeline is composed of two main parts: a kick controller, which receives parameters \( {\varvec{\theta }} \) and converts them into joint commands for the robot's servos; and a policy function, which maps a given context s, i.e., a specific kick distance, to the corresponding parameter vector \( {\varvec{\theta }} \). The pipeline for the kick task, whose context is the kick distance s with a straight kick direction with respect to the torso, is shown in Fig. 2.

Fig. 1. The initial (left) and final (right) positions of an exemplary kick movement.

Fig. 2. The pipeline of our contextual kick movement.

2.1 Kick Controller

Our kick controller is a simple keyframe-based linear model [10], and a stability module as in [1] stabilizes the robot while it performs the kick movement. A keyframe, as defined in [10], is a complete description of joint angles, either absolute or relative to the previous keyframe. Our keyframe-based controller is defined by the following parameters:

  • The initial keyframe, represented as a vector \(\alpha \) of joint angles with dimension l.

  • The final keyframe, also represented as a vector \(\beta \) of joint angles with dimension l.

  • The action time t, i.e., the amount of time the robot takes to move from the initial to the final keyframe. The joint angles are linearly interpolated over t to create the corresponding movement.

While the kick is performed, only the leg joints move; the remaining joints (arms and head) are kept constant. As each leg has 6 joints, \(\alpha \) and \(\beta \) are 12-dimensional vectors. Therefore, including the action time t, our kick controller has 25 parameters to set. The controller receives a 25-dimensional parameter vector \( {\varvec{\theta }} \), which is then interpolated and encoded into motor commands. Figure 1 shows the initial and final positions of an exemplary kick. The stability module has constant parameters that do not change from task to task; see [1] for more details. Now we need to find a policy function of the kick distance s that provides the proper controller parameters \( {\varvec{\theta }} \) for any given desired kick distance.
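To make the controller concrete, the following minimal Python sketch (with hypothetical names; the actual servo interface is simulator-specific) shows how such a keyframe controller could turn the 25-dimensional parameter vector into interpolated joint targets:

```python
import numpy as np

def kick_commands(theta, dt=0.02):
    """Hypothetical sketch: split the 25-d parameter vector into the two
    keyframes and the action time, then linearly interpolate joint targets."""
    alpha = np.asarray(theta[:12])   # initial keyframe (12 leg joint angles)
    beta = np.asarray(theta[12:24])  # final keyframe (12 leg joint angles)
    t = float(theta[24])             # action time in seconds

    n_steps = max(1, int(round(t / dt)))
    for i in range(1, n_steps + 1):
        fraction = i / n_steps
        # linear interpolation between the two keyframes
        yield (1.0 - fraction) * alpha + fraction * beta

# usage: each yielded vector would be sent to the 12 leg servos every dt seconds
```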

2.2 Policy Function

Our goal is to find a function of the form

$$\begin{aligned} \mu ( {\varvec{s}} ) = \varvec{A} ^T \varphi ( {\varvec{s}} ), \end{aligned}$$

that, given a context vector \( {\varvec{s}} \) with dimension \(d_s\), outputs an optimal parameter vector \( {\varvec{\theta }} \) with dimension \(d_ {\varvec{\theta }} \) that maximises our objective function \(R( {\varvec{s}} , {\varvec{\theta }} ): \mathbb {R}^{d_s} \times \mathbb {R}^{d_ {\varvec{\theta }} } \rightarrow \mathbb {R}\). Here \(\varphi ( {\varvec{s}} )\) is an arbitrary feature function of the context \( {\varvec{s}} \) that outputs a feature vector with dimension \(d_\varphi \), and the gain matrix \( \varvec{A} \) is a \(d_\varphi \times d_ {\varvec{\theta }} \) matrix. Typically \(\varphi ( {\varvec{s}} ^{[i]}) = [1 \quad {\varvec{s}} ^{[i]}]\), which results in linear generalization over contexts. In order to achieve non-linear generalization over contexts, we can use normalized radial basis functions (RBFs) as the feature function:

$$\begin{aligned} \varphi _j( {\varvec{s}} ^{[i]}) = \frac{\psi _j( {\varvec{s}} ^{[i]})}{\sum _{j'=1}^{K}\psi _{j'}( {\varvec{s}} ^{[i]})}, \quad \psi _j( {\varvec{s}} ^{[i]})=\exp \left( -\frac{( {\varvec{s}} ^{[i]}-c_j)^2}{2\sigma ^2}\right) , \end{aligned}$$

where K is the number of RBFs, the centres \(\{c_j\}_{j = 1\dots K}\) are equally spaced over the range of \( {\varvec{s}} \), and \(\sigma ^2\) is the bandwidth of the RBFs. The bandwidth expresses how related contexts are: a large bandwidth means that contexts are very similar and the relationship is therefore (near-)linear, while a bandwidth of 0 is the extreme case where movements are not generalizable at all and each context has its own independent optimal parameters. Both K and \(\sigma ^2\) are hand-tuned parameters. RBF features have been shown to enable algorithms to learn non-linear policies which greatly outperform their linear counterparts on non-linear tasks, such as walking [13], so we expected a performance increase. Now the task is to learn the optimal gain matrix A. As we do not have labelled data to fit A, we use a reinforcement learning method.
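As an illustration, here is a minimal Python sketch of the normalized RBF feature function (function and variable names are illustrative; the centre placement and bandwidth values follow the kick setup described in Sect. 3.2):

```python
import numpy as np

def rbf_features(s, centers, sigma2):
    """Normalized radial basis features for a scalar context s."""
    psi = np.exp(-(s - centers) ** 2 / (2.0 * sigma2))  # unnormalized RBFs
    return psi / psi.sum()                              # normalize to sum to 1

# K equally spaced centres over the context range, as used for the kick task
K, sigma2 = 15, 0.5
centers = np.linspace(2.5, 12.5, K)
phi = rbf_features(s=7.0, centers=centers, sigma2=sigma2)  # feature vector of length K
```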

Algorithm 1. A compact representation of contextual stochastic search methods.

2.3 Learning Policy Function

In order to learn the policy function \(\mu ( {\varvec{s}} )\) we use a contextual policy search algorithm called CREPS-CMA. CREPS-CMA is an extension of contextual REPS [8, 14] and is capable of multi-task learning. The goal of CREPS-CMA is to find a function \(\mu ( {\varvec{s}} )\) that, given a context \( {\varvec{s}} \), outputs a parameter vector \( {\varvec{\theta }} \) such that \(\{ {\varvec{s}} , {\varvec{\theta }} \}\) maximises the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\). The only accessible information on the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\) are evaluations \(\{R_k\}_{k=1\dots N}\) of samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\), where k is the index of the sample, ranging from 1 to the number of samples N. CREPS-CMA maintains a stochastic search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) over the parameter space of \( {\varvec{\theta }} \), which is used to generate samples \( {\varvec{\theta }} \) given \( {\varvec{s}} \). The search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) is modelled as a linear Gaussian policy, i.e.,

$$\begin{aligned} \pi ( {\varvec{\theta }} |s) = \mathcal {N}\left( \varvec{\theta } | \varvec{A} ^T \varphi ( {\varvec{s}} ),\varSigma _{\pi } \right) , \end{aligned}$$

where the mean of the distribution is the policy function \(\mu ( {\varvec{s}} )\) we are searching for and the covariance matrix \(\varSigma _{\pi }\) controls the exploration of the algorithm. CREPS-CMA is an iterative algorithm. First, it initializes the search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) by setting the matrix A and the covariance matrix \(\varSigma _{\pi }\) to arbitrary values. Afterwards, in each iteration, given context samples \(\{ {\varvec{s}} _k\}_{k=1\dots N}\), the current search distribution \(q( \varvec{\theta } | {\varvec{s}} )\) is used to create samples \(\{ {\varvec{\theta }} _k\}_{k=1\dots N}\) of the parameter vector \( {\varvec{\theta }} \). Subsequently, the evaluations \(\{R_k\}_{k=1\dots N}\) of the samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\) are obtained by querying the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\), and the dataset \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,R_k\}_{k=1\dots N}\) is used to compute a weight \(d_k\) for each sample. Each weight is a pseudo-probability for the corresponding sample. Subsequently, using \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,d_k\}_{k=1\dots N}\), a new Gaussian search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) is estimated by estimating a new matrix A and covariance matrix \(\varSigma _{\pi }\). The new search distribution assigns higher probability to the samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\) with better returns \(\{R_k\}_{k=1\dots N}\). This process runs iteratively until the algorithm converges to a solution. Ultimately, we are interested in the matrix A, which constitutes our policy function \(\mu ( {\varvec{s}} )\). Algorithm 1 shows a compact representation of contextual stochastic search methods. Next we briefly explain how CREPS-CMA computes the weights and what the update rules of the search policy are.
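The iteration just described can be summarised in a short Python sketch (illustrative only; `compute_weights` and `update_policy` are placeholders for the weight computation and the update rules detailed in Sect. 2.4, where corresponding sketches are given):

```python
import numpy as np

def creps_cma(A, Sigma, features, objective, sample_contexts,
              n_iters=1000, n_samples=20):
    """Sketch of the contextual stochastic search loop (Algorithm 1)."""
    for _ in range(n_iters):
        # 1) observe contexts and sample parameters from the current policy q
        S = sample_contexts(n_samples)                       # contexts s_k
        Phi = np.array([features(s) for s in S])             # feature vectors
        Theta = np.array([np.random.multivariate_normal(A.T @ phi, Sigma)
                          for phi in Phi])                   # parameters theta_k
        # 2) evaluate the objective for every (s_k, theta_k) pair
        R = np.array([objective(s, th) for s, th in zip(S, Theta)])
        # 3) compute pseudo-probability weights d_k (REPS dual, Sect. 2.4)
        d = compute_weights(R, Phi)
        # 4) weighted update of the mean function A and the covariance Sigma
        A, Sigma = update_policy(Phi, Theta, d, A, Sigma)
    return A, Sigma
```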

2.4 CREPS-CMA

The key idea behind contextual REPS [8] is to ensure a smooth and stable learning process by bounding the relative entropy between the old search distribution \(q( {\varvec{\theta }} | {\varvec{s}} )\) and the newly estimated policy \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) while maximising the expected return. This results in a weight

$$\begin{aligned} d_k = \exp \left( ({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} - V( {\varvec{s}} ) ) / \eta \right) \end{aligned}$$

for each sample \([ {\varvec{s}} _k, {\varvec{\theta }} _k]\), which we can use to estimate a new search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\). \({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} \) denotes the expected performance when evaluating parameter vector \( {\varvec{\theta }} \) in context \( {\varvec{s}} \), and \(V( {\varvec{s}} ) = \varvec{\varphi } ( {\varvec{s}} )^T \varvec{w} \) is a context-dependent baseline which is subtracted from the return \({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} \). The parameters \( \varvec{w} \) and \(\eta \) are Lagrangian multipliers that can be obtained by minimising the dual function, given as

$$\begin{aligned} \min _{\eta , \varvec{w} } g(\eta , \varvec{w} ) = \eta \epsilon + \hat{ \varvec{\varphi } }^T \varvec{w} + \eta \log \left( \sum _{k=1}^N \frac{1}{N}\exp \left( \dfrac{R^{[k]} - \varvec{\varphi } ( {\varvec{s}} ^{[k]})^T \varvec{w} }{\eta } \right) \right) , \end{aligned}$$

where \(\epsilon \) is the bound on the relative entropy and \(\hat{ \varvec{\varphi } } = \frac{1}{N}\sum _{k=1}^N \varvec{\varphi } ( {\varvec{s}} ^{[k]})\) is the expected feature vector of the given context samples. We optimize this convex dual function by gradient descent. Now, given the dataset \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,d_k\}_{k=1\dots N}\) and the old Gaussian search distribution

$$\begin{aligned} q( {\varvec{\theta }} | {\varvec{s}} ) = \mathcal {N}\left( \varvec{\theta } | \varvec{A} _{q}^T \varphi ( {\varvec{s}} ),\varSigma _{q} \right) , \end{aligned}$$

we want to find the new search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) by finding \(A_{\pi }\) and \(\varSigma _{\pi }\). Therefore we need two update rules, one for updating the context-dependent policy function \( \varvec{\mu } _{\pi }( {\varvec{s}} )\) of the search distribution and another one for updating the covariance matrix \(\varSigma _{\pi }\) of the distribution.
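For illustration, the weights \(d_k\) can be obtained by minimising the dual given above; the following sketch assumes scipy's L-BFGS-B optimiser in place of plain gradient descent, normalizes the weights to pseudo-probabilities, and uses illustrative names (\(\epsilon \) is the relative-entropy bound):

```python
import numpy as np
from scipy.optimize import minimize

def compute_weights(R, Phi, epsilon=1.0):
    """Minimise the REPS dual g(eta, w) and return the sample weights d_k."""
    N, d_phi = Phi.shape
    phi_hat = Phi.mean(axis=0)  # expected feature vector of the context samples

    def dual(x):
        eta, w = x[0], x[1:]
        adv = (R - Phi @ w) / eta                      # (R_k - V(s_k)) / eta
        # log-sum-exp form of the dual, shifted for numerical stability
        m = adv.max()
        return eta * epsilon + phi_hat @ w + eta * (m + np.log(np.mean(np.exp(adv - m))))

    x0 = np.concatenate(([1.0], np.zeros(d_phi)))
    bounds = [(1e-8, None)] + [(None, None)] * d_phi   # eta must stay positive
    res = minimize(dual, x0, method="L-BFGS-B", bounds=bounds)
    eta, w = res.x[0], res.x[1:]

    d = np.exp((R - Phi @ w) / eta)                    # unnormalized weights
    return d / d.sum()                                 # pseudo-probabilities
```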

Context-Dependent Mean-Function Update Rule. The matrix A can be obtained by weighted maximum likelihood, i.e.,

$$\begin{aligned} \varvec{A} = {( \varvec{\varPhi } ^T \varvec{D} \varvec{\varPhi } + \lambda \varvec{I} )}^{-1} \varvec{\varPhi } ^T \varvec{D} \varvec{U} , \end{aligned}$$
(1)

where \( \varvec{\varPhi } ^{T} = [\varphi ^{[1]},\dots , \varphi ^{[N]}]\) contains the feature vectors of all context samples \(\{ {\varvec{s}} _k\}_{k=1\dots N}\), \( \varvec{U} = [\theta ^{[1]},\dots , \theta ^{[N]}]\) contains all sampled parameters, \( \varvec{D} \) is the diagonal weighting matrix containing the weights \(\{d_k\}_{k=1\dots N}\), and \(\lambda \varvec{I} \) is a regularization term with \(\lambda \) a very small number such as \(1\text {e}{-}8\) (not to be confused with the interpolation factor \(\lambda \) used in the covariance update below).
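A corresponding numpy sketch of the weighted maximum-likelihood update of Eq. 1 (names are illustrative; `Phi`, `Theta` and `d` are the stacked features, sampled parameters and weights):

```python
import numpy as np

def update_mean_function(Phi, Theta, d, reg=1e-8):
    """Weighted ridge regression of Eq. 1: A = (Phi^T D Phi + reg*I)^-1 Phi^T D Theta."""
    D = np.diag(d)                                  # diagonal weighting matrix
    G = Phi.T @ D @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(G, Phi.T @ D @ Theta)    # gain matrix A (d_phi x d_theta)
```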

Covariance Matrix Update Rule. Standard contextual REPS directly uses the weighted sample covariance matrix as \( \varvec{\varSigma } _{\pi }\) which is obtained by

$$\begin{aligned}&\varvec{S} = \frac{\sum _{k = 1}^N d_k\big ( {\varvec{\theta }} _k- \varvec{A} ^T \varphi ( {\varvec{s}} _k)\big )\big ( {\varvec{\theta }} _k- \varvec{A} ^T \varphi ( {\varvec{s}} _k)\big )^T}{Z}, \\&Z = \frac{(\sum _{k = 1}^N d_k)^2 - \sum _{k = 1}^N (d_k)^2}{\sum _{k = 1}^N d_k}. \nonumber \end{aligned}$$
(2)

It has been shown that the sample covariance matrix from Eq. 2 is not a good estimate of the true covariance matrix [15], since it biases the search distribution towards a specific region of the search space. In other words, the search distribution loses its exploration entropy along many dimensions of the parameter space, which causes premature convergence. This is a highly unwanted effect in policy search. To alleviate this problem, inspired by the rank-\(\mu \) update rule of CMA-ES [2] (which itself is not a contextual algorithm), we combine the old covariance matrix and the sample covariance matrix from Eq. 2, i.e.,

$$\begin{aligned} \varvec{\varSigma } _{\pi } = (1-\lambda ) \varvec{\varSigma } _q + \lambda \varvec{S} . \end{aligned}$$

There are different ways to determine the interpolation factor \(\lambda \in [0, 1]\) between the sample covariance matrix \( \varvec{S} \) and the old covariance matrix \( \varvec{\varSigma } _q\). For example, in [15], the factor is chosen such that the entropy of the new search distribution is reduced by a certain amount, while also being scaled with the number of effective samples. We extend REPS by using the rank-\(\mu \) covariance matrix adaptation method of the CMA-ES algorithm [2], which has been shown to be effective for avoiding premature convergence, i.e.,

$$\begin{aligned} \lambda = \min \left( 1, \frac{\phi _{\text {eff}}}{d_{ {\varvec{\theta }} }^2}\right) , \quad \phi _{\text {eff}} = \frac{1}{\sum _{k=1}^{N}(d^{[k]})^2}, \end{aligned}$$

where \(\phi _{\text {eff}}\) is the number of effective samples (assuming the weights are normalized to sum to one) and \(d_{ {\varvec{\theta }} }\) is the dimension of the parameter space \( {\varvec{\theta }} \).
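The covariance update (Eq. 2 combined with the rank-\(\mu \)-style interpolation above) can then be sketched as follows, again assuming normalized weights and illustrative names:

```python
import numpy as np

def update_covariance(Phi, Theta, d, A, Sigma_q):
    """Weighted sample covariance (Eq. 2) interpolated with the old covariance."""
    diff = Theta - Phi @ A                           # residuals theta_k - A^T phi(s_k)
    Z = (d.sum() ** 2 - (d ** 2).sum()) / d.sum()    # normalizer of Eq. 2
    S = (diff * d[:, None]).T @ diff / Z             # weighted sample covariance

    d_theta = Theta.shape[1]
    phi_eff = 1.0 / (d ** 2).sum()                   # effective sample size (weights sum to 1)
    lam = min(1.0, phi_eff / d_theta ** 2)           # interpolation factor
    return (1.0 - lam) * Sigma_q + lam * S
```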

3 Experiments

In this section, we first evaluate the CREPS-CMA algorithm. To do so, we use standard optimization test functions [16]: the Sphere, the Rosenbrock and the (multi-modal) Rastrigin functions, extended to be applicable in the contextual setting. The task is to find the optimal 15-dimensional parameter vector \( {\varvec{\theta }} \) for a given 1-dimensional context \( {\varvec{s}} \). We will show that CREPS-CMA performs favourably. Afterwards, we use CREPS-CMA to optimize our kick controller for different desired kick distances on a simulated Nao robot and report our accuracy results with both linear and non-linear policies. According to the results, the non-linear policy outperforms the linear one.

3.1 Standard Optimization Test Functions

We chose three standard optimization functions, which are the Sphere function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = \sum _{i=1}^{p} \varvec{x} _i^2, \end{aligned}$$

the Rosenbrock function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = \sum _{i=1}^{p-1}[100( \varvec{x} _{i+1}- \varvec{x} _i^2)^2+(1- \varvec{x} _i)^2], \end{aligned}$$

and also a multi-modal function, known as the Rastrigin function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = 10p + \sum _{i=1}^{p}[ \varvec{x} _i^2-10 \cos (2\pi \varvec{x} _i)], \end{aligned}$$

where p is the number of dimensions of \( {\varvec{\theta }} \) and \( \varvec{x} = {\varvec{\theta }} + \varvec{A} \varvec{s} \). The matrix A is a constant matrix that was chosen randomly; in our case, because the context \( {\varvec{s}} \) is 1-dimensional, A is a \(p \times 1\) vector. This definition of \( \varvec{x} \) means that the optimal \( \varvec{\theta } \) for these functions depends linearly on the given context \( {\varvec{s}} \). The initial search area of \( \varvec{\theta } \) for all experiments is restricted to the hypercube \(-5 \le {\varvec{\theta }} _i \le 5, i=1,\dots ,p\), and contexts are uniformly sampled from the interval \(0 \le {\varvec{s}} _i \le 3, i=1,\dots ,z\), where z is the dimension of the context space \( {\varvec{s}} \). In our experiments, the mean of the initial distribution was chosen randomly within the defined search area. We compared CREPS-CMA with standard contextual REPS. In each iteration, we generated 50 new samples. The results in Fig. 3 show that CREPS-CMA successfully learns the contextual tasks while standard contextual REPS suffers from premature convergence.
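For illustration, the contextual test functions can be written as below (a sketch; the random shift matrix, here called `A_shift` to avoid confusion with the policy gain matrix, and its sampling range are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 15                                      # dimension of theta
A_shift = rng.uniform(-1, 1, size=(p, 1))   # constant random context-shift matrix

def contextual_sphere(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return np.sum(x ** 2)

def contextual_rosenbrock(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def contextual_rastrigin(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return 10.0 * p + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

# CREPS-CMA would minimise these values (or maximise their negation as a return)
```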

Fig. 3. The performance comparison of CREPS and CREPS-CMA for optimising contextual versions of the standard functions: (a) Sphere, (b) Rosenbrock and (c) Rastrigin. The results show that CREPS-CMA clearly outperforms CREPS in all three benchmarks, while CREPS suffers from premature convergence.

Fig. 4. The setup of the 15 RBFs used for generating features.

Fig. 5. The learned linear (left) and non-linear (right) policies for kick distances of 2.5 to 12.5 m. The y-axis represents the controller parameter values for a given desired kick distance, and the x-axis represents the desired kick distance.

3.2 Kick Task Results

We use a Nao humanoid robot simulated in the RoboCup 3D simulation environment, which is based on SimSpark, a generic physical multi-agent system simulator. The robot has 22 degrees of freedom: six in each leg, four in each arm, and two in the neck. We use CREPS-CMA to train the simulated Nao robot by optimising the kick controller explained in Sect. 2, using both a linear policy, i.e., \(\varphi ( {\varvec{s}} ^{[i]}) = [1\quad {\varvec{s}} ^{[i]}]\), and an RBF-based non-linear policy. The desired kick distance s varies from 2.5 m to 12.5 m. For the non-linear policy, we choose \(K=15\) normalized RBFs and set \(\sigma ^2\) to 0.5. Both K and \(\sigma ^2\) were chosen by trial and error to maximize the accuracy of the results. Figure 4 shows the setup of the RBFs over the context range.

We maximize a context-dependent objective function

$$\begin{aligned} R(s,\theta ) = -(x-s)^2-y^2, \end{aligned}$$

where s is the desired kick distance, and x and y are the distances the ball travelled along the x- and y-axes when using the kick controller with the given parameter set \(\theta \). We initialize the search distribution \(\pi \) with a hand-tuned kick policy which was able to kick the ball over 15 m. We optimized the kick for 1000 iterations. In each iteration, 20 new samples were generated, with the contexts sampled uniformly. Each sample was evaluated 5 times and the returns were averaged to smooth out the noise. In order to simulate competition conditions, when evaluating each sample we placed the robot in 5 different positions around the ball; from each position it had to perceive the ball, move towards it, position itself and then kick the ball towards the target goal using the kick controller. We compared the performance of the linear policy with the non-linear one. Figure 6 shows that the non-linear policy clearly outperforms the linear one and that its accuracy is considerable. The average error of the linear policy was \(0.82 \pm 0.10\,\text {m}\), while we achieved an average error of \(0.34 \pm 0.11\,\text {m}\) using the non-linear policy. As expected, using a non-linear policy substantially improves the accuracy of the results: the average error is more than halved. This also demonstrates the non-linear nature of robotic tasks such as the kick task and the usefulness of RBF features for capturing this non-linearity. Figure 5 shows the learned linear and non-linear policies for generalizing the 25-parameter kick controller over different kick distances. We can see that the learned linear policy is a linear approximation of its corresponding non-linear policy.
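For reference, the return computation used during training can be sketched as follows; `execute_kick` is a hypothetical helper standing in for the simulator-specific rollout that reports the travelled ball distances:

```python
def kick_return(s, theta, n_rollouts=5):
    """Average the context-dependent objective R(s, theta) = -(x - s)^2 - y^2
    over several rollouts to smooth out noisy returns."""
    returns = []
    for _ in range(n_rollouts):
        # execute_kick is a hypothetical helper that runs the kick controller in
        # simulation and reports how far the ball travelled along the x- and y-axes
        x, y = execute_kick(theta)
        returns.append(-(x - s) ** 2 - y ** 2)
    return sum(returns) / len(returns)
```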

Fig. 6. The performance of the learned linear (blue) and non-linear (red) policies. The x-axis represents the desired kick distance, in meters, while the y-axis represents the error with respect to the desired kick distance, also in meters. (Color figure online)

4 Conclusion

We used a recently proposed contextual policy search algorithm to generalize a robot kick controller over different desired kick distances, where a context is described by a real-valued vector of distances. We modified the algorithm, naming the result CREPS-CMA. Using CREPS-CMA, we successfully learned linear and non-linear policies over the context of kick distances. The non-linear policy outperforms its linear counterpart and allows a humanoid robot to kick a ball over a flexible range of distances with satisfactory accuracy, which could lead to better control and coordination in a robotic soccer match. In this research, we also demonstrated the non-linearity of the kick task. In future work, we will use more complex kick controllers such as dynamic motor primitives.