1 Introduction

Designing optimal controllers for robotic systems is one of the major tasks in robotics research. It is therefore desirable to have a controller that can control the robot for different tasks or contexts in real time; for example, a soccer robot should be able to kick the ball for any desired kick distance chosen from a continuous range. We describe a task by its context: a vector of variables that do not change during a task's execution, but may change from task to task. In this paper, for example, the context is the distance the ball should travel after being kicked, and it can be chosen by the agent.

The kick task is one of the most important skills in robotic soccer [1]. Typically, kick controllers are only applicable to a discretized set of desired distances; for example, three sets of parameters for the kick controller are obtained, applicable to long, mid and short distance kicks. Such a controller limits the robot's ability to pass the ball properly to its teammates. Controlling the robot to kick the ball (near-)optimally for arbitrary distances gives the agents far more control and options regarding their next decision, which could affect the game's outcome. Our goal is to find a parametric function that, given a desired kick distance, outputs the (near-)optimal controller parameters. In other words, we would like to obtain a policy \(\pi (\theta |s)\) that sets the parameters \(\theta \) of a robot kick controller given a context s, which is the desired kick distance.

Many algorithms have been proposed by the scientific community for optimizing robot controller parameters given an objective function [2,3,4,5,6,7,8,9]. However, many of these algorithms optimize a parameter set for a single context, such as optimizing a kick for the longest distance or the highest accuracy [10]. In other words, they fail to generalize the movement optimized for one context to different contexts. In order to generalize the kick motion to, for example, different kick distances, the parameters are typically optimized for several target contexts independently. Afterwards, to generalize the movements to new, unseen contexts and to obtain a continuous policy \(\pi (\theta |s)\), regression methods are commonly used [11, 12]. Although such approaches have been used successfully, they are time consuming and inefficient regarding the number of needed training samples: data points obtained from optimizing the kick controller for context s cannot be re-used to improve and accelerate the optimisation for context \(s'\). This is because optimizing the controller parameters and generalizing them are two independent processes, and the correlation between different contexts is ignored during the optimisation.

Therefore, in this paper we propose to use the contextual relative entropy policy search (CREPS) algorithm, which searches for the optimal parameters of the policy \(\pi (\theta |s)\) in a single optimisation run. In other words, in CREPS, optimizing the controller parameters and generalizing them happen simultaneously, and therefore the correlation between different contexts can be exploited to accelerate the optimisation. CREPS, however, has a major drawback related to its search distribution update: the distribution might collapse prematurely to a point estimate, resulting in premature convergence.
On the other hand, the CMA-ES algorithm [2], which is not a contextual algorithm, has been shown to avoid premature convergence. Therefore, we combine the update rules of CREPS and CMA-ES, resulting in contextual relative entropy policy search with covariance matrix adaptation (CREPS-CMA). We will show that CREPS-CMA avoids premature convergence, and hence we use CREPS-CMA for optimising the kick controller. We will also show that a non-linear policy over the desired kick distance clearly outperforms a linear one; this effect has also been observed for the humanoid walking task [13]. As a result, our robot is able to kick the ball for a continuous range of desired kick distances. This is in contrast with our previous approach, where we had three sets of parameters for short, mid and long distance kicks.

2 The Approach

We used a simulated Nao robot, shown in Fig. 1, for our experiments. Our movement pipeline is composed of two main parts: a kick controller, which receives parameters \( {\varvec{\theta }} \) and converts them into joint commands for the robot's servos; and a policy function, which maps a given context s, i.e., a specific kick distance, to the corresponding parameter vector \( {\varvec{\theta }} \). The pipeline for the kick task, whose context is the kick distance s with a straight kick direction with respect to the torso, is shown in Fig. 2.

Fig. 1. The initial (left) and final (right) positions of an exemplary kick movement.

Fig. 2. The pipeline of our contextual kick movement.

2.1 Kick Controller

Our kick controller is a simple keyframe-based linear model [10], and a stability module as in [1] stabilizes the robot while it performs the kick movement. A keyframe, as defined in [10], is a complete description of joint angles, either absolute or relative to the previous keyframe. Our keyframe-based controller is defined by the following parameters:

  • The initial keyframe, represented as a vector \(\alpha \) of joint angles with dimension l.

  • The final keyframe, also represented as a vector \(\beta \) of joint angles with dimension l.

  • The action time t, i.e., the amount of time the robot takes to move from the initial to the final keyframe. The joint angles are linearly interpolated over t to create the corresponding movement.

While the kick is performed, only the leg joints move; the remaining joints (arms and head) are kept constant. As each leg has 6 joints, \(\alpha \) and \(\beta \) are 12-dimensional vectors. Therefore, including the action time t, our kick controller has 25 parameters to set. The controller receives a 25-dimensional parameter vector \( {\varvec{\theta }} \), which is then interpolated and encoded into motor commands. Figure 1 shows the initial and final positions of an exemplary kick. The stability module has constant parameters that do not change from task to task; see [1] for more details. Now we need to find a policy function of the kick distance s that provides the proper controller parameters \( {\varvec{\theta }} \) for any given desired kick distance.
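To make the controller concrete, the following minimal Python sketch (with hypothetical names; the actual servo interface is simulator-specific) shows how such a keyframe controller could turn the 25-dimensional parameter vector into interpolated joint targets:

```python
import numpy as np

def kick_commands(theta, dt=0.02):
    """Hypothetical sketch: split the 25-d parameter vector into the two
    keyframes and the action time, then linearly interpolate joint targets."""
    alpha = np.asarray(theta[:12])   # initial keyframe (12 leg joint angles)
    beta = np.asarray(theta[12:24])  # final keyframe (12 leg joint angles)
    t = float(theta[24])             # action time in seconds

    n_steps = max(1, int(round(t / dt)))
    for i in range(1, n_steps + 1):
        fraction = i / n_steps
        # linear interpolation between the two keyframes
        yield (1.0 - fraction) * alpha + fraction * beta

# usage: each yielded vector would be sent to the 12 leg servos every dt seconds
```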

2.2 Policy Function

Our goal is to find a function of the form

$$\begin{aligned} \mu ( {\varvec{s}} ) = \varvec{A} ^T \varphi ( {\varvec{s}} ), \end{aligned}$$

that, given a context vector \( {\varvec{s}} \) with dimension \(d_s\), outputs an optimal parameter vector \( {\varvec{\theta }} \) with dimension \(d_ {\varvec{\theta }} \) that maximises our objective function \(R( {\varvec{s}} , {\varvec{\theta }} ): \mathbb {R}^{d_s} \times \mathbb {R}^{d_ {\varvec{\theta }} } \rightarrow \mathbb {R}\). Here \(\varphi ( {\varvec{s}} )\) is an arbitrary feature function of the context \( {\varvec{s}} \) that outputs a feature vector with dimension \(d_\varphi \), and the gain matrix \( \varvec{A} \) is a \(d_\varphi \times d_ {\varvec{\theta }} \) matrix. Typically \(\varphi ( {\varvec{s}} ^{[i]}) = [1 \quad {\varvec{s}} ^{[i]}]\), which results in linear generalization over contexts. In order to achieve non-linear generalization over contexts, we can use normalized radial basis functions (RBFs) as the feature function:

$$\begin{aligned} \varphi _j( {\varvec{s}} ^{[i]}) = \frac{\psi _j( {\varvec{s}} ^{[i]})}{\sum _{j'=1}^{K}\psi _{j'}( {\varvec{s}} ^{[i]})}, \quad \psi _j( {\varvec{s}} ^{[i]})=\exp \left( -\frac{( {\varvec{s}} ^{[i]}-c_j)^2}{2\sigma ^2}\right) , \end{aligned}$$

where K is the number of RBFs, the centres \(\{c_j\}_{j = 1\dots K}\) are equally spaced over the range of \( {\varvec{s}} \), and \(\sigma ^2\) is the bandwidth of the RBFs. The bandwidth expresses how related contexts are: a large bandwidth means that contexts are very similar and the relationship is therefore (near-)linear, while a bandwidth of 0 is the extreme case where movements are not generalizable at all and each context has its own independent optimal parameters. Both K and \(\sigma ^2\) are hand-tuned parameters. RBF features have been shown to enable algorithms to learn non-linear policies which greatly outperform their linear counterparts on non-linear tasks, such as walking [13], so we expected a performance increase. Now the task is to learn the optimal gain matrix A. As we do not have labelled data to fit A, we use a reinforcement learning method.
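As an illustration, here is a minimal Python sketch of the normalized RBF feature function (function and variable names are illustrative; the centre placement and bandwidth values follow the kick setup described in Sect. 3.2):

```python
import numpy as np

def rbf_features(s, centers, sigma2):
    """Normalized radial basis features for a scalar context s."""
    psi = np.exp(-(s - centers) ** 2 / (2.0 * sigma2))  # unnormalized RBFs
    return psi / psi.sum()                              # normalize to sum to 1

# K equally spaced centres over the context range, as used for the kick task
K, sigma2 = 15, 0.5
centers = np.linspace(2.5, 12.5, K)
phi = rbf_features(s=7.0, centers=centers, sigma2=sigma2)  # feature vector of length K
```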

Algorithm 1. A compact representation of contextual stochastic search methods.

2.3 Learning Policy Function

In order to learn the policy function \(\mu ( {\varvec{s}} )\) we use a contextual policy search algorithm called CREPS-CMA. CREPS-CMA is an extension of contextual REPS [8, 14] and is capable of multi-task learning. The goal of CREPS-CMA is to find a function \(\mu ( {\varvec{s}} )\) that, given a context \( {\varvec{s}} \), outputs a parameter vector \( {\varvec{\theta }} \) such that \(\{ {\varvec{s}} , {\varvec{\theta }} \}\) maximises the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\). The only accessible information on the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\) are evaluations \(\{R_k\}_{k=1\dots N}\) of samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\), where k is the index of the sample, ranging from 1 to the number of samples N. CREPS-CMA maintains a stochastic search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) over the parameter space of \( {\varvec{\theta }} \), which is used to generate samples \( {\varvec{\theta }} \) given \( {\varvec{s}} \). The search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) is modelled as a linear Gaussian policy, i.e.,

$$\begin{aligned} \pi ( {\varvec{\theta }} |s) = \mathcal {N}\left( \varvec{\theta } | \varvec{A} ^T \varphi ( {\varvec{s}} ),\varSigma _{\pi } \right) , \end{aligned}$$

where the mean of the distribution is the policy function \(\mu ( {\varvec{s}} )\) we are searching for and the covariance matrix \(\varSigma _{\pi }\) controls the exploration of the algorithm. CREPS-CMA is an iterative algorithm. First, it initializes the search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) by setting the matrix A and the covariance matrix \(\varSigma _{\pi }\) to arbitrary values. Afterwards, in each iteration, given context samples \(\{ {\varvec{s}} _k\}_{k=1\dots N}\), the current search distribution \(q( \varvec{\theta } | {\varvec{s}} )\) is used to create samples \(\{ {\varvec{\theta }} _k\}_{k=1\dots N}\) of the parameter vector \( {\varvec{\theta }} \). Subsequently, the evaluations \(\{R_k\}_{k=1\dots N}\) of the samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\) are obtained by querying the objective function \(R( {\varvec{s}} , {\varvec{\theta }} )\), and the dataset \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,R_k\}_{k=1\dots N}\) is used to compute a weight \(d_k\) for each sample. Each weight is a pseudo-probability for the corresponding sample. Subsequently, using \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,d_k\}_{k=1\dots N}\), a new Gaussian search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) is estimated by estimating a new matrix A and covariance matrix \(\varSigma _{\pi }\). The new search distribution assigns higher probability to the samples \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k\}_{k=1\dots N}\) with better returns \(\{R_k\}_{k=1\dots N}\). This process runs iteratively until the algorithm converges to a solution. Ultimately, we are interested in the matrix A, which constitutes our policy function \(\mu ( {\varvec{s}} )\). Algorithm 1 shows a compact representation of contextual stochastic search methods. Next we briefly explain how CREPS-CMA computes the weights and what the update rules of the search policy are.
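The iteration just described can be summarised in a short Python sketch (illustrative only; `compute_weights` and `update_policy` are placeholders for the weight computation and the update rules detailed in Sect. 2.4, where corresponding sketches are given):

```python
import numpy as np

def creps_cma(A, Sigma, features, objective, sample_contexts,
              n_iters=1000, n_samples=20):
    """Sketch of the contextual stochastic search loop (Algorithm 1)."""
    for _ in range(n_iters):
        # 1) observe contexts and sample parameters from the current policy q
        S = sample_contexts(n_samples)                       # contexts s_k
        Phi = np.array([features(s) for s in S])             # feature vectors
        Theta = np.array([np.random.multivariate_normal(A.T @ phi, Sigma)
                          for phi in Phi])                   # parameters theta_k
        # 2) evaluate the objective for every (s_k, theta_k) pair
        R = np.array([objective(s, th) for s, th in zip(S, Theta)])
        # 3) compute pseudo-probability weights d_k (REPS dual, Sect. 2.4)
        d = compute_weights(R, Phi)
        # 4) weighted update of the mean function A and the covariance Sigma
        A, Sigma = update_policy(Phi, Theta, d, A, Sigma)
    return A, Sigma
```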

2.4 CREPS-CMA

The key idea behind contextual REPS [8] is to ensure a smooth and stable learning process by bounding the relative entropy between the old search distribution \(q( {\varvec{\theta }} | {\varvec{s}} )\) and the newly estimated policy \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) while maximising the expected return. This results in a weight

$$\begin{aligned} d_k = \exp \left( ({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} - V( {\varvec{s}} ) ) / \eta \right) \end{aligned}$$

for each sample \([ {\varvec{s}} _k, {\varvec{\theta }} _k]\), which we can use to estimate a new search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\). \({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} \) denotes the expected performance when evaluating parameter vector \( {\varvec{\theta }} \) in context \( {\varvec{s}} \), and \(V( {\varvec{s}} ) = \varvec{\varphi } ( {\varvec{s}} )^T \varvec{w} \) is a context-dependent baseline which is subtracted from the return \({\mathcal {R}_{ {\varvec{s}} {\varvec{\theta }} }} \). The parameters \( \varvec{w} \) and \(\eta \) are Lagrangian multipliers that can be obtained by minimising the dual function, given as

$$\begin{aligned} \min _{\eta , \varvec{w} } g(\eta , \varvec{w} ) = \eta \epsilon + \hat{ \varvec{\varphi } }^T \varvec{w} + \eta \log \left( \sum _{k=1}^N \frac{1}{N}\exp \left( \dfrac{R^{[k]} - \varvec{\varphi } ( {\varvec{s}} ^{[k]})^T \varvec{w} }{\eta } \right) \right) , \end{aligned}$$

where \(\epsilon \) is the bound on the relative entropy and \(\hat{ \varvec{\varphi } } = \frac{1}{N}\sum _{k=1}^N \varvec{\varphi } ( {\varvec{s}} ^{[k]})\) is the expected feature vector of the given context samples. We optimize this convex dual function by gradient descent. Now, given the dataset \(\{ {\varvec{s}} _k, {\varvec{\theta }} _k,d_k\}_{k=1\dots N}\) and the old Gaussian search distribution

$$\begin{aligned} q( {\varvec{\theta }} | {\varvec{s}} ) = \mathcal {N}\left( \varvec{\theta } | \varvec{A} _{q}^T \varphi ( {\varvec{s}} ),\varSigma _{q} \right) , \end{aligned}$$

we want to find the new search distribution \(\pi ( {\varvec{\theta }} | {\varvec{s}} )\) by finding \(A_{\pi }\) and \(\varSigma _{\pi }\). Therefore we need two update rules, one for updating the context-dependent policy function \( \varvec{\mu } _{\pi }( {\varvec{s}} )\) of the search distribution and another one for updating the covariance matrix \(\varSigma _{\pi }\) of the distribution.
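For illustration, the weights \(d_k\) can be obtained by minimising the dual given above; the following sketch assumes scipy's L-BFGS-B optimiser in place of plain gradient descent, normalizes the weights to pseudo-probabilities, and uses illustrative names (\(\epsilon \) is the relative-entropy bound):

```python
import numpy as np
from scipy.optimize import minimize

def compute_weights(R, Phi, epsilon=1.0):
    """Minimise the REPS dual g(eta, w) and return the sample weights d_k."""
    N, d_phi = Phi.shape
    phi_hat = Phi.mean(axis=0)  # expected feature vector of the context samples

    def dual(x):
        eta, w = x[0], x[1:]
        adv = (R - Phi @ w) / eta                      # (R_k - V(s_k)) / eta
        # log-sum-exp form of the dual, shifted for numerical stability
        m = adv.max()
        return eta * epsilon + phi_hat @ w + eta * (m + np.log(np.mean(np.exp(adv - m))))

    x0 = np.concatenate(([1.0], np.zeros(d_phi)))
    bounds = [(1e-8, None)] + [(None, None)] * d_phi   # eta must stay positive
    res = minimize(dual, x0, method="L-BFGS-B", bounds=bounds)
    eta, w = res.x[0], res.x[1:]

    d = np.exp((R - Phi @ w) / eta)                    # unnormalized weights
    return d / d.sum()                                 # pseudo-probabilities
```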

Context-Dependent Mean-Function Update Rule. The matrix A can be obtained by weighted maximum likelihood, i.e.,

$$\begin{aligned} \varvec{A} = {( \varvec{\varPhi } ^T \varvec{D} \varvec{\varPhi } + \lambda \varvec{I} )}^{-1} \varvec{\varPhi } ^T \varvec{D} \varvec{U} , \end{aligned}$$
(1)

where \( \varvec{\varPhi } ^{T} = [\varphi ^{[1]},\dots , \varphi ^{[N]}]\) contains the feature vectors of all context samples \(\{ {\varvec{s}} _k\}_{k=1\dots N}\), \( \varvec{U} = [\theta ^{[1]},\dots , \theta ^{[N]}]\) contains all sampled parameters, \( \varvec{D} \) is the diagonal weighting matrix containing the weights \(\{d_k\}_{k=1\dots N}\), and \(\lambda \varvec{I} \) is a regularization term with \(\lambda \) a very small number such as \(1\text {e}{-}8\) (not to be confused with the interpolation factor \(\lambda \) used in the covariance update below).
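A corresponding numpy sketch of the weighted maximum-likelihood update of Eq. 1 (names are illustrative; `Phi`, `Theta` and `d` are the stacked features, sampled parameters and weights):

```python
import numpy as np

def update_mean_function(Phi, Theta, d, reg=1e-8):
    """Weighted ridge regression of Eq. 1: A = (Phi^T D Phi + reg*I)^-1 Phi^T D Theta."""
    D = np.diag(d)                                  # diagonal weighting matrix
    G = Phi.T @ D @ Phi + reg * np.eye(Phi.shape[1])
    return np.linalg.solve(G, Phi.T @ D @ Theta)    # gain matrix A (d_phi x d_theta)
```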

Covariance Matrix Update Rule. Standard contextual REPS directly uses the weighted sample covariance matrix as \( \varvec{\varSigma } _{\pi }\) which is obtained by

$$\begin{aligned}&\varvec{S} = \frac{\sum _{k = 1}^N d_k\big ( {\varvec{\theta }} _k- \varvec{A} ^T \varphi ( {\varvec{s}} _k)\big )\big ( {\varvec{\theta }} _k- \varvec{A} ^T \varphi ( {\varvec{s}} _k)\big )^T}{Z}, \\&Z = \frac{(\sum _{k = 1}^N d_k)^2 - \sum _{k = 1}^N (d_k)^2}{\sum _{k = 1}^N d_k}. \nonumber \end{aligned}$$
(2)

It has been shown that the sample covariance matrix from Eq. 2 is not a good estimate of the true covariance matrix [15], since it biases the search distribution towards a specific region of the search space. In other words, the search distribution loses its exploration entropy along many dimensions of the parameter space, which causes premature convergence. This is a highly unwanted effect in policy search. To alleviate this problem, inspired by the rank-\(\mu \) update rule of CMA-ES [2] (which itself is not a contextual algorithm), we combine the old covariance matrix and the sample covariance matrix from Eq. 2, i.e.,

$$\begin{aligned} \varvec{\varSigma } _{\pi } = (1-\lambda ) \varvec{\varSigma } _q + \lambda \varvec{S} . \end{aligned}$$

There are different ways to determine the interpolation factor \(\lambda \in [0, 1]\) between the sample covariance matrix \( \varvec{S} \) and the old covariance matrix \( \varvec{\varSigma } _q\). For example, in [15], the factor is chosen such that the entropy of the new search distribution is reduced by a certain amount, while also being scaled with the number of effective samples. We extend REPS by using the rank-\(\mu \) covariance matrix adaptation method of the CMA-ES algorithm [2], which has been shown to be effective for avoiding premature convergence, i.e.,

$$\begin{aligned} \lambda = \min \left( 1, \frac{\phi _{\text {eff}}}{d_{ {\varvec{\theta }} }^2}\right) , \quad \phi _{\text {eff}} = \frac{1}{\sum _{k=1}^{N}(d^{[k]})^2}, \end{aligned}$$

where \(\phi _{\text {eff}}\) is the number of effective samples (assuming the weights are normalized to sum to one) and \(d_{ {\varvec{\theta }} }\) is the dimension of the parameter space \( {\varvec{\theta }} \).
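The covariance update (Eq. 2 combined with the rank-\(\mu \)-style interpolation above) can then be sketched as follows, again assuming normalized weights and illustrative names:

```python
import numpy as np

def update_covariance(Phi, Theta, d, A, Sigma_q):
    """Weighted sample covariance (Eq. 2) interpolated with the old covariance."""
    diff = Theta - Phi @ A                           # residuals theta_k - A^T phi(s_k)
    Z = (d.sum() ** 2 - (d ** 2).sum()) / d.sum()    # normalizer of Eq. 2
    S = (diff * d[:, None]).T @ diff / Z             # weighted sample covariance

    d_theta = Theta.shape[1]
    phi_eff = 1.0 / (d ** 2).sum()                   # effective sample size (weights sum to 1)
    lam = min(1.0, phi_eff / d_theta ** 2)           # interpolation factor
    return (1.0 - lam) * Sigma_q + lam * S
```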

3 Experiments

In this section, we first evaluate the CREPS-CMA algorithm. To do so, we use standard optimization test functions [16]: the Sphere, the Rosenbrock and the (multi-modal) Rastrigin functions, extended to be applicable in the contextual setting. The task is to find the optimal 15-dimensional parameter vector \( {\varvec{\theta }} \) for a given 1-dimensional context \( {\varvec{s}} \). We will show that CREPS-CMA performs favourably. Afterwards, we use CREPS-CMA to optimize our kick controller for different desired kick distances on a simulated Nao robot and report our accuracy results with both linear and non-linear policies. According to the results, the non-linear policy outperforms the linear one.

3.1 Standard Optimization Test Functions

We chose three standard optimization functions, which are the Sphere function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = \sum _{i=1}^{p} \varvec{x} _i^2, \end{aligned}$$

the Rosenbrock function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = \sum _{i=1}^{p-1}[100( \varvec{x} _{i+1}- \varvec{x} _i^2)^2+(1- \varvec{x} _i)^2], \end{aligned}$$

and also a multi-modal function, known as the Rastrigin function

$$\begin{aligned} f( {\varvec{s}} ,\theta ) = 10p + \sum _{i=1}^{p}[ \varvec{x} _i^2-10 \cos (2\pi \varvec{x} _i)], \end{aligned}$$

where p is the number of dimensions of \( {\varvec{\theta }} \) and \( \varvec{x} = {\varvec{\theta }} + \varvec{A} \varvec{s} \). The matrix A is a constant matrix that was chosen randomly; in our case, because the context \( {\varvec{s}} \) is 1-dimensional, A is a \(p \times 1\) vector. This definition of \( \varvec{x} \) means that the optimal \( \varvec{\theta } \) for these functions depends linearly on the given context \( {\varvec{s}} \). The initial search area of \( \varvec{\theta } \) for all experiments is restricted to the hypercube \(-5 \le {\varvec{\theta }} _i \le 5, i=1,\dots ,p\), and contexts are uniformly sampled from the interval \(0 \le {\varvec{s}} _i \le 3, i=1,\dots ,z\), where z is the dimension of the context space \( {\varvec{s}} \). In our experiments, the mean of the initial distribution was chosen randomly within the defined search area. We compared CREPS-CMA with standard contextual REPS. In each iteration, we generated 50 new samples. The results in Fig. 3 show that CREPS-CMA successfully learns the contextual tasks while standard contextual REPS suffers from premature convergence.
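For illustration, the contextual test functions can be written as below (a sketch; the random shift matrix, here called `A_shift` to avoid confusion with the policy gain matrix, and its sampling range are assumptions of the example):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 15                                      # dimension of theta
A_shift = rng.uniform(-1, 1, size=(p, 1))   # constant random context-shift matrix

def contextual_sphere(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return np.sum(x ** 2)

def contextual_rosenbrock(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return np.sum(100.0 * (x[1:] - x[:-1] ** 2) ** 2 + (1.0 - x[:-1]) ** 2)

def contextual_rastrigin(s, theta):
    x = theta + (A_shift @ np.atleast_1d(s)).ravel()
    return 10.0 * p + np.sum(x ** 2 - 10.0 * np.cos(2.0 * np.pi * x))

# CREPS-CMA would minimise these values (or maximise their negation as a return)
```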

Fig. 3. The performance comparison of CREPS and CREPS-CMA for optimising contextual versions of the standard functions: (a) Sphere, (b) Rosenbrock and (c) Rastrigin. The results show that CREPS-CMA clearly outperforms CREPS in all three benchmarks, while CREPS suffers from premature convergence.

Fig. 4. The setup of the 15 RBFs used for generating features.

Fig. 5. The learned linear (left) and non-linear (right) policies for kick distances of 2.5 to 12.5 m. The y-axis represents the controller parameter values for a given desired kick distance, and the x-axis represents the desired kick distance.

3.2 Kick Task Results

We use a Nao humanoid robot simulated in the RoboCup 3D simulation environment, which is based on SimSpark, a generic physical multi-agent system simulator. The robot has 22 degrees of freedom: six in each leg, four in each arm, and two in the neck. We use CREPS-CMA to train the simulated Nao robot by optimising the kick controller explained in Sect. 2, using both a linear policy, i.e., \(\varphi ( {\varvec{s}} ^{[i]}) = [1\quad {\varvec{s}} ^{[i]}]\), and an RBF-based non-linear policy. The desired kick distance s varies from 2.5 m to 12.5 m. For the non-linear policy, we choose \(K=15\) normalized RBFs and set \(\sigma ^2\) to 0.5. Both K and \(\sigma ^2\) were chosen by trial and error to maximize the accuracy of the results. Figure 4 shows the setup of the RBFs over the context range.

We maximize a context-dependent objective function

$$\begin{aligned} R(s,\theta ) = -(x-s)^2-y^2, \end{aligned}$$

where s is the desired kick distance, and x and y are the distances the ball travelled along the x- and y-axes when using the kick controller with the given parameter set \(\theta \). We initialize the search distribution \(\pi \) with a hand-tuned kick policy which was able to kick the ball over 15 m. We optimized the kick for 1000 iterations. In each iteration, 20 new samples were generated, with the contexts sampled uniformly. Each sample was evaluated 5 times and the returns were averaged to smooth out the noise. In order to simulate competition conditions, when evaluating each sample we placed the robot in 5 different positions around the ball; from each position it had to perceive the ball, move towards it, position itself and then kick the ball towards the target goal using the kick controller. We compared the performance of the linear policy with the non-linear one. Figure 6 shows that the non-linear policy clearly outperforms the linear one and that its accuracy is considerable. The average error of the linear policy was \(0.82 \pm 0.10\,\text {m}\), while we achieved an average error of \(0.34 \pm 0.11\,\text {m}\) using the non-linear policy. As expected, using a non-linear policy substantially improves the accuracy of the results: the average error is more than halved. This also demonstrates the non-linear nature of robotic tasks such as the kick task and the usefulness of RBF features for capturing this non-linearity. Figure 5 shows the learned linear and non-linear policies for generalizing the 25-parameter kick controller over different kick distances. We can see that the learned linear policy is a linear approximation of its corresponding non-linear policy.
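For reference, the return computation used during training can be sketched as follows; `execute_kick` is a hypothetical helper standing in for the simulator-specific rollout that reports the travelled ball distances:

```python
def kick_return(s, theta, n_rollouts=5):
    """Average the context-dependent objective R(s, theta) = -(x - s)^2 - y^2
    over several rollouts to smooth out noisy returns."""
    returns = []
    for _ in range(n_rollouts):
        # execute_kick is a hypothetical helper that runs the kick controller in
        # simulation and reports how far the ball travelled along the x- and y-axes
        x, y = execute_kick(theta)
        returns.append(-(x - s) ** 2 - y ** 2)
    return sum(returns) / len(returns)
```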

Fig. 6. The performance of the learned linear (blue) and non-linear (red) policies. The x-axis represents the desired kick distance, in meters, while the y-axis represents the error with respect to the desired kick distance, also in meters. (Color figure online)

4 Conclusion

We used a recently proposed contextual policy search algorithm to generalize a robot kick controller over different desired kick distances, where a context is described by a real-valued vector of distances. We modified the algorithm, naming the result CREPS-CMA. Using CREPS-CMA, we successfully learned linear and non-linear policies over the context of kick distances. The non-linear policy outperforms its linear counterpart and allows a humanoid robot to kick a ball over a flexible range of distances with satisfactory accuracy, which could lead to better control and coordination in a robotic soccer match. In this research, we also demonstrated the non-linearity of the kick task. In future work, we will use more complex kick controllers such as dynamic motor primitives.