
1 Introduction

When acting in complex everyday environments, robots are prone to experiencing failures. For instance, Fig. 1 illustrates two failure examples with our Toyota Human Support Robot (HSR), namely failures in grasping an object from a table and placing it on a confined shelf.

Fig. 1. Unsuccessful attempts at grasping and placing an object

Failures such as these can be caused by various factors [18], for instance incomplete knowledge about the environment or sensory limitations. Regardless of the causes, robots should be able not only to recover from such failures, but also to use them as learning opportunities in order to improve. In practice, this is usually difficult to achieve, often due to limitations of the paradigm with which robot programs are created. For instance, planning manipulation trajectories for a specific manipulator motion is generally complex and is often solved by randomly searching for feasible solutions, which makes it difficult to predict and analyse the behaviour of a robot. Particularly in the context of manipulation, learning from demonstration [2] is a popular alternative to randomised planners and complex manipulation models: it replaces explicitly programmed motion models by a process in which feasible and predictable motions are demonstrated to the robot by a human demonstrator.

In this paper, we study the problem of learning from demonstration in the context of domestic robots by considering the dynamic motion primitive (DMP) framework [6] for representing manipulator trajectories. Dynamic motion primitives encode demonstrated motion trajectories by a dynamic model, often in Cartesian space with respect to a manipulator’s end effector. Such an explicit representation is beneficial when considering failures and failure analysis, since a motion primitive in this format can be used as a predictive model of a robot’s behaviour. Domestic robots are, however, typically mobile manipulators, so a representation of pure manipulator trajectories is generally insufficient for executing purposeful tasks; instead, such robots need to synchronise arm and base motions. We thus present a minor extension to motion primitives that allows whole body motion: arm motions are favoured whenever possible, but base motions are introduced when the manipulator can no longer move without reaching a singular configuration.

To validate the feasibility of our whole body motion framework, we present an extensive set of experiments in which a Toyota HSR grasped various everyday objects in a common domestic environment. The experiments evaluate the framework in a scenario in which the robot had to grasp objects from tables of two different heights; due to the variation of object poses, arm motion was sufficient for success in some of the trials, while whole body motion was required in others.

2 Related Work

Learning from demonstration [2, 10] is a popular technique for robot knowledge acquisition in which a robot learns an execution policy by generalising over one or more demonstrations by an expert operator. In the context of manipulation, such policies are often based on a representation of motion primitives. Motion primitives can be represented in different manners, such as combinations of basis functions extracted by performing principal component analysis on a set of demonstrations [11], distributions over trajectories [16], dynamic equations [6], and neural networks [4]. In this work, we use the dynamic motion primitive representation due to its easy interpretability and predictive nature, but also because of the potential to improve such learned models using reinforcement learning [9] and to create sequences of complex motions by combining individual motion primitives [12]. In this paper, we analyse motion primitives in the context of grasping everyday objects in a common domestic environment, such as the objects used in the RoboCup@Home competition [14], and we synthesise grasps based on 3D object representations using point cloud data [3].

3 Methodology

3.1 DMP Preliminaries

In the dynamic motion primitives framework [6], a motion trajectory is represented by a second-order differential equation of the form

$$\begin{aligned} \tau \ddot{\mathbf {y}} = \alpha \left( \beta (\mathbf {g} - \mathbf {y}) - \dot{\mathbf {y}} \right) + \mathbf {f} \end{aligned}$$
(1)

where \(\tau \) is a time constant that can be used for controlling the motion duration, \(\alpha \) and \(\beta \) are positive constants, \(\mathbf {g}\) is a motion goal, \(\mathbf {y}\) is the current state of the system, and \(\mathbf {f}\) is a forcing term. The system converges to \(\mathbf {g}\) if the forcing term vanishes, so arbitrary trajectories can be represented by varying the forcing term. Dynamic motion primitives can be learned, where learning reduces to finding a model of the forcing term that describes the desired motion; the model should ensure that the forcing term eventually tends to zero so that the system can converge to the desired goal. Motion primitives are usually learned from demonstrations, which is the approach we take here.
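To make the dynamics concrete, the following is a minimal sketch of how Eq. (1) can be integrated numerically for a single dimension; the Euler scheme, function name, and parameter values are illustrative choices rather than the implementation used in this work.

```python
def dmp_step(y, yd, g, f, tau=1.0, alpha=25.0, beta=6.25, dt=0.001):
    """One Euler integration step of the transformation system in Eq. (1)."""
    ydd = (alpha * (beta * (g - y) - yd) + f) / tau  # Eq. (1) solved for the acceleration
    yd = yd + ydd * dt
    y = y + yd * dt
    return y, yd

# With a vanishing forcing term (f = 0), the state converges to the goal g:
y, yd = 0.0, 0.0
for _ in range(10000):
    y, yd = dmp_step(y, yd, g=1.0, f=0.0)
```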

In their general form, DMPs are robot-independent, as they represent trajectories in Cartesian space; the conversion to joint motion commands is done by an inverse kinematics solver. Concretely, the Cartesian and joint velocities are related by \(\mathbf {\dot{y}} = J\dot{\mathbf {q}}\), where J is the manipulator Jacobian and \(\dot{\mathbf {q}}\) is a vector of joint velocities, so the joint velocity vector can be found as

$$\begin{aligned} \dot{\mathbf {q}} = J^{-1}\mathbf {\dot{y}} \end{aligned}$$
(2)

Particularly for non-redundant manipulators, J is likely to become singular in practice, in which case the inverse in Eq. (2) is not defined. To handle such singularities, we use the weighted damped least squares method [15, 21] to determine the joint velocities.
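As a sketch of this idea, a damped least squares step can be written as follows; the damping constant is an illustrative value, and the joint-space weighting matrix of the full weighted variant [15, 21] is omitted for brevity.

```python
import numpy as np

def dls_joint_velocities(J, v, lam=0.05):
    """Damped least squares alternative to the plain inverse in Eq. (2).

    Computes qd = J^T (J J^T + lam^2 I)^{-1} v, which remains bounded near
    singular configurations; lam trades tracking accuracy for robustness.
    """
    JJt = J @ J.T + (lam ** 2) * np.eye(J.shape[0])
    return J.T @ np.linalg.solve(JJt, v)
```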

DMPs themselves do not provide a direct way of dealing with kinematic and dynamic constraints on the robot joints, which have to be handled at the control level. In practice, this is not as significant an issue as it might seem, for two main reasons: (i) a primitive generated from a demonstrated trajectory generally represents feasible, reproducible motions (though this is largely determined by the manner in which trajectories are demonstrated [2]); (ii) a primitive-based policy is often not used directly, but is improved in a reinforcement learning scenario [9]. What is a practical issue, however, particularly in domains such as domestic robotics, is that \(\mathbf {g}\) is likely to lie outside the dexterous workspace of a robot’s manipulator (for instance, a robot may need to grasp a table object that is not reachable from the robot’s current position); in such cases, a motion primitive on its own is clearly insufficient for performing a task.

3.2 Combining DMPs and Whole Body Motion

Since we work with mobile robots, we overcome this problem by using a whole body motion framework that performs only arm motions when the manipulator Jacobian is well-conditioned, but introduces motion of the robot’s base when the Jacobian nears a singularity. Near-singular configurations can be detected by monitoring the smallest singular value \(\sigma _{min}\) of the Jacobian J defined above, which tends to zero in such configurations [8]; this follows from the fact that the condition number of a matrix is the ratio \(\frac{\sigma _{max}}{\sigma _{min}}\), so singular or near-singular matrices have a large condition number.

The distribution of linear velocities between the mobile base and the end effector is governed by the following relations:

$$\begin{aligned} m_{cap}&= \frac{\sigma _{min} - \sigma _{l}}{\sigma _{h} - \sigma _{l}} \end{aligned}$$
(3)
$$\begin{aligned} \mathbf {v}_{ee}&= m_{cap}\mathbf {v} \end{aligned}$$
(4)
$$\begin{aligned} \mathbf {v}_{b}&= (1 - m_{cap})\mathbf {v} \end{aligned}$$
(5)

where \(m_{cap}\) is what we call a capability coefficient of the manipulator, \(\sigma _{min}\) is the smallest singular value of J, \(\sigma _{l}\) is a lower threshold on \(\sigma _{min}\), \(\sigma _{h}\) is an upper threshold on \(\sigma _{min}\), \(\mathbf {v}\) is the desired linear velocity of the end effector in the global frame of reference, \(\mathbf {v}_{ee}\) is a velocity command for the end effector, and \(\mathbf {v}_{b}\) is a velocity command for the mobile base. Note that \(m_{cap}\) is clipped to the interval [0, 1], so motion of the base is introduced only when \(\sigma _{min}\) drops below \(\sigma _{h}\). In a similar fashion, it is possible to incorporate limitations of the base motion, such as obstacles sensed by a distance sensor, in a modified version of the above policy to ensure safe base motion.
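In code, the velocity distribution of Eqs. (3)-(5) might look as follows; the threshold values are placeholders, since the concrete values used on the robot are given in footnotes not reproduced here.

```python
import numpy as np

def distribute_velocity(J, v, sigma_l=0.05, sigma_h=0.15):
    """Split the desired end effector velocity v between arm and base.

    Implements Eqs. (3)-(5): the capability coefficient m_cap is computed
    from the smallest singular value of the Jacobian J and clipped to
    [0, 1], so the base only starts moving once sigma_min drops below sigma_h.
    """
    sigma_min = np.linalg.svd(J, compute_uv=False).min()
    m_cap = np.clip((sigma_min - sigma_l) / (sigma_h - sigma_l), 0.0, 1.0)
    return m_cap * v, (1.0 - m_cap) * v  # (v_ee, v_b)
```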

As mentioned above, the desired end effector velocities are converted to joint velocities using an inverse kinematics solver. The obtained joint velocities are then executed by a low-level joint velocity controller, which performs the actual joint motions. Base velocity commands are similarly executed by the robot’s base velocity controller.

3.3 Demonstrating and Learning Motion Primitives

Learning from demonstration can be done in various ways, all of which impose different constraints on the robot and the demonstrator [2]. In this work, we use external observations for recording motion primitives: we record the motion of a demonstrator using a camera placed on the robot, which tracks the pose of an ArUco marker board [5]. The demonstration process is illustrated in Fig. 2.

Fig. 2. DMP demonstration by observing an ArUco marker board
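A recording loop for this setup could look roughly as follows; this is a sketch against the classic cv2.aruco API (parts of which were renamed in OpenCV 4.7), and the board geometry, camera intrinsics, and sampling logic are placeholders rather than the values used in our experiments.

```python
import cv2
import numpy as np

dictionary = cv2.aruco.getPredefinedDictionary(cv2.aruco.DICT_6X6_250)
board = cv2.aruco.GridBoard_create(4, 3, 0.04, 0.01, dictionary)
K = np.array([[600., 0., 320.], [0., 600., 240.], [0., 0., 1.]])  # placeholder intrinsics
dist = np.zeros(5)                                                # placeholder distortion

trajectory = []  # the demonstrated point set D, one 6-D pose per frame
cap = cv2.VideoCapture(0)
while len(trajectory) < 500:
    ok, frame = cap.read()
    if not ok:
        break
    corners, ids, _ = cv2.aruco.detectMarkers(frame, dictionary)
    if ids is not None:
        n, rvec, tvec = cv2.aruco.estimatePoseBoard(
            corners, ids, board, K, dist, None, None)
        if n > 0:  # board pose found: store position plus axis-angle orientation
            trajectory.append(np.concatenate([tvec.ravel(), rvec.ravel()]))
cap.release()
```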

By recording the board poses at a predefined frequency, we obtain a set of n points \(D = \{ d_i \; | \; d_i \in \mathbb {R}^6, 1 \le i \le n \}\). D is thus a representation of the demonstrated trajectory in Cartesian space, which is then used for learning the forcing term of a motion primitive. As done in [6], we represent the forcing term as a sum of weighted basis functions

$$\begin{aligned} f(x) = \frac{\sum _{i=1}^{k}{\varPsi _i(x)w_i}}{\sum _{i=1}^{k}{\varPsi _i(x)}}x(\mathbf {g} - \mathbf {y_0}) \end{aligned}$$
(6)

where \(\mathbf {y_0}\) is the initial position, \(\mathbf {g}\) is the goal, \(w_i\) are weights that need to be learned, and each basis function \(\varPsi _i(x)\) has the form

$$\begin{aligned} \varPsi _i(x) = \exp \left( -\frac{1}{2\sigma _i^2}(x - c_i)^2 \right) \end{aligned}$$
(7)

The weights \(w_i\) are learned using weighted linear regression.
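A compact sketch of this fitting step is given below: the target forcing values are obtained by rearranging Eq. (1) along the demonstrated trajectory, and each weight is then a locally weighted least squares solution; the function and variable names are ours, not from a published implementation.

```python
import numpy as np

def learn_weights(x, f_target, centers, widths, g, y0):
    """Fit the weights w_i of Eq. (6) by weighted linear regression.

    x        : phase variable samples along the demonstration
    f_target : forcing term values needed to reproduce the demonstration,
               obtained by solving Eq. (1) for f given the demonstrated
               positions, velocities, and accelerations
    centers, widths : the c_i and sigma_i of the basis functions in Eq. (7)
    """
    s = x * (g - y0)  # scaling term from Eq. (6)
    w = np.zeros(len(centers))
    for i, (c, sig) in enumerate(zip(centers, widths)):
        psi = np.exp(-((x - c) ** 2) / (2.0 * sig ** 2))  # Eq. (7)
        w[i] = (s * psi) @ f_target / ((s * psi) @ s + 1e-10)
    return w
```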

Note that we learn a separate primitive for each Cartesian dimension, which means that a complete motion primitive is a combination of six dimension-specific primitives that need to be followed together when the motion is reproduced. At runtime, we only control the position of the end effector, so the orientation primitives are not taken into account; this is only because the external observation method makes it difficult to capture the orientation reliably. A different demonstration method, such as kinesthetic teaching [2], would make it possible to use the orientation primitives in practice as well.

3.4 Purposeful Manipulation in Domestic Environments

We represent manipulation trajectories using motion primitives that can be demonstrated and learned, so as to have a predictable and generalisable model that allows a robot to perform tasks in a domestic environment. In particular, assuming that we have m motion primitives \(M_i, 1 \le i \le m\) and s skills \(S_j, 1 \le j \le s\), we associate each \(M_i\) with a skill \(S_j\), where this mapping need not be bijective: each \(M_i\) can be associated with multiple skills (for instance, a single primitive can be used for both grasping and placing), and each \(S_j\) can utilise multiple primitives (for instance, there could be multiple grasping primitives that are chosen depending on the context).
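As an illustration, such a many-to-many association can be held in a simple lookup structure; the skill and primitive names below are hypothetical and only indicate the shape of the mapping.

```python
# Hypothetical many-to-many mapping between skills S_j and primitives M_i.
skill_to_primitives = {
    "grasp": ["sideways_grasp_dmp", "top_down_grasp_dmp"],
    "place": ["sideways_grasp_dmp"],  # the same primitive serves two skills
}

def primitives_for(skill):
    """Return candidate primitives for a skill; a context-dependent
    selection (e.g. based on object height) would be plugged in here."""
    return skill_to_primitives.get(skill, [])
```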

As noted above, the representation of motion primitives we apply does not directly use any dynamic information about the environment, such as obstacles in the way of the robot. We do not deal with such factors in this paper, but note that information about obstacles can be incorporated in the form of repulsive fields [17]. We consider this an essential future extension for the practicality of our framework in complex everyday scenarios.

4 Evaluation

To evaluate motion primitives for performing purposeful tasks in the context of domestic robots, we recorded a grasping primitive for a Toyota Human Support Robot (HSR) and performed a set of grasping experiments in which the robot had to grasp objects from two different surfaces: an ordinary dining table and a lower living room (coffee) table. To detect objects, we used the SSD object detector [13] trained on the COCO dataset; our experimental evaluation is not concerned with recognising objects, so the pretrained model suffices for our purpose and we did not retrain it on a new dataset. For calculating the object poses, we used a grasp synthesis method that assumes a fixed grasping orientation (resulting in a sideways grasp) and determines the position of the object to be grasped from the object’s point cloud by averaging the positions of its points:

$$\begin{aligned} \begin{pmatrix} x \\ y \\ z \end{pmatrix} = \begin{pmatrix} \frac{1}{o}\sum \nolimits _{i=1}^{o}O_{x_i} \\ \frac{1}{o}\sum \nolimits _{i=1}^{o}O_{y_i} \\ \frac{1}{o}\sum \nolimits _{i=1}^{o}O_{z_i} \end{pmatrix} \end{aligned}$$
(8)

where O is the object’s point cloud and o is the number of points.
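In code, Eq. (8) is simply the centroid of the point cloud; a one-line sketch, assuming the points are stored as an (o, 3) array:

```python
import numpy as np

def grasp_position(points):
    """Grasp position as the centroid of the object's point cloud (Eq. 8);
    points is an (o, 3) array of the points in O."""
    return points.mean(axis=0)  # (x, y, z)
```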

The experimental setup was the same for both surfaces: a single object was placed on the surface, and the robot had to (i) detect the object, (ii) determine its pose, (iii) choose a grasping pose, and (iv) execute a grasping motion using the previously recorded motion primitive. The pose of the object was varied in each experimental trial; in some trials, the robot could reach the object with pure arm motion, while in the others the object was out of reach, so whole body motion was required. Runs in which the object could not be detected were repeated until the object was found. Due to the design of the HSR, the robot has to align with the object before performing a grasping motion; in addition, the arm moves to a pregrasp position before executing a motion primitive, with two different pregrasp positions used depending on the surface. The sequence of steps for grasping an object on the dining table is depicted in Fig. 3.

Fig. 3. Object grasping sequence

The overall experimental setup is illustrated in Fig. 4, and the objects used for evaluation are shown in Fig. 5. We performed 10 experimental trials per object, resulting in 150 trials per surface, or 300 trials in total. Trials in which the object was grasped and remained in the gripper until manual removal were considered successful, provided that the manipulator did not have a strong collision with the table; trials in which there was a collision or the object was not grasped were counted as failures.

Fig. 4. Setup and objects used in the grasping experiments. In the case of the coffee table, the white board was placed in order to block out some reflections due to bad lighting conditions. In both cases, only one object at a time was placed at different positions on the table, such that the robot had to detect the object, estimate its pose, and then grasp it (10 times for each object and surface combination), resulting in 300 grasping trials in total.

Fig. 5. Objects used in the grasping experiments

Plots of representative planned and executed trajectories when grasping the noodle box are shown in Fig. 6. As can be seen, the executed trajectories do not exactly match the planned ones, but the deviations are tolerable in most practical scenarios.

Fig. 6. Planned and executed DMP-based trajectories by the Toyota HSR in the noodle box grasping experiment

The results of the baseline grasping experiment are shown in Table 1.

Table 1. Successful grasps in the grasping experiment (out of 10)

A large number of the failed attempts were caused by the object slipping due to an incorrectly estimated object pose; this was the case, for instance, for the bowl when turned upside down, the mug and noodle box on the dining table, and the shampoo on the coffee table. Occasional failures were due to collisions with the table that triggered a safety stop. For the emergency stop (e-stop), most failures were due to the object being too low for a sideways grasp; on the coffee table, neither the e-stop nor the light ball could be grasped for this reason. Failures such as these clearly indicate the need for a more general grasp planner that can adapt to the context and task; for instance, a top-down grasp is more suitable for objects such as the e-stop and the light ball, for which a sideways grasp is clearly suboptimal. During the experiments, we additionally noticed a degradation of the arm controller over time, but our experimental results are not affected by this phenomenon, since we mitigated it by resetting the controller.

5 Discussion and Future Work

The evaluation of dynamic motion primitives extended with a whole body motion policy has shown the feasibility of the approach for everyday object manipulation with a mobile manipulator, but it has also illustrated various limitations of the framework as presented here that need to be addressed to increase the method’s practical applicability. As would be expected, the quality of the demonstrated motion primitive significantly affects the motion executed by the manipulator. We have found that recording trajectories using external observations is not ideal, since it requires a trial-and-error procedure for finding a suitable primitive; we are thus investigating ways to use admittance control for demonstrating trajectories through kinesthetic teaching.

As illustrated by the experimental analysis, motion primitives can be reasonably generalised, but there are natural limits to the generalisation capabilities. In our experiments, the motion primitive was originally recorded for manipulation on the dining table, where a concave trajectory is executed, while a convex trajectory is needed for manipulation on the coffee table. In order to generalise the primitive, we manually specified two different pregrasp positions of the manipulator for the two tables, but an adaptive strategy that automatically selects pregrasp positions and motion primitives is needed for fully autonomous operation; various ideas in this direction are presented in [20]. As discussed before, it is also necessary to consider dynamic information about the environment, such as obstacles and external forces acting on the robot during the execution of trajectories, in order to increase the practical usefulness of primitive-based execution. Improving faulty motion primitives based on corrective demonstrations [7] and teacher feedback [1] is another possible extension. Finally, our primary motivation for using motion primitives is having a model that can be used for detecting and diagnosing robot execution failures; a consistency-based method as discussed in [19] provides some ideas for that, although its direct applicability needs to be investigated in a separate study.

6 Conclusions

In this paper, we analysed learning from demonstration in general, and the use of dynamic motion primitives for representing the motion trajectories of a mobile manipulator in particular, and investigated the framework on a specific mobile manipulator, a Toyota Human Support Robot. To allow executing trajectories that require synchronised arm and base motion, we discussed a small extension of manipulation-only motion primitives to whole body motion, in which motions of the mobile base are introduced when the manipulator approaches a singular configuration. We additionally presented an extensive set of experiments in which the robot grasped different everyday objects in a common domestic environment. Future work will address the manner in which trajectories are demonstrated to the robot; the adaptation of primitives to different contexts and tasks, as well as the use of primitives for predicting execution failures, need to be studied as well.