
1 Introduction

Motion capture has been a successful technique for creating realistic animations, and precise recordings of complex human movements are widely used in film, human-machine interaction and so forth [18]. However, the capture environment generally offers limited flexibility for producing varied motion clips, and the acquisition of such motion capture (mocap) data, e.g., skeletal joints, is prohibitively expensive. Under these circumstances, the need for motion editing approaches that reuse previously recorded data is growing. Motion transition is a technique that produces new motions by optimally combining multiple individual motion clips in the time domain. However, effective motion transition remains a non-trivial task due to its spatio-temporal articulated complexity. Most existing works either require many physical constraints or involve large optimization procedures, and some lack the flexibility to fulfill spatio-temporal consistency adaptively. Therefore, flexible skeletal motion transition is highly desirable, and improving transition performance will benefit many computer animation tasks.

The major goal of this paper is to create seamless motion transitions satisfying user demands. Following the procedure of constructing a motion graph, our proposed approach improves on state-of-the-art methods with the following two contributions: (1) a hybrid deep learning framework is proposed to extract the spatio-temporal features of each motion style, through which appropriate transition points can be reliably obtained; (2) a novel quadruple-like data structure is exploited for motion graph building, which greatly benefits motion splitting and indexing. Without much manual pre-processing, the proposed approach can automatically detect and create transitions between different styles.

The remainder of this paper is organized as follows: Sect. 2 surveys the related work, and Sect. 3 presents the proposed approach in detail. In Sect. 4, the experimental results are presented. Finally, we draw a conclusion in Sect. 5.

2 Related Work

Motion graph based approaches systematically establish a connected graph over a clip corpus, and treat the transition problem as the selection of nodes with certain properties. Along this line, Rose et al. [9] described motions with ‘verbs’ and ‘adverbs’ to blend between different motion styles, and complementary work has been done in [5]. Since the motion graph is constructed for efficient interpolation and pruning, many modifications have been developed to satisfy various user specifications. Lai et al. [6] found graph nodes by clustering clip groups and generated transitions via constrained simulation. Peng et al. [8] exploited a flexible transition graph by using a two-pass clustering method. Recently, Saida and Foudil [10] incorporated inverse kinematics into motion graph construction to overcome unnatural transitions. Although these approaches can produce visually pleasant results, their global transition point searching scheme often limits their applicability to efficient motion generation.

Deep learning methods have received wide attention in recent years, and there is great interest in applying deep learning techniques to synthesize novel data. A pioneering work on motion transition was done by Taylor et al. [16], who proposed a conditional RBM to exploit transition states in time series. Later, a factored conditional RBM [15] was designed to reduce the number of parameters in the conditional RBM while appropriately preserving stability. Gan et al. [2] suggested sampling from the variational posterior when training temporal sigmoid belief networks (TSBN), which allowed for generating natural motion transitions. Song et al. [12] factorized the weight tensor of the TSBN with labels to make the motion style more explicit. Experimentally, these deep learning based approaches are able to characterize the inherent motion states, but this often limits their applications to simple motion generation. Therefore, it is imperative to develop a transition approach combining a flexible motion graph with representative deep neural networks in practice. Holden et al. [3] combined a one-layer convolutional autoencoder with a deep feedforward network to control and synthesize motion data. This inspired us to utilize the strengths of both convolutional operations and RBMs to extract the local spatio-temporal features of the skeletal data.

3 The Proposed Methodology

As shown in Fig. 1, our framework involves two phases: offline training and online matching. The former phase aims to extract representative features for motion splitting and to construct a motion graph for transition point detection, while the latter phase supports the user interface and allows interactive operations for diverse motion transitions. In the following, the skeletal data representation and the topology of the model are illustrated in detail.

Fig. 1. The framework of the proposed transition system.

3.1 Hybrid Deep Neural Network

Popular mocap databases include CMU, HDM05 and the BVH library, and these publicly available datasets share a similar data structure. Specifically, the j-th joint of the i-th frame can be considered as a data point represented by a 3D position in space, \(\mathbf {D_{ij}}=\{D_{ij}^x, D_{ij}^y, D_{ij}^z\}\). In this section, we present a hybrid deep neural network that maps motion sequences into a 2D low-dimensional space, which contributes greatly to time efficiency in detecting transition nodes. Deep learning based models are powerful enough to extract representative features from the dataset. Since convolution operations have made a difference in computer graphics and speech recognition, we are inspired to use kernels to perceive the local features of the skeletal data. As shown in Fig. 2, the proposed hybrid model integrates a convolutional RBM (CRBM) with a deep belief network (DBN), in which the CRBM is utilized for inherent hierarchical feature learning, while the DBN is employed for low-dimensional feature space mapping.

Fig. 2. The flowchart of the proposed hybrid deep neural network.

Within the proposed model, the CRBM is a vital link between the skeletal data and the DBN, owing to its power in perceiving local information. Considering the spatio-temporal properties, the convolutional kernel size is set to 5×3. That is, our model groups 5 frames as a block and shifts over three contiguous joints adhering to the body structure. Besides, the pooling operation endows the model with useful properties, such as transformation invariance, denoising and the removal of irrelevant information. Specifically, a simplified version of the probabilistic max-pooling operation [7] is employed in our model.

Low-Level: Hierarchical Real-Valued CRBM. In our model, the joint positions are recorded in three-dimensional coordinates. Therefore, it is necessary to extend the two-dimensional convolution operation to three channels. Intuitively, the training of the CRBM is similar to that of an RBM, except that the weights \(K_{c, d}\) are shared among different blocks in one ‘Input’ map, where the subscript d indexes the different kernels. Like the RBM, the real-valued CRBM is an energy-based model, and its energy function can be defined as:

$$\begin{aligned} \begin{aligned} E(V, H; \theta )={}&-\sum _c\sum _{\alpha }\sum _d \frac{V_{\alpha }^c}{\delta } \otimes K_{c, d} \otimes C_{\alpha }^d {}\\&+\sum _c \sum _{\alpha } \frac{(V_{\alpha }^c-b_c)^2}{2\delta ^2}{-}\sum _d \sum _{\alpha } C_{\alpha }^d\cdot b_d \end{aligned} \end{aligned}$$
(1)

Physically, each part connecting the ‘Input’ and ‘Conv’ layers can be treated as a tiny RBM, and inference in the bottom-up direction is convolution-like. Let \(C_\alpha ^d\) denote the \(\alpha \)-th unit in the d-th ‘Conv’ map; its value can be obtained by weighting all the blocks at the same position across the different maps in the ‘Input’ layer:

$$\begin{aligned} C_\alpha ^d=\sum _{c=1}^3 \frac{V_\alpha ^c}{\delta } \otimes K_{c, d}+b_d \end{aligned}$$
(2)
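As an illustration, the inference of Eq. 2 can be sketched in a few lines of NumPy. This is a sketch for exposition only: the array shapes and the sliding-window (cross-correlation) orientation are our assumptions, not the paper's implementation.

```python
import numpy as np

def conv_maps(V, K, b_d, delta=1.0):
    """Eq. (2): compute one 'Conv' map C^d from three input channels.

    V : (3, F, J)   three coordinate channels (x, y, z), F frames, J joints
    K : (3, 5, 3)   one kernel per channel (5 frames x 3 joints)
    b_d : scalar bias of this map
    The sliding-window correlation below stands in for the paper's
    convolution operator; its orientation is a simplifying assumption.
    """
    _, F, J = V.shape
    kf, kj = K.shape[1], K.shape[2]
    C = np.full((F - kf + 1, J - kj + 1), b_d, dtype=float)
    for c in range(3):                       # sum over channels, Eq. (2)
        for i in range(C.shape[0]):
            for j in range(C.shape[1]):
                C[i, j] += np.sum(V[c, i:i+kf, j:j+kj] / delta * K[c])
    return C
```

With a 10-frame, 5-joint input and a 5×3 kernel, the resulting map has shape (6, 3), matching the valid-convolution geometry described above.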

As a Gaussian-binary undirected graphical model, each unit in the hidden layers is binary and is activated by a random threshold. In our work, the distribution of the ‘Conv’ layer is calculated by normalizing over a block whose size matches the pooling region:

$$\begin{aligned} \tilde{C_{\alpha }^d}=P(C_{\alpha }^d=1\vert V_{\alpha }^c, \theta )=\frac{e^{C_{\alpha }^d}}{1+\sum _{\alpha ' \in bd}e^{C_{\alpha '}^d}} \end{aligned}$$
(3)

where ‘bd’ denotes a 4-connected region in each ‘Conv’ map. The units in the pooling layer are not activated by random thresholds, in order to preserve the stability of our model. Subsequently, the pooling layer can be calculated by:

$$\begin{aligned} {P_{xy}}=\max \tilde{(C_\alpha ^d)}_{2\times 2} \end{aligned}$$
(4)

where the annotation \((\cdot )_{2\times 2}\) refers to one pooling block in the ‘Conv’ layer. To overcome the dimensional inconsistency, the borders of the maps in the ‘Conv’ layer are zero-padded, with the bandwidth set to 4 in our work. Unlike the positive phase, the visible layers can be sampled from a specified Gaussian distribution:

$$\begin{aligned} \tilde{V_{\alpha }^c}\sim N\left( \sum _{d} \tilde{C_{\alpha }^d} \otimes rot(K_{c, d}, 180^{\circ })+b_c,\delta ^2 \right) \end{aligned}$$
(5)

In practice, the original data are normalized to zero mean and unit variance for convenient inference. That is, \(\delta \) is set to 1 in our work.
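The simplified probabilistic max-pooling of Eqs. 3 and 4 can be sketched as follows. This is an illustrative sketch assuming 2×2 pooling blocks and an implicit 'off' state contributing the 1 in the denominator of Eq. 3.

```python
import numpy as np

def prob_max_pool(C):
    """Eqs. (3)-(4): simplified probabilistic max-pooling.

    C : (H, W) pre-activations of one 'Conv' map, H and W even.
    Within each 2x2 block 'bd', unit probabilities follow the softmax
    of Eq. (3), and the pooling unit (Eq. 4) keeps the block maximum.
    """
    H, W = C.shape
    probs = np.empty_like(C, dtype=float)
    pooled = np.empty((H // 2, W // 2))
    for i in range(0, H, 2):
        for j in range(0, W, 2):
            block = np.exp(C[i:i+2, j:j+2])
            p = block / (1.0 + block.sum())       # Eq. (3)
            probs[i:i+2, j:j+2] = p
            pooled[i // 2, j // 2] = p.max()      # Eq. (4)
    return probs, pooled
```

Because the pooling value is a probability rather than a sampled state, the same frame always maps to the same pooled representation, consistent with the stability argument above.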

Since Eqs. 3 and 5 give the forward and backward inference steps of the Gibbs sampling method, the updates of the biases and kernels can be defined as:

$$\begin{aligned} \begin{aligned} \Delta b_c&=V_{\alpha }^c-{\tilde{V_{\alpha }^c}}^{(k)}\\ \Delta b_d&=\tilde{C_{\alpha }^d}-{\tilde{C_{\alpha }^d}}^{(k)}\\ \Delta K_{c,d}&=V_{\alpha }^c \cdot \tilde{C_{\alpha }^d}-{\tilde{V_{\alpha }^c}}^{(k)}\cdot {\tilde{C_{\alpha }^d}}^{(k)} \end{aligned} \end{aligned}$$
(6)

where the annotation \(\{\cdot \}^{(k)}\) denotes the k-th Gibbs sampling step in the contrastive divergence learning method.
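A minimal sketch of the update rules in Eq. 6 might look as follows. For brevity a single channel and a single kernel are used, and the correlation form of the kernel gradient is our assumption rather than the paper's exact implementation.

```python
import numpy as np

def cd_update(V0, C0, Vk, Ck, K, b_c, b_d, lr=0.01):
    """Eq. (6): contrastive-divergence parameter updates (a sketch).

    V0, C0 : data-driven ('positive') visible map and Conv probabilities
    Vk, Ck : the same quantities after k Gibbs steps ('negative' phase)
    All arrays are single-channel / single-kernel for brevity.
    """
    b_c = b_c + lr * (V0.mean() - Vk.mean())
    b_d = b_d + lr * (C0.mean() - Ck.mean())
    # kernel gradient: correlate visibles with hidden probabilities
    kf, kj = K.shape
    dK = np.zeros_like(K)
    for i in range(C0.shape[0]):
        for j in range(C0.shape[1]):
            dK += V0[i:i+kf, j:j+kj] * C0[i, j] - Vk[i:i+kf, j:j+kj] * Ck[i, j]
    return K + lr * dK, b_c, b_d
```

When the negative-phase statistics equal the positive-phase statistics, all updates vanish, which is exactly the fixed point contrastive divergence seeks.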

High-Level: Stacked DBN. Activating the hidden units into binary states with random thresholds may induce uncertainty during forward inference. Specifically, the same frame may occupy two distinct positions in latent space when traversing the model twice. To eliminate the influence of random activation, we set the values of the visible units not to the binary hidden states of the preceding layer-wise RBM but to the distribution of its hidden layer. All units in the pooling layer are fully connected to the first layer of the DBN part. Each of the last four layers functions not only as the binary hidden units of the preceding RBM but also as the Gaussian visible inputs of the subsequent training. The training procedure is also based on contrastive divergence learning, but with the Gaussian-Bernoulli configuration [1].
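The deterministic bottom-up pass described above, propagating probabilities instead of binary samples, can be sketched as follows; the per-layer weight shapes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbn_forward(pool_probs, weights, biases):
    """Deterministic bottom-up pass through the stacked DBN.

    To keep one frame at a fixed point in latent space, each layer
    propagates the sigmoid *probabilities* rather than binary samples.
    'weights'/'biases' are per-layer parameters (hypothetical shapes);
    the output of the last layer is the 2-D latent code.
    """
    h = pool_probs.ravel()
    for W, b in zip(weights, biases):
        h = sigmoid(h @ W + b)
    return h
```

Since no sampling occurs, calling `dbn_forward` twice on the same frame yields an identical latent code, which is the determinism the text requires for reliable transition point detection.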

3.2 Motion Graph Mining via Quadruple Structure

A motion graph is very helpful in satisfying specific path constraints; its construction consists of motion splitting, transition point detection and transition frame generation [4]. The principle of constructing a motion graph is to ensure good connectedness while preserving efficient search under user constraints. To this end, we propose a quadruple data structure that benefits motion splitting and retrieval, while the proposed hybrid network detects the appropriate transition points.

Motion Splitting via Quadruple Structure. Differing from regular sampling [5] or clustering [14], we propose a quadruple-like structure to represent each node in the motion graph, featuring motion reuse and fast implementation:

$$\begin{aligned} Clip_{style}\,=\,{<}\text {RotAng}, \text {Start}, \text {End}, \text {Index}{>} \end{aligned}$$
(7)

where ‘RotAng’ indicates the deflection, which should exceed 10 degrees at the truncation frames, ‘Start’ and ‘End’ indicate the beginning and ending frames of each clip indexed from the whole motion sequence, and ‘Index’ records the identifier of the parent motion sequence. To implement rotation on motion sequences, simple IK (inverse kinematics) and geometry are needed to obtain a group of Euler angles, which can be computed quickly by referring to the solution suggested in [11]. The pseudocode for quadruple-structure based motion splitting is summarized in Algorithm 1.

Algorithm 1. Quadruple-structure based motion splitting.
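The quadruple of Eq. 7 and the splitting loop of Algorithm 1 can be sketched as follows. The per-frame deflection input and the helper names are hypothetical; only the 10-degree threshold and the field layout come from the text.

```python
from dataclasses import dataclass

@dataclass
class Quad:
    """One motion-graph node, Eq. (7): <RotAng, Start, End, Index>."""
    rot_ang: float   # deflection at the truncation frames (degrees)
    start: int       # first frame, indexed in the parent sequence
    end: int         # last frame, indexed in the parent sequence
    index: str       # identifier of the parent motion sequence

def split_motion(deflections, seq_id, threshold=10.0):
    """A sketch of Algorithm 1: cut a sequence wherever the per-frame
    deflection angle exceeds the 10-degree threshold.  'deflections'
    (precomputed turning angles) is a hypothetical input.
    """
    quads, start = [], 0
    for f, ang in enumerate(deflections):
        if abs(ang) > threshold:
            quads.append(Quad(ang, start, f, seq_id))
            start = f + 1
    quads.append(Quad(0.0, start, len(deflections) - 1, seq_id))
    return quads
```

Each resulting `Quad` indexes its parent sequence through `index`, so clips are never copied, only referenced, which is what makes the structure cheap for motion reuse.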

Appropriate Transition Point Detection. Motion transitions often appear between two similar motion frames, and appropriate transition points play an important role in new motion creation. Since our proposed hybrid deep model efficiently maps the skeletal data into a 2D latent space, the ‘end to end’ function for frame feature representation can be given by:

$$\begin{aligned} \begin{aligned} F_{x, y}^n&= sigmoid(F_{x,y}^{n-1}\times W_n+b_n)\\ \text {w.r.t.}\quad F_{x, y}^0&=\left[ \left( \sum _d D_{ij}\otimes K_{c,d}+b_d\right) _{mp}\right] _{fc} \end{aligned} \end{aligned}$$
(8)

where ‘mp’ denotes max-pooling, and ‘fc’ denotes the full connection between the ‘Pool’ and ‘DBN1’ layers. Thus the frame similarity can be measured by a naive Euclidean metric:

$$\begin{aligned} Dis_{s, t}=\sum _{i=1}^2 \parallel F_{s, i}-F_{t, i}\parallel ^2 \end{aligned}$$
(9)

The minimum is not always the best transition point, since the interpolated phase may oppose the moving direction. It is reasonable to reset the distance \(Dis_{s, t}\) to infinity when this happens. This constraint enables us to detect the most appropriate transition points by selecting the minimum value in the updated distance matrix. Besides, the transition can also be computed as in traditional motion graphs [9].
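The detection step above, including the infinite-distance reset, can be sketched as follows. How the direction mask is computed is left abstract here; `dir_ok` is assumed precomputed.

```python
import numpy as np

def best_transition(F_src, F_dst, dir_ok):
    """Pick the transition pair (s, t) from Eq. (9) distances.

    F_src, F_dst : (N, 2) and (M, 2) latent codes of the two clips
    dir_ok : (N, M) boolean mask, False where the interpolated phase
             would oppose the moving direction.
    Invalid pairs get infinite distance before the arg-min, as in the text.
    """
    diff = F_src[:, None, :] - F_dst[None, :, :]
    dist = np.sum(diff ** 2, axis=-1)            # Eq. (9)
    dist[~dir_ok] = np.inf                       # direction constraint
    s, t = np.unravel_index(np.argmin(dist), dist.shape)
    return s, t, dist[s, t]
```

Note how a pair that is closest in latent space but violates the direction constraint is skipped in favor of the nearest admissible pair.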

3.3 Fulfilling User Demands

Similar to Algorithm 1, we treat the user input as combinations of moving straight and turning by different degrees. For the style specified during generation, we can turn to the different groups indexed by ‘Quad’. As the paths input from the user interface are not as irregular as the trajectories in captured motions, the deflections while moving are equal to the angles between the x-axis and the path segments. However, the path segments should not be too long, to ensure that turns are split apart from straight sections. A reasonable suggestion is to keep the locomotive distance of each segment under 15 frames. The remaining operations are similar to Algorithm 1: splitting paths into small blocks and combining the continuous blocks into final clips. As a result, the clips are attached with turning degrees, which can be searched in the ‘Quad’ space indexing the original motion samples. The boundaries of the turning points of each movement can be noted by the x-value or z-value of each clip.
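The deflection computation described above can be sketched as follows; the polyline input format is our assumption.

```python
import math

def path_deflections(points):
    """Per-segment turning angles of a user path, as described above:
    each segment's heading is its angle against the x-axis, and
    consecutive headings are differenced to obtain deflections.

    points : polyline of (x, z) pairs on the ground plane; segments
    are assumed short (under ~15 frames of travel).
    """
    headings = [
        math.degrees(math.atan2(z2 - z1, x2 - x1))
        for (x1, z1), (x2, z2) in zip(points, points[1:])
    ]
    return [b - a for a, b in zip(headings, headings[1:])]
```

The resulting deflection list plays the same role as the per-frame deflections of Algorithm 1, so the same splitting logic can match path clips against the ‘Quad’ entries.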

4 The Experimental Results

To evaluate the transition performance, three kinds of motion styles (i.e., walk, run, jump) are picked from the publicly available CMU dataset. Unlike discriminative deep learning models, generative models do not require a very large training dataset, because the learning procedure aims at estimating the probability distribution of the representative data. Following the dataset sizes adopted in RBM based methods [12, 16], our training data are chosen from subjects 2, 6, 7, 8, 9, 13, 14, 16, 37, 91, 104, 105, 111, 118, 127, 139, 141 and 143, while the testing data are selected from subjects 13, 16 and 35. The experiments are implemented on a desktop with a 3.30 GHz Intel Core CPU and 8 GB RAM, with code written in Matlab 2015b.

4.1 Evaluation of Transition Quality

Without considering flexible control of the motion path, we only evaluate the quality of transitions between different motion styles. To this end, we compare our method with both a conventional blending method [5] and deep learning based methods [13, 16]. The linear blending is performed within two windows containing 40 frames, taken evenly from the two transition motions respectively. The latter two RBM based methods both generate transition frames by imposing Gaussian noise on the top hidden layer. Our method only applies simple spherical linear interpolation (Slerp) once two similar frames are detected through the hybrid CRBM-DBN model.
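The Slerp applied between the detected frames is standard quaternion interpolation; a minimal version, with the usual shorter-arc and near-parallel safeguards, is:

```python
import numpy as np

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions.

    q0, q1 : unit quaternions as 4-vectors; t in [0, 1].
    """
    dot = float(np.dot(q0, q1))
    if dot < 0.0:                 # take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:              # nearly parallel: fall back to lerp
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)
```

Applying this per joint rotation between the two detected frames yields constant angular velocity along the interpolation, which is why it behaves more gently than naive linear blending of joint angles.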

To analyze the naturalness of the generated movements, we extract the transitional trajectories of the left toe and hand. In Fig. 3, the positions of the two end-effectors are plotted along the vertical axis, while the horizontal axis shows the length of the transition phase. As normal human movements always involve slight body shaking, both joint trajectories contain noise of varying degrees; their amplitudes and frequencies are appropriate indicators for quality measurement. In view of this, compared with the generative deep learning models, the linear blending and Slerp methods create smoother transitions, since the noise induced by the former has higher amplitudes. In addition to the drawback of intensive noise, the transitions in ‘mhmublv’ and ‘TDBN’ are poorly controlled: we do not know what degree of Gaussian noise should be imposed on the hidden units, and the transitions may be generated randomly early or late. Secondly, the quality of blending within several aligned frames is problematic. If the aligned frame block is too small, the style change will be abrupt. Conversely, a long time-series frame block will lead to foot sliding, because the movement speed can hardly keep pace with the blending frames. Thus, the transition length greatly affects the blending quality. This issue has also been studied by Wang et al. [17], who revealed that the transition length can range from 0.03 to 2 s, where the optimal blend length is chosen by exhaustive search. In our framework, the blending length is controllable: the detected similar points can accelerate or decelerate the transitional speed. In Fig. 3, we can see that our method (red line) creates gentler transitions than linear blending within an aligned block of 40 frames.

Fig. 3. Comparisons of the transition phases among the conventional linear blending method, deep learning generative models and our hybrid framework. The transition phase comprises 400 frames, and the trajectories of the left toe and hand are both projected onto the xz-plane. The green line is drawn by the linear blending technique following the motion graph [5]. The blue line labelled ‘mhmublv’ is generated by the conditional RBM method [16]. The black line labelled ‘TDBN’ corresponds to the dynamical DBN [13]. (Color figure online)

4.2 Interactive Path Synthesis with Transition

In practice, motion transition may involve different moving paths in an interactive environment. In this case, we treat the path constraints as a search problem over turning angles in the original database. Specifically, the quadruple structure matches the user specifications through the drift angles while moving, and three kinds of user inputs, linear paths, complicated curves and interactive control, can be realized by our approach. As a natural operation, it is reasonable to interpolate a large number of frames (e.g., 100) between the transition points and segment them into 5 equal parts. Then, according to the transition length, down-sampling each part can generate satisfactory results.
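The interpolate, segment and down-sample step just described can be sketched as follows; linear interpolation of flat pose vectors stands in for the actual per-joint blending.

```python
import numpy as np

def transition_frames(pose_a, pose_b, length):
    """Resampling step described above: interpolate 100 frames between
    the two transition poses, segment them into 5 equal parts, then
    down-sample each part to match the desired transition length.

    pose_a, pose_b : flat pose vectors (linear interpolation of these
    vectors is a simplifying assumption).
    """
    ts = np.linspace(0.0, 1.0, 100)
    frames = (1 - ts)[:, None] * pose_a + ts[:, None] * pose_b
    parts = np.split(frames, 5)                     # 5 parts of 20 frames
    per_part = max(1, length // 5)
    kept = [p[np.linspace(0, len(p) - 1, per_part).astype(int)] for p in parts]
    return np.concatenate(kept)
```

Because each of the 5 parts is down-sampled independently, the transition length can be shortened or stretched without discarding either endpoint pose.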

Linear path: To clearly demonstrate that our model permits smooth transitions between arbitrary styles, we first picked three kinds of movements performed by ‘\(16\_22\)’, ‘\(16\_35\)’ and ‘\(16\_09\)’, corresponding to jump, walk and run along the z-axis, as shown in Fig. 4. Then two typical transition examples between several pairs of graph nodes are shown in Fig. 5, mainly containing transitions between the walk, run and jump styles. Both simple transitions reveal that our transition system can generate compelling results with periodic motions of arbitrary length, and can also transit one motion style to another. It can be seen that the transitions between different styles are natural enough to satisfy visual comfort. In addition, the transition results obtained by our model avoid foot sliding and simultaneously prevent abrupt changes while transiting between different motion styles.

Fig. 4. Original motion samples performed by ‘16_22.amc’, ‘16_35.amc’ and ‘16_09.amc’ respectively.

Fig. 5. Transition performance associated with simple motion and a linear path.

Fig. 6. Complex transition performances. (a) all sample trajectories; (b) walking along a square path and jumping at the end; (c) transition of path b; (d) path with three turning behaviors and a transition from walk to run; (e) transition of path d.

Complicated curves: Complex transitions consist of many simple clips, which makes it difficult to produce smooth frames between clip pairs while satisfying complex user inputs. In Fig. 6(a), we extract the trajectories of the original motion samples performed by subject 16 in three different styles. User inputs are matched with both trajectories and styles through the quadruple structure introduced in Sect. 3.2. Then, in Fig. 6(b) and (d), we simulate two distinct user inputs (red lines), and the matched motion trajectories are displayed as blue lines. The final syntheses of the skeleton movements are shown in Fig. 6(c) and (e) respectively. The experiments show that our proposed transition approach is able to synthesize different motion styles along a user-specified path, and the transition motions are physically natural and human-like. Therefore, the proposed transition framework is well suited to the specific problem of generating different styles of locomotion along arbitrary paths.

5 Conclusion

In this paper, we have presented a flexible framework that allows efficient human motion transition from motion capture data. Within the proposed approach, a hybrid deep neural network extracts the inherent skeletal features and maps the high-dimensional data into an efficient 2D latent space, through which the transition points between diverse motion styles can be reliably detected. Meanwhile, a quadruple-like data structure is exploited for motion graph building, which significantly benefits motion splitting and indexing. Matching user constraints through the movement deflection achieves better flexibility and efficiency than calculating the distance between trajectories directly. As a result, diverse motions can be transited well under different user inputs, while preserving the smoothness and quality of the original data. The experiments have shown its outstanding performance.