1 Introduction

Recently there has been great progress in the field of generating and synthesizing images [11, 14, 18, 23, 30, 34, 42] and videos [33, 35], which is important for applications such as image manipulation, video animation, and the rendering of virtual environments. Over the past years, the gaming industry in particular has eagerly improved its customer experience by developing ever more realistic 3D game engines. These progressively push the level of detail of the rendered scenes, with an emphasis on natural movements and realistic appearances of the in-game characters. Characters are typically rendered with the help of detailed, explicit 3D models, which consist of surface models and textures, and are animated using tailored motion models to simulate human behavior and activity (Fig. 1).

Recent work [13] has shown that it is possible to learn natural human behavior (e.g. walking, jumping, etc.) from motion capture (MoCap) data of human actors. Designing a realistic 3D model of a person, on the other hand, is still a laborious process. Traditionally, explicit 3D shape models are created manually by specialized graphic designers, followed by the design of textures reflecting a person’s individual appearance, which are eventually mapped onto the raw body model. To circumvent this laborious design process, passive 3D reconstruction methods can be utilized [1, 3, 5, 12]. While these methods achieve impressive results, they rely on data recorded by costly multi-view camera setups [1, 5, 7, 9], depth sensors [15, 26, 29, 37] or even active 3D laser scans [20].

Fig. 1.

Our method can render target persons in a wide variety of behaviors. The desired behavior can be either generated based on inputs from a game controller or copied from a video.

Given the tremendous success of generative models [11, 14, 17, 18, 41] in the era of deep learning, the question arises: why not also learn to generate realistic renderings of a person, instead of only learning their natural movements? By conditioning the image generation process of a generative model on additional input data, mappings between different data domains are learned [14, 16, 42], which, for instance, allows for controlling and manipulating object shape, turning sketches into images, and turning images into paintings. Thus, by conditioning the rendering of a person on their body articulation, the laborious 3D modeling process can be circumvented by directly learning a mapping from the distinct postures composing human behaviors onto realistic images of humans.

In this work, we propose an approach towards a completely data-driven framework for realistic control and rendering of human behavior which can be trained from motion capture data and easily available RGB video data. Instead of learning an explicit 3D model of a human, we train a conditional U-Net [28] architecture to learn a mapping from shape representations to target images, conditioned on a latent appearance representation of a variational auto-encoder. For training our generative rendering framework, we utilize single-camera RGB video sequences of humans, whose appearance is mapped onto a desired pose representation in the form of 2D keypoints. Our approach then enables complete control of a virtual character: based on user inputs about desired motions, our system generates realistic movements together with a rendering of the character in a virtual environment. Since our rendering approach requires only example videos of the character, it can easily be used for personalized avatar creation. Furthermore, because it operates on 2D keypoints, it can also be used for video reenactment, where the behavior shown in a video is copied.

We evaluate our model both quantitatively and qualitatively in an interactive virtual environment and for simulating human activity from recorded video data.

Fig. 2.

Our pipeline for controllable synthesis of behavior: motion controls obtained from a game controller are used by \(\varPhi _{\text {PFN}}\) to update a 3D skeleton. Combined with a user-controllable camera, we obtain a 3D projection of the skeleton with estimates of keypoint visibility, as well as a rendered image of the virtual environment. Using the projected skeleton, we compute a stickman representation compatible with \(\varPhi _{\text {render}}\). The remaining pipeline is similar to Fig. 3, except that we do not need to perform inpainting.

2 Related Work

There is a large corpus of work on creating realistic 2D renderings of 3D objects and persons, as well as on their natural animation. In most cases this is still a laborious and manual process.

3D Modeling: The common approach for learning 3D models from visual data of real objects or humans is a three-dimensional reconstruction of the object of interest based on a collection of input images obtained from different camera views. Many works utilize multi-view camera recordings of their subjects, either in combination with specialized shape models describing the human body configuration [3, 12, 22, 43] or without additional shape priors [1, 5, 7, 9]. While these approaches yield impressive reconstructions, obtaining the required input data is costly and requires specialized equipment and recording settings.

3D scanners and cameras with RGB-D sensors, such as the KinectFusion system [15, 26], use depth information to increase the level of detail of the resulting 3D models [20, 29, 37]. However, specialized equipment is again needed and additional information is required. In contrast, our approach learns renderings of a subject from easily obtained RGB camera recordings.

Only a few works operate on monocular image data to learn 3D models. Moreover, these works typically neglect the individual appearance of the test subjects by modeling only posture skeletons [31, 40] or a fixed, neutral template appearance [24, 27]. Only the approach of Alldieck et al. [2] additionally learns the individual subject appearances and body shapes. However, their method depends on the SMPL model [22], which was trained on a large set of 3D body scans, while our rendering model is trained solely on easily available 2D input data.

Conditional Image Generation: Generative models [11, 18] offer an orthogonal avenue to the task of rendering in-game characters. Instead of relying on an explicit 3D model of a test subject, such models generate images from a latent space and allow for interpolation between the training images, e.g. across viewpoints and subjects [17, 32]. Moreover, conditioning the image generation on additional input data (such as class labels, contours, etc.) allows for mappings between the conditioning and target domains [14, 25, 41] and thus grants control over the generative process. Similarly to [4, 10, 19, 23, 30], we condition our generative model on pose information.

3 Method

The degree of realism which can be obtained when generating renderings of a certain behavior by a given test subject hinges on two crucial components: (i) a realistic appearance model considering both the shape and texture of the person, and (ii) a motion model which captures the dynamics of the behavior and controls the deformation of the appearance model. Such a motion model must be able to describe the distinct body articulations involved in performing an action and needs to simulate the natural transitions between them, i.e. the actual behavior.

In our framework, visualized in Fig. 2, both components are represented by deep neural networks, which are trainable from motion capture data and easily available RGB video data, respectively. Since the description of human shape and motion is a well-studied problem with efficient methods available, we integrate a pose estimation algorithm, OpenPose [6], for encoding human articulation and a motion model, PFNet [13], into our framework. PFNet is not only suitable for simulating natural human activity in the form of a keypoint representation, but also allows for direct interaction with the model, which is a crucial requirement for real-time game engines. Note, however, that our framework is also able to synthesize offline, i.e. from recorded videos of a given behavior, which can easily be represented by a sequence of corresponding shape descriptors.

For the rendering of a person we train a generative network conditioned on pose representations obtained from the motion model and on latent variables representing the appearance of a person.

In the following, we first briefly describe PFNet. To be able to render the resulting motion, we then introduce a projection between world and view coordinates of the rendered person in the scene. Furthermore, the domain shift from the keypoint representations used for training our generative rendering model to those returned by the previous step must be addressed. Finally, we present our model for conditional image generation.

Fig. 3.

Our pipeline for behavior cloning: given an input frame and a target appearance, we first estimate keypoints of the input frame and a transformation to a bounding box. The resulting cropped keypoints are then rendered with the target appearance. We apply the inverse transformation to align the reposed target image with the original input frame. Before alpha blending this output with the input frame, we perform inpainting to remove the original person.

3.1 Simulating Natural Human Behavior

When generating sequences of human poses, such as for animations or in computer games, one usually has a good idea of what kind of action the result should be. To generate highly realistic human action sequences, so-called Phase-Functioned Neural Networks (PFNN) [13] make use of the intrinsic periodicity of these motions. PFNNs such as that of [13] can be very shallow neural networks. Instead of learning a single set of model parameters, these networks are trained to learn four sets of weights \(\theta _i\), which are interpolated given a phase p by means of a Catmull-Rom spline (c-spline) interpolation \(\varTheta \):

$$\begin{aligned} \theta = \varTheta (\theta _0,\theta _1,\theta _2,\theta _3). \end{aligned}$$
(1)

In this work a three-layer fully connected neural network \(\varPhi _{\mathrm {PFN}}\) with 512 hidden units and exponential linear unit (ELU) activation functions \(\sigma _{ELU}\) is employed, using these interpolated weights \(\theta \). At each time step t the network computes the joint locations of a human skeleton, in the following called stickman \(\hat{y}^{3D}\), as well as the phase update \(\varDelta _{p}\). Letting the network choose the rate of change of p can be seen as letting it choose the rhythm of the current motion pattern. To obtain a smooth update, the past and estimated future trajectory \(\mathcal {T}\) over a total of 12 time steps is also given as an input. To control which kind of behavior is generated, the network additionally accepts a control input c, which allows a user to specify the desired motion pattern, such as walking or running, and its orientation.

$$\begin{aligned} {\hat{y}^{3D}_{t}},\varDelta {p_t}=\varPhi _{\mathrm {PFN}} ({\hat{y}^{3D}_{t-1}},c_{t},p_{t},\mathcal {T}_{t-6:t+6},\theta ) \end{aligned}$$
(2)
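To make the phase-functioned evaluation concrete, the following minimal numpy sketch interpolates four weight sets with a cyclic Catmull-Rom spline and evaluates a three-layer ELU network with the blended weights. The weight names (`W0`, `b0`, ...) and the input layout are illustrative assumptions; the actual PFNN implementation of [13] differs in detail.

```python
import numpy as np

def catmull_rom(thetas, p):
    """Cyclic Catmull-Rom interpolation of four weight sets given phase p in [0, 2*pi).

    `thetas` is a list of four parameter dicts (one per phase control point);
    the same interpolation is applied elementwise to every weight tensor.
    """
    t = 4.0 * p / (2.0 * np.pi)               # map the phase onto the four control points
    k = int(np.floor(t)) % 4
    w = t - np.floor(t)                       # local interpolation parameter in [0, 1)
    y0, y1, y2, y3 = (thetas[(k + j - 1) % 4] for j in range(4))
    blend = lambda a, b, c, d: 0.5 * (
        2.0 * b
        + (-a + c) * w
        + (2.0 * a - 5.0 * b + 4.0 * c - d) * w ** 2
        + (-a + 3.0 * b - 3.0 * c + d) * w ** 3
    )
    return {name: blend(y0[name], y1[name], y2[name], y3[name]) for name in y0}

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def phi_pfn(x, theta):
    """Three fully connected layers with ELU activations and 512 hidden units,
    evaluated with the phase-interpolated weights `theta`."""
    h = elu(theta["W0"] @ x + theta["b0"])
    h = elu(theta["W1"] @ h + theta["b1"])
    return theta["W2"] @ h + theta["b2"]      # concatenated pose and phase update
```

Here x would concatenate the previous pose \(\hat{y}^{3D}_{t-1}\), the trajectory window \(\mathcal {T}_{t-6:t+6}\) and the control signal \(c_t\); the output splits into \(\hat{y}^{3D}_{t}\) and \(\varDelta {p_t}\).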

3.2 Domain Adaptation

Figure 4 shows an example of the keypoints given by \(\hat{y}^{3D}\), rendered as small squares. Their 2D screen positions are used to define the 2D pose input \(\hat{y}^{2D}\) from which the final rendering of the person is generated. As our rendering model is trained given only the keypoints of visible body parts (see Sect. 3.3), we filter the keypoints returned by PFNN in two ways: (i) Keypoints of arms and knees are marked as visible if they are not occluded by a polygon defined by the keypoints of the hips and shoulders; this polygon is visualized in orange in Fig. 4. Body keypoints are assumed to be always visible. Our experiments show that, despite its simplicity, this approximation is sufficient. (ii) Let \({\mathbf {x}_{eye}}\) be the position of an eye keypoint, \({\mathbf {x}_{nose}}\) the position of the nose and \({\mathbf {x}_{cam}}\) the position of the camera. Each eye \(i \in \{0, 1\}\) is visible if

$$\begin{aligned} 0 > \left( \frac{\mathbf {x}_{cam} - \mathbf {x}_{nose}}{\Vert \mathbf {x}_{cam} - \mathbf {x}_{nose}\Vert } \times (-1)^{i}\,\mathbf {e}_{y} \right) \cdot \frac{\mathbf {x}_{nose} - \mathbf {x}^{(i)}_{eye}}{\Vert \mathbf {x}_{nose}-\mathbf {x}^{(i)}_{eye}\Vert } \end{aligned}$$
(3)
$$\begin{aligned} \text {with } i = {\left\{ \begin{array}{ll} 0 &{} \text {if left eye,} \\ 1 &{} \text {if right eye,} \end{array}\right. } \end{aligned}$$
(4)

where \({\mathbf {e}_{y}}\) is the unit vector in the y (up) direction, \(\times \) denotes the cross product and \(\cdot \) the scalar product. See Fig. 5 for a visualization. The nose is marked visible if one or both eyes can be seen. Occluded keypoints are marked red in Fig. 4, visible ones green. Note that the skeleton looks away from the camera and that the left arm is occluded by the body.
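As an illustration, the eye test of Eq. (3) reduces to one cross product and one dot product. The numpy sketch below assumes world coordinates in which \({\mathbf {e}_{y}}\) is the up direction; it is not part of the original implementation.

```python
import numpy as np

def eye_visible(x_cam, x_nose, x_eye, i, e_y=np.array([0.0, 1.0, 0.0])):
    """Visibility test of Eq. (3): i = 0 for the left eye, i = 1 for the right eye.

    The nose-to-camera direction is crossed with the (signed) up vector to obtain a
    sideways direction; the eye is visible if the dot product with the normalized
    nose-to-eye direction is negative.
    """
    view = (x_cam - x_nose) / np.linalg.norm(x_cam - x_nose)
    side = np.cross(view, (-1.0) ** i * e_y)
    eye_dir = (x_nose - x_eye) / np.linalg.norm(x_nose - x_eye)
    return float(np.dot(side, eye_dir)) < 0.0

def nose_visible(x_cam, x_nose, x_eye_left, x_eye_right):
    # the nose is kept if at least one eye passes the test above
    return (eye_visible(x_cam, x_nose, x_eye_left, 0)
            or eye_visible(x_cam, x_nose, x_eye_right, 1))
```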

Fig. 4.

Finding occluded keypoints using a simple polygon (shown in orange) and the view of the eyes, relative to the orientation of the camera. Occluded keypoints are shown in red, visible ones in green. Note that in our approach keypoints inside the orange polygon are marked as always visible. (Color figure online)

Fig. 5.

The visibility of the eyes is determined by their orientation relative to the camera. See Eq. 3 for details. If one or both eyes are visible, the nose can be seen as well. In this visualization \({\mathbf {e}_{y}}\) points upwards, out of the paper plane towards the reader.

3.3 Rendering

Our goal is to learn the rendering of characters from natural images. To achieve this, we train a neural network to map normalized 2D pose configurations to natural images. With this network available, we then project the 3D keypoints obtained in the previous step to 2D coordinates according to the desired camera view. After normalizing this configuration, we apply the network to obtain the rendering. Finally, the normalization transformation is undone so that the character can be blended into the scene at the correct position. The individual steps are explained in more detail in the following.

Coordinate Normalization. Since the rendering should be translation invariant, it would be very inefficient to train the network to predict images at arbitrary positions. Therefore, we use the 2D joint coordinates to define a region of interest which covers the joints and add 10% padding to account for the character’s volume. This results in a transformation \(M_\text {coord}\) which can be used to transform points as well as images using bilinear resampling.
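A minimal sketch of how such a normalization transformation could be computed from the 2D keypoints is shown below (numpy; the output resolution, the exact padding convention and all names are assumptions). For images, the same matrix would be applied via bilinear resampling with a warping routine.

```python
import numpy as np

def coord_normalization(kp2d, out_size=256, pad=0.1):
    """Sketch of M_coord: an affine map taking the padded keypoint bounding box
    to a fixed-size crop (out_size x out_size). Assumes kp2d has shape (K, 2)
    in screen coordinates; parameter names are illustrative.
    """
    lo, hi = kp2d.min(axis=0), kp2d.max(axis=0)
    center, extent = 0.5 * (lo + hi), (hi - lo).max()
    extent *= 1.0 + 2.0 * pad                      # 10% padding on each side
    scale = out_size / extent
    # x_crop = scale * (x_screen - center) + out_size / 2, written as a 3x3 matrix
    M = np.array([[scale, 0.0, out_size / 2 - scale * center[0]],
                  [0.0, scale, out_size / 2 - scale * center[1]],
                  [0.0, 0.0, 1.0]])
    kp_crop = (M @ np.c_[kp2d, np.ones(len(kp2d))].T).T[:, :2]
    return M, kp_crop
```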

Image and Mask Prediction. For training, we assume that we have a large number of images of characters in different poses. Since it is important to provide images covering a wide variety of poses, we utilize a network architecture which allows us to train a single network on multiple characters, thereby increasing the number of available poses.

For each training image of a character, we extract 2D joint positions and segmentation masks. Thanks to large-scale labeled data for these tasks, such as [21], reliable estimation is possible. However, because positions of occluded joints cannot be predicted, we must be careful to simulate occluded joints at test time, as described in Sect. 3.2.

Because joints and masks are estimated automatically, we only require a large number of images of characters, which can be obtained efficiently from video recordings. In order to make use of training data obtained from multiple characters, our network must be able to disentangle the pose from the character’s appearance, a task which has also been considered in [10]. Following this line of work, our network has two inputs, one for the joint positions and one for the character image. For preprocessing, the joint positions are converted into a stickman image so that skip connections can be utilized as in a U-Net architecture [28]. Furthermore, body parts are cropped from the character image to make sure that the network has to use the stickman image to infer joint locations. The training objective of the network is then given by reconstruction tasks on the original image as well as on the segmentation mask. For the mask, we use a pixelwise L1 loss, and for the images we use a perceptual loss, which is highly effective as a differentiable metric for the perceptual similarity of images [38].
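The combined training objective could look roughly like the following PyTorch sketch, which pairs a pixelwise L1 loss on the mask with a VGG-feature-based perceptual loss on the image. The choice of VGG16, the selected layers and the loss weighting are assumptions for illustration; the paper relies on the perceptual loss of [38], which may differ in detail (the sketch assumes torchvision ≥ 0.13 for the weights argument).

```python
import torch
import torch.nn.functional as F
import torchvision

class PerceptualLoss(torch.nn.Module):
    """Compares images in the feature space of a fixed VGG16 (one possible
    instantiation of a perceptual loss; see [38] for the formulation the paper cites)."""
    def __init__(self, layers=(3, 8, 15, 22)):
        super().__init__()
        self.vgg = torchvision.models.vgg16(weights="IMAGENET1K_V1").features.eval()
        for p in self.vgg.parameters():
            p.requires_grad_(False)
        self.layers = set(layers)

    def forward(self, x, y):
        # x, y: ImageNet-normalized RGB tensors of shape (N, 3, H, W)
        loss = 0.0
        for idx, layer in enumerate(self.vgg):
            x, y = layer(x), layer(y)
            if idx in self.layers:
                loss = loss + F.l1_loss(x, y)
        return loss

def training_loss(pred_img, pred_mask, target_img, target_mask, perceptual, w_mask=1.0):
    # pixelwise L1 on the segmentation mask, perceptual loss on the RGB image
    return perceptual(pred_img, target_img) + w_mask * F.l1_loss(pred_mask, target_mask)
```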

Merge with Scene. Finally, we use the inverse coordinate normalization \(M_\text {coord}^{-1}\) to transform the rendering and the mask back to their original screen coordinates. To integrate the character into the virtual environment, we use alpha blending between the rendered virtual environment and the character rendering; see also Fig. 2. In the case of behavior cloning from video, we first use the mask of the generated image to perform image inpainting on the original frame. The full pipeline for this case is shown in Fig. 3.
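A sketch of this compositing step, assuming OpenCV for the inverse warp and images given as float arrays in [0, 1] (function and argument names are illustrative):

```python
import cv2
import numpy as np

def merge_with_scene(rgb_crop, mask_crop, M, scene):
    """Undo the coordinate normalization and alpha-blend the rendered character
    into the scene (virtual environment, or an inpainted video frame for cloning).
    `M` is the 3x3 normalization matrix from the previous step; the forward map
    from crop back to screen coordinates is its inverse.
    """
    h, w = scene.shape[:2]
    M_inv = np.linalg.inv(M)
    rgb = cv2.warpPerspective(rgb_crop.astype(np.float32), M_inv, (w, h),
                              flags=cv2.INTER_LINEAR)
    mask = cv2.warpPerspective(mask_crop.astype(np.float32), M_inv, (w, h),
                               flags=cv2.INTER_LINEAR)
    if mask.ndim == 2:                      # broadcast a single-channel alpha over RGB
        mask = mask[..., None]
    return mask * rgb + (1.0 - mask) * scene
```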

4 Experiments

We evaluate our rendering framework qualitatively and quantitatively using a dataset consisting of video sequences of three persons. The subjects were filmed performing various actions such as walking, running, dancing, jumping and crouching. The recordings were done according to three settings: first, a free setting, where actors were encouraged to perform a wide variety of movements without restrictions; for evaluation purposes, we additionally collected videos in two restricted settings, where the actors were limited to standing still and to performing only a walking motion, respectively. Filming was done using a Panasonic GH2 in full HD 1080p at 24 frames per second. Each individual is shown in approximately 10 000 frames. For training our conditional generative model, all frames are annotated with keypoints using the OpenPose [6] library. Additionally, a mask covering the person in each frame is computed using the DeepLab [8] toolbox. Note that filming could also be done with a cellphone camera or similar, making this approach feasible for a wide range of audiences.

4.1 Qualitative Results in Virtual Environment

For a qualitative evaluation of the rendering capabilities of our framework in a virtual, interactive environment, we adapted the testing API of PFNet [13]. Our rendering model is trained on all training images of our three subjects. During inference, we extract the simulated keypoints from the API, project them from 3D world to 2D view coordinates while accounting for the domain shift, and condition the appearance rendering on the resulting output. Figures 6 and 7 show our renderings for two different simulated walking scenarios, given the shape conditioning and different person appearances. Furthermore, Figs. 8 and 9 demonstrate the need for self-occlusion handling. Additionally, Fig. 10 shows an ablation experiment, where the same behavior as in Fig. 9 is rendered using nearest-neighbor frames from the training set. This experiment clearly demonstrates the ability of our model to interpolate between the training images, generating smooth transitions in appearance while simulating a given behavior. A video with examples can be found at https://compvis.github.io/hbugen2018/.

Fig. 6.

Walking sequence perpendicular to the viewer. Note how the model consistently manages to generate the same pose for each appearance in each frame.

Fig. 7.

Walking sequence towards the viewer at an angle. Note that our model can generalize not only to different poses, but also to different perspectives.

Fig. 8.

Walking in circles without occlusion modeling. Note how at time steps \(t = 0\) to 6 eyes appear on the back of the heads of the three characters. Compare this to Fig. 9, where occlusion modeling is applied.

Fig. 9.

Walking in circles with occlusion modeling. Note how at the first 6 time steps there are no eyes at the back of the heads, as opposed to Fig. 8.

Fig. 10.

Walking in circles with occlusion modeling. Here we show the nearest neighbors to each pose. Note how the images either change considerably from time step to time step or simply stay the same. There is also no consistency of pose across appearances, as can be seen by comparing the images at a single time step.

4.2 Qualitative Results on Video Data

We now show additional qualitative results by simulating different behaviors as shown in video sequences. We trained our model using the full training sets of our test subjects and applied it to keypoint trajectories extracted from the PennAction dataset [39], which contains 2326 video sequences of 15 different sports categories. The dataset exhibits unconstrained activities including complex articulations and self-occlusion. Figure 11 shows example renderings randomly selected from different activities, simulated by different persons. A video with examples can be found at https://compvis.github.io/hbugen2018/. Furthermore, Figs. 12, 13, 14 and 15 illustrate re-enactments of exemplary target behaviors by temporally sampling the source videos. Conditioned on the pose estimated from the individual frames, we infer a new appearance and project the rendering back into the source frame. Thus we are able to simulate the given activities with any person of our choice.

Fig. 11.

Using randomly sampled poses from the PennAction dataset, our model is able to generate realistic renderings conditioned on different appearances.

Table 1. Structural similarity scores (SSIM) for different training settings. ‘query’ refers to test person data only and ‘augmented’ refers to additional data augmentation. ‘NN’ denotes the nearest neighbor retrieval results.
Fig. 12.

Re-enactment of ‘Baseball swing’. Green illustrates the target behavior and red its simulation based on different appearances. Frames are uniformly sampled in time. (Color figure online)

Fig. 13.

Re-enactment of ‘jumping rope’. Green illustrates the target behavior and red its simulation based on different appearances. Frames are uniformly sampled in time. (Color figure online)

4.3 Quantitative Evaluation of Pose Generalization

Let us now quantitatively evaluate the ability of our model to generalize to unseen postures. For this experiment we train our model on three different training subsets of varying variance in body articulation, each featuring only a single person: (i) only images showing the person standing with relaxed arms, (ii) only images showing the test person walking up and down, and (iii) the person’s full training set. Moreover, we also train models for each of these settings with additional data augmentation by adding the full training sets of the remaining test subjects. We then compute the mean structural similarity score (SSIM) [36] between ground-truth test images and renderings of our model based on their extracted postures. As a baseline we use nearest-neighbor retrievals from the different training sets, also based on the extracted keypoints. Table 1 summarizes the results. As one can see, the quality of the renderings improves with increasing variability of the training poses. Moreover, data augmentation in the form of additional images of other persons helps our model to interpolate between the training poses of the actual test subject and thus improves its generalization ability. Note that on average our model outperforms the baseline by 9.5%, which shows that our model actually learns the mapping between shape and appearance.
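The evaluation metric itself is straightforward to reproduce, e.g. with scikit-image (≥ 0.19 for the channel_axis argument); the sketch below computes the mean SSIM over corresponding frames with default SSIM parameters, which may differ from the exact settings used for Table 1.

```python
import numpy as np
from skimage.metrics import structural_similarity

def mean_ssim(renderings, ground_truth):
    """Mean SSIM [36] between rendered and ground-truth test frames (uint8 RGB arrays)."""
    scores = [structural_similarity(r, g, channel_axis=-1)
              for r, g in zip(renderings, ground_truth)]
    return float(np.mean(scores))
```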

Fig. 14.

Re-enactment of ‘jumping jack’. Green illustrates the target behavior and red its simulation based on different appearances. Frames are uniformly sampled in time. (Color figure online)

Fig. 15.

Re-enactment of ‘tennis serve’. Green illustrates the target behavior and red its simulation based on different appearances. Frames are uniformly sampled in time. (Color figure online)

5 Conclusion

In this work we presented an approach towards a holistic learning framework for rendering human behavior. Both the rendering of a person’s appearance and the simulation of their natural movements while performing a given behavior are represented by deep neural networks, which can be trained from easily available RGB video data. Our model utilizes a conditional generative model to learn a mapping between an abstract pose representation and the appearance of a person. Using this model, we are able to simulate any kind of behavior conditioned on the appearance of a given test subject, while either directly controlling the behavior in a virtual environment or reenacting recorded video sequences.