
Talking Face Video Generation with Editable Expression

  • Conference paper
  • First Online:
Image and Graphics (ICIG 2021)

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 12890)

Included in the following conference series: International Conference on Image and Graphics (ICIG)

Abstract

In recent years, convolutional neural networks have proven highly successful at generating talking faces. Existing methods combine a single face image with speech to generate a talking face video. The limitation of these methods is that only the lips move in the video, while other facial expressions, such as blinking and eyebrow movements, are missing. To address this problem, this paper proposes an embedding system that tackles talking face video generation from a still image of a person and an audio clip containing speech. Natural expressions can be edited through a high-level structure, i.e., facial landmarks. Compared with direct audio-to-image methods, our approach avoids spurious correlations between audio and visual signals that are unrelated to the speech content. In addition, for the face generation network, we design a face sequence generation method based on single-sample (one-shot) learning.
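
The approach is easiest to see as a two-stage, landmark-based pipeline: speech is first mapped to a per-frame facial-landmark sequence, the landmarks can then be edited (for example, to insert a blink), and a renderer conditioned on the single identity image turns the edited landmarks into video frames. The sketch below illustrates this flow only; the module shapes, the toy renderer, and the add_blink helper are hypothetical placeholders, not the authors' actual architecture.

# A minimal sketch of the two-stage, landmark-based pipeline described in the
# abstract: audio -> per-frame landmarks -> optional expression edits -> frames.
# Module names, feature sizes, and the blink edit are illustrative assumptions.
import torch
import torch.nn as nn

N_LM = 68        # standard 68-point facial landmarks (assumption)
AUDIO_DIM = 80   # e.g. per-frame log-mel features (assumption)

class AudioToLandmarks(nn.Module):
    """Stage 1: predict a landmark sequence from speech features."""
    def __init__(self, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(AUDIO_DIM, hidden, batch_first=True)
        self.head = nn.Linear(hidden, N_LM * 2)

    def forward(self, audio):                       # audio: (B, T, AUDIO_DIM)
        h, _ = self.rnn(audio)                      # (B, T, hidden)
        return self.head(h).view(audio.size(0), -1, N_LM, 2)

class LandmarksToFrames(nn.Module):
    """Stage 2: render frames from landmarks plus the single identity image."""
    def __init__(self, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.id_enc = nn.Sequential(
            nn.Flatten(), nn.Linear(3 * img_size * img_size, 256))
        self.dec = nn.Linear(N_LM * 2 + 256, 3 * img_size * img_size)

    def forward(self, landmarks, identity_img):     # (B, T, N_LM, 2), (B, 3, H, W)
        B, T = landmarks.shape[:2]
        idv = self.id_enc(identity_img)             # (B, 256) identity code
        x = torch.cat([landmarks.view(B, T, -1),
                       idv.unsqueeze(1).expand(B, T, -1)], dim=-1)
        return self.dec(x).view(B, T, 3, self.img_size, self.img_size)

def add_blink(landmarks, frame_idx, eye_ids=range(36, 48), amount=0.02):
    """Expression edit in landmark space: nudge the eye landmarks downward to
    close the eyes on one frame. Index range and magnitude are placeholders."""
    edited = landmarks.clone()
    edited[:, frame_idx, list(eye_ids), 1] += amount
    return edited

# Usage: one still image plus an audio clip yields an editable talking-face video.
audio = torch.randn(1, 100, AUDIO_DIM)              # 100 frames of speech features
still = torch.randn(1, 3, 64, 64)                   # the single identity image
lm_seq = AudioToLandmarks()(audio)                  # (1, 100, 68, 2)
lm_seq = add_blink(lm_seq, frame_idx=40)            # edit before rendering
frames = LandmarksToFrames()(lm_seq, still)         # (1, 100, 3, 64, 64)

The key design point the sketch makes concrete is that all expression editing happens in landmark space, between the two stages, so the renderer never needs to disentangle speech-driven motion from edited motion.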



Author information


Corresponding author

Correspondence to Bin Liu.



Copyright information

© 2021 Springer Nature Switzerland AG

About this paper


Cite this paper

Song, L., Liu, B., Yu, N. (2021). Talking Face Video Generation with Editable Expression. In: Peng, Y., Hu, SM., Gabbouj, M., Zhou, K., Elad, M., Xu, K. (eds) Image and Graphics. ICIG 2021. Lecture Notes in Computer Science, vol. 12890. Springer, Cham. https://doi.org/10.1007/978-3-030-87361-5_61


  • DOI: https://doi.org/10.1007/978-3-030-87361-5_61

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-87360-8

  • Online ISBN: 978-3-030-87361-5

  • eBook Packages: Computer Science, Computer Science (R0)
