Abstract
This paper introduces LessonAble, a pipelined methodology that leverages the concept of Deep Fakes to generate MOOC (Massive Open Online Course) visual content directly from a lesson narrative. The proposed pipeline consists of three main modules: audio generation, video generation, and lip-syncing. In this work, we use the NVIDIA Tacotron2 text-to-speech model to generate custom speech from text, adapt the well-known First Order Motion Model to generate the video sequence from different driving sequences and target images, and modify the Wav2Lip model to handle lip-syncing. Moreover, we introduce novel strategies that support markdown-like formatting to guide the pipeline in generating expression-aware (e.g. curious, happy) content. Despite building on third-party modules, developing such a pipeline presented interesting challenges, all of which are analysed and reported in this work. The result is an extremely intuitive tool to support MOOC content generation.
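To illustrate the idea of expression-aware generation, the following is a minimal sketch of how a markdown-like annotated narrative could be split into expression-tagged segments before being handed to the audio and video modules. The bracket tag syntax (`[happy]`, `[curious]`) and the `parse_narrative` helper are hypothetical illustrations, not the paper's actual format.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    expression: str  # e.g. "neutral", "curious", "happy"
    text: str


# Hypothetical markdown-like tag: "[happy] some text" switches the
# expression used for everything until the next tag.
TAG = re.compile(r"\[(\w+)\]\s*")


def parse_narrative(narrative: str, default: str = "neutral") -> list[Segment]:
    """Split a lesson narrative into expression-tagged segments."""
    segments, expression, pos = [], default, 0
    for m in TAG.finditer(narrative):
        chunk = narrative[pos:m.start()].strip()
        if chunk:
            segments.append(Segment(expression, chunk))
        expression, pos = m.group(1), m.end()
    tail = narrative[pos:].strip()
    if tail:
        segments.append(Segment(expression, tail))
    return segments
```

Each resulting segment would then drive one pass of the pipeline: its text goes to the TTS module, and its expression selects the driving sequence for the video-generation module.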
References
Bernard, M., Titeux, H.: Phonemizer: Text to phones transcription for multiple languages in python. J. Open Source Softw. 6(68), 3958 (2021). https://doi.org/10.21105/joss.03958
Favaro, A., Sbattella, L., Tedesco, R., Scotti, V.: ITAcotron 2: transfering English speech synthesis architectures and speech features to Italian. In: Proceedings of The Fourth International Conference on Natural Language and Speech Processing (ICNLSP 2021), pp. 83–88. Association for Computational Linguistics, Trento, Italy, 12–13 Nov 2021. https://aclanthology.org/2021.icnlsp-1.10
Fried, O., et al.: Text-based editing of talking-head video. CoRR abs/1906.01524 (2019). http://arxiv.org/abs/1906.01524
Jamaludin, A., Chung, J.S., Zisserman, A.: You said that? Synthesising talking faces from audio. Int. J. Comput. Vis. 127, December 2019. https://doi.org/10.1007/s11263-019-01150-y
Nguyen, T.T., Nguyen, C.M., Nguyen, D.T., Nguyen, D.T., Nahavandi, S.: Deep learning for deepfakes creation and detection. CoRR abs/1909.11573 (2019). http://arxiv.org/abs/1909.11573
Post, M.: A call for clarity in reporting BLEU scores. CoRR abs/1804.08771 (2018). http://arxiv.org/abs/1804.08771
Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V., Jawahar, C.V.: A lip sync expert is all you need for speech to lip generation in the wild. CoRR abs/2008.10010 (2020). https://arxiv.org/abs/2008.10010
Prajwal, K.R., Mukhopadhyay, R., Philip, J., Jha, A., Namboodiri, V., Jawahar, C.V.: Towards automatic face-to-face translation. CoRR abs/2003.00418 (2020). https://arxiv.org/abs/2003.00418
Reich, J.: Rebooting MOOC research. Science 347(6217), 34–35 (2015). https://doi.org/10.1126/science.1261627
Shen, J., et al.: Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. CoRR abs/1712.05884 (2017). http://arxiv.org/abs/1712.05884
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: Animating arbitrary objects via deep motion transfer. CoRR abs/1812.08861 (2018). http://arxiv.org/abs/1812.08861
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/31c0b36aef265d9221af80872ceb62f9-Paper.pdf
Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: Audio-driven facial reenactment. CoRR abs/1912.05566 (2019). http://arxiv.org/abs/1912.05566
Wiles, O., Koepke, A.S., Zisserman, A.: X2face: A network for controlling face generation by using images, audio, and pose codes. CoRR abs/1807.10550 (2018). http://arxiv.org/abs/1807.10550
Acknowledgements
We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support within the projects IsC80_FEAD-D and IsC93_FEAD-DII. We also acknowledge the NVIDIA AI Technology Center, EMEA, for its support and access to computing resources, and the Federica Web Learning University center for providing Professor Sansone's videos.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sannino, C., Gravina, M., Marrone, S., Fiameni, G., Sansone, C. (2022). LessonAble: Leveraging Deep Fakes in MOOC Content Creation. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13231. Springer, Cham. https://doi.org/10.1007/978-3-031-06427-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06426-5
Online ISBN: 978-3-031-06427-2