Abstract
This paper introduces LessonAble, a pipelined methodology that leverages the concept of Deep Fakes to generate MOOC (Massive Open Online Course) visual content directly from a lesson narrative. The proposed pipeline consists of three main modules: audio generation, video generation, and lip-syncing. In this work, we use the NVIDIA Tacotron2 text-to-speech model to generate custom speech from text, adapt the well-known First Order Motion Model to generate the video sequence from different driving sequences and target images, and modify the Wav2Lip model to handle lip-syncing. Moreover, we introduce novel strategies that support markdown-like formatting to guide the pipeline in generating expression-aware (e.g. curious, happy) content. Despite building on third-party modules, developing such a pipeline presented interesting challenges, all of which are analysed and reported in this work. The result is an extremely intuitive tool to support MOOC content generation.
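To illustrate the idea of expression-aware generation, the following is a minimal sketch of how a markdown-like annotated narrative could be split into expression-tagged segments before being handed to the audio and video modules. The bracket tag syntax (`[happy]`, `[curious]`) and the `parse_narrative` helper are hypothetical illustrations, not the paper's actual format.

```python
import re
from dataclasses import dataclass


@dataclass
class Segment:
    expression: str  # e.g. "neutral", "curious", "happy"
    text: str


# Hypothetical markdown-like tag: "[happy] some text" switches the
# expression used for everything until the next tag.
TAG = re.compile(r"\[(\w+)\]\s*")


def parse_narrative(narrative: str, default: str = "neutral") -> list[Segment]:
    """Split a lesson narrative into expression-tagged segments."""
    segments, expression, pos = [], default, 0
    for m in TAG.finditer(narrative):
        chunk = narrative[pos:m.start()].strip()
        if chunk:
            segments.append(Segment(expression, chunk))
        expression, pos = m.group(1), m.end()
    tail = narrative[pos:].strip()
    if tail:
        segments.append(Segment(expression, tail))
    return segments
```

Each resulting segment would then drive one pass of the pipeline: its text goes to the TTS module, and its expression selects the driving sequence for the video-generation module.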
References
Bernard, M., Titeux, H.: Phonemizer: Text to phones transcription for multiple languages in python. J. Open Source Softw. 6(68), 3958 (2021). https://doi.org/10.21105/joss.03958
Favaro, A., Sbattella, L., Tedesco, R., Scotti, V.: ITAcotron 2: transfering English speech synthesis architectures and speech features to Italian. In: Proceedings of The Fourth International Conference on Natural Language and Speech Processing (ICNLSP 2021), pp. 83–88. Association for Computational Linguistics, Trento, Italy, 12–13 Nov 2021. https://aclanthology.org/2021.icnlsp-1.10
Fried, O., et al.: Text-based editing of talking-head video. CoRR abs/1906.01524 (2019). http://arxiv.org/abs/1906.01524
Jamaludin, A., Chung, J.S., Zisserman, A.: You said that? Synthesising talking faces from audio. Int. J. Comput. Vis. 127, December 2019. https://doi.org/10.1007/s11263-019-01150-y
Nguyen, T.T., Nguyen, C.M., Nguyen, D.T., Nguyen, D.T., Nahavandi, S.: Deep learning for deepfakes creation and detection. CoRR abs/1909.11573 (2019). http://arxiv.org/abs/1909.11573
Post, M.: A call for clarity in reporting BLEU scores. CoRR abs/1804.08771 (2018). http://arxiv.org/abs/1804.08771
Prajwal, K.R., Mukhopadhyay, R., Namboodiri, V., Jawahar, C.V.: A lip sync expert is all you need for speech to lip generation in the wild. CoRR abs/2008.10010 (2020). https://arxiv.org/abs/2008.10010
Prajwal, K.R., Mukhopadhyay, R., Philip, J., Jha, A., Namboodiri, V., Jawahar, C.V.: Towards automatic face-to-face translation. CoRR abs/2003.00418 (2020). https://arxiv.org/abs/2003.00418
Reich, J.: Rebooting MOOC research. Science 347(6217), 34–35 (2015). https://doi.org/10.1126/science.1261627
Shen, J., et al.: Natural TTS synthesis by conditioning wavenet on mel spectrogram predictions. CoRR abs/1712.05884 (2017). http://arxiv.org/abs/1712.05884
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: Animating arbitrary objects via deep motion transfer. CoRR abs/1812.08861 (2018). http://arxiv.org/abs/1812.08861
Siarohin, A., Lathuilière, S., Tulyakov, S., Ricci, E., Sebe, N.: First order motion model for image animation. In: Wallach, H., Larochelle, H., Beygelzimer, A., d’Alché-Buc, F., Fox, E., Garnett, R. (eds.) Advances in Neural Information Processing Systems, vol. 32. Curran Associates, Inc. (2019). https://proceedings.neurips.cc/paper/2019/file/31c0b36aef265d9221af80872ceb62f9-Paper.pdf
Thies, J., Elgharib, M., Tewari, A., Theobalt, C., Nießner, M.: Neural voice puppetry: Audio-driven facial reenactment. CoRR abs/1912.05566 (2019). http://arxiv.org/abs/1912.05566
Wiles, O., Koepke, A.S., Zisserman, A.: X2face: A network for controlling face generation by using images, audio, and pose codes. CoRR abs/1807.10550 (2018). http://arxiv.org/abs/1807.10550
Acknowledgements
We acknowledge the CINECA award under the ISCRA initiative for the availability of high-performance computing resources and support within the projects IsC80_FEAD-D and IsC93_FEAD-DII. We also acknowledge the NVIDIA AI Technology Center, EMEA, for its support and access to computing resources, and the Federica Web Learning University center for providing Professor Sansone's videos.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Sannino, C., Gravina, M., Marrone, S., Fiameni, G., Sansone, C. (2022). LessonAble: Leveraging Deep Fakes in MOOC Content Creation. In: Sclaroff, S., Distante, C., Leo, M., Farinella, G.M., Tombari, F. (eds) Image Analysis and Processing – ICIAP 2022. ICIAP 2022. Lecture Notes in Computer Science, vol 13231. Springer, Cham. https://doi.org/10.1007/978-3-031-06427-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-06426-5
Online ISBN: 978-3-031-06427-2