Abstract
Voice synthesis with machine learning techniques has progressed rapidly in recent years [1]. Given the current increase in the use of conferencing tools for online teaching, we ask how easy it would be, in terms of required data, hardware, and skill set, to create a convincing voice fake. We analyse how much training data a participant (e.g. a student) would actually need to fake another participant’s voice (e.g. a professor’s). We provide an analysis of the existing state of the art in creating voice deep fakes and align the identified optimization techniques, as well as our own, in the context of two different voice data sets. A user study with more than 100 participants shows how difficult it is to distinguish real from fake voices (on average, only 37% can recognize a professor’s fake voice). From a longer-term societal perspective, such voice deep fakes may lead to disbelief by default.
References
Wang, Y., et al.: Tacotron: Towards end-to-end speech synthesis (2017)
Stupp, C.: Fraudsters Used AI to Mimic CEO’s Voice in Unusual Cybercrime Case (2019). https://www.wsj.com/articles/fraudsters-use-ai-to-mimic-ceos-voice-in-unusual-cybercrime-case-11567157402. Accessed 14 July 2021
Shen, J., et al.: Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions (2018)
Łańcucki, A.: FastPitch: Parallel text-to-speech with pitch prediction (2021)
Ren, Y., et al.: FastSpeech 2: Fast and high-quality end-to-end text to speech (2021)
van den Oord, A., et al.: WaveNet: A generative model for raw audio (2016)
Barnekow, V., Binder, D., Kromrey, N., Munaretto, P., Schaad, A., Schmieder, F.: Creation and detection of German voice deepfakes (2021)
NVIDIA: Deep Learning Performance Documentation (2021). https://docs.nvidia.com/deeplearning/performance/mixed-precision-training. Accessed 31 Mar 2021
Prenger, R., Valle, R., Catanzaro, B.: WaveGlow: A flow-based generative network for speech synthesis (2018)
Kumar, K., et al.: MelGAN: Generative adversarial networks for conditional waveform synthesis (2019)
Yamamoto, R., Song, E., Kim, J.-M.: Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram (2020)
Maccarone, T.J.: The biphase explained: understanding the asymmetries in coupled Fourier components of astronomical time series. Monthly Notices Roy. Astron. Soc. 435(4), 3547–3558 (2013). ISSN: 0035-8711. https://doi.org/10.1093/mnras/stt1546
AlBadawy, E.A., Lyu, S., Farid, H.: Detecting AI-synthesized speech using bispectral analysis. In: CVPR Workshops, pp. 104–109 (2019)
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
Barnekow, V., Binder, D., Kromrey, N., Munaretto, P., Schaad, A., Schmieder, F. (2022). Creation and Detection of German Voice Deepfakes. In: Aïmeur, E., Laurent, M., Yaich, R., Dupont, B., Garcia-Alfaro, J. (eds) Foundations and Practice of Security. FPS 2021. Lecture Notes in Computer Science, vol 13291. Springer, Cham. https://doi.org/10.1007/978-3-031-08147-7_24
DOI: https://doi.org/10.1007/978-3-031-08147-7_24
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-08146-0
Online ISBN: 978-3-031-08147-7
eBook Packages: Computer Science, Computer Science (R0)