ABSTRACT
This paper demonstrates CNN-based neural style transfer on audio to make storytelling a personalized experience: users record a few sentences, which are then used to mimic their voice. The user recordings are converted to spectrograms, whose style is transferred to the spectrogram of a base voice narrating the story, analogous to neural style transfer on images. This approach stands out because it needs only a small dataset and therefore little time to train the model. The project is intended specifically for children who prefer digital interaction and are increasingly leaving the storytelling culture behind, and for working parents who are unable to spend enough time with their children. By using a parent's initial recording to narrate a given story, the system is designed to bridge storytelling and screen time: the implicit ethical themes of the stories hold children's interest, while hearing a loved one's voice connects them to family and ensures an innocuous, meaningful learning experience.
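The pipeline the abstract describes rests on two ingredients: a magnitude spectrogram of each recording, and a style objective that compares correlations between frequency channels, in the spirit of Gram-matrix losses from image style transfer. The sketch below illustrates both with plain NumPy; it is a minimal illustration under assumed parameters (512-sample Hann window, 128-sample hop, synthetic sine tones in place of real recordings), not the paper's actual implementation, which uses librosa and a CNN.

```python
import numpy as np

def spectrogram(signal, n_fft=512, hop=128):
    """Magnitude spectrogram via a Hann-windowed STFT, shape (freq_bins, frames)."""
    window = np.hanning(n_fft)
    frames = [signal[i:i + n_fft] * window
              for i in range(0, len(signal) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

def gram(spec):
    """Gram matrix of frequency-channel correlations, as in image style transfer."""
    return (spec @ spec.T) / spec.shape[1]

def style_loss(content_spec, style_spec):
    """Squared Frobenius distance between the two Gram matrices."""
    return float(np.sum((gram(content_spec) - gram(style_spec)) ** 2))

# Toy usage: two one-second 16 kHz tones stand in for the base narrator
# and the parent's recording.
sr = 16000
t = np.arange(sr) / sr
base = np.sin(2 * np.pi * 220 * t)
user = np.sin(2 * np.pi * 440 * t)
s_base, s_user = spectrogram(base), spectrogram(user)
print(s_base.shape)  # (257, 122): 512-point rFFT bins x hop-spaced frames
print(style_loss(s_base, s_user) > 0)
```

In a full system this loss would be minimized by gradient descent over the content spectrogram (through a CNN's feature maps rather than raw bins), and the optimized spectrogram inverted back to audio.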
Index Terms: Neural Style Transfer Based Voice Mimicking for Personalized Audio Stories