
VoCo: text-based insertion and replacement in audio narration

Published: 20 July 2017

Abstract

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration, performing select, cut, copy, and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase so that it blends seamlessly into the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
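As a rough illustration of the transcript-driven editing the abstract describes, the sketch below maps a word-level transcript edit onto the waveform. It is hypothetical, not the paper's implementation: it assumes forced alignment has already produced a (start, end) sample range for each word, that the replacement audio has been synthesized elsewhere (in VoCo's case by TTS plus voice conversion), and it uses a simple linear crossfade at each splice point, a common way to avoid audible clicks.

```python
def crossfade(a, b, n):
    """Join sample lists a and b, linearly blending the last n samples
    of a with the first n samples of b to smooth the splice."""
    n = min(n, len(a), len(b))
    mixed = [a[len(a) - n + i] * (1 - (i + 1) / n) + b[i] * ((i + 1) / n)
             for i in range(n)]
    return a[:len(a) - n] + mixed + b[n:]

def replace_word(waveform, word_spans, word, new_clip, fade=64):
    """Splice new_clip into waveform in place of `word`, whose sample
    range (from forced alignment of the transcript) is in word_spans."""
    start, end = word_spans[word]
    head, tail = waveform[:start], waveform[end:]
    return crossfade(crossfade(head, new_clip, fade), tail, fade)
```

A cut or copy works the same way: the editor manipulates text, and the word-to-sample mapping turns that into a waveform splice; the hard part the paper addresses is producing `new_clip` in a voice that matches the narration.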

Supplementary Material

ZIP File (a96-jin.zip)
Supplemental files.




    Published In
    ACM Transactions on Graphics  Volume 36, Issue 4
    August 2017
    2155 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3072959
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. audio
    2. human computer interaction

    Qualifiers

    • Research-article

    Cited By
    • (2024) E³TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 4810-4821. DOI: 10.1109/TASLP.2024.3485466
    • (2024) Cross-Utterance Conditioned VAE for Speech Generation. IEEE/ACM Transactions on Audio, Speech and Language Processing 32, 4263-4276. DOI: 10.1109/TASLP.2024.3453598
    • (2023) Audio deepfakes: A survey. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.1001063
    • (2023) The Design and Implementation of a Steganographic Communication System over In-Band Acoustical Channels. ACM Transactions on Sensor Networks 19, 4, 1-25. DOI: 10.1145/3587162
    • (2023) Audio Deepfake Approaches. IEEE Access 11, 132652-132682. DOI: 10.1109/ACCESS.2023.3333866
    • (2023) Deepfakes, Fake Barns, and Knowledge from Videos. Synthese 201, 2. DOI: 10.1007/s11229-022-04033-x
    • (2022) Record Once, Post Everywhere: Automatic Shortening of Audio Stories for Social Media. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1-11. DOI: 10.1145/3526113.3545680
    • (2022) CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 2241-2254. DOI: 10.1109/TASLP.2022.3190717
    • (2022) CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction. In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 81-85. DOI: 10.1109/ISCSLP57327.2022.10038107
    • (2022) Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6082-6086. DOI: 10.1109/ICASSP43922.2022.9746765