
VoCo: text-based insertion and replacement in audio narration

Published: 20 July 2017

Abstract

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state-of-the-art systems allow the editor to work in a text transcript of the narration, performing select, cut, copy, and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase so that it blends seamlessly into the context of the existing narration. Our approach is to use a text-to-speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor's own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.
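As a rough illustration of the transcript-driven editing the abstract describes, the sketch below maps a word-level transcript edit onto the waveform. It is hypothetical, not the paper's implementation: it assumes forced alignment has already produced a (start, end) sample range for each word, that the replacement audio has been synthesized elsewhere (in VoCo's case by TTS plus voice conversion), and it uses a simple linear crossfade at each splice point, a common way to avoid audible clicks.

```python
def crossfade(a, b, n):
    """Join sample lists a and b, linearly blending the last n samples
    of a with the first n samples of b to smooth the splice."""
    n = min(n, len(a), len(b))
    mixed = [a[len(a) - n + i] * (1 - (i + 1) / n) + b[i] * ((i + 1) / n)
             for i in range(n)]
    return a[:len(a) - n] + mixed + b[n:]

def replace_word(waveform, word_spans, word, new_clip, fade=64):
    """Splice new_clip into waveform in place of `word`, whose sample
    range (from forced alignment of the transcript) is in word_spans."""
    start, end = word_spans[word]
    head, tail = waveform[:start], waveform[end:]
    return crossfade(crossfade(head, new_clip, fade), tail, fade)
```

A cut or copy works the same way: the editor manipulates text, and the word-to-sample mapping turns that into a waveform splice; the hard part the paper addresses is producing `new_clip` in a voice that matches the narration.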

Supplementary Material

ZIP File (a96-jin.zip)
Supplemental files.




    Published In
    ACM Transactions on Graphics  Volume 36, Issue 4
    August 2017
    2155 pages
    ISSN:0730-0301
    EISSN:1557-7368
    DOI:10.1145/3072959
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. audio
    2. human computer interaction

    Qualifiers

    • Research-article

    Cited By
    • (2024) E³TTS: End-to-End Text-Based Speech Editing TTS System and Its Applications. IEEE/ACM Transactions on Audio, Speech, and Language Processing 32, 4810-4821. DOI: 10.1109/TASLP.2024.3485466
    • (2024) Cross-Utterance Conditioned VAE for Speech Generation. IEEE/ACM Transactions on Audio, Speech and Language Processing 32, 4263-4276. DOI: 10.1109/TASLP.2024.3453598
    • (2023) Audio deepfakes: A survey. Frontiers in Big Data 5. DOI: 10.3389/fdata.2022.1001063
    • (2023) The Design and Implementation of a Steganographic Communication System over In-Band Acoustical Channels. ACM Transactions on Sensor Networks 19, 4, 1-25. DOI: 10.1145/3587162
    • (2023) Audio Deepfake Approaches. IEEE Access 11, 132652-132682. DOI: 10.1109/ACCESS.2023.3333866
    • (2023) Deepfakes, Fake Barns, and Knowledge from Videos. Synthese 201, 2. DOI: 10.1007/s11229-022-04033-x
    • (2022) Record Once, Post Everywhere: Automatic Shortening of Audio Stories for Social Media. In Proceedings of the 35th Annual ACM Symposium on User Interface Software and Technology, 1-11. DOI: 10.1145/3526113.3545680
    • (2022) CampNet: Context-Aware Mask Prediction for End-to-End Text-Based Speech Editing. IEEE/ACM Transactions on Audio, Speech, and Language Processing 30, 2241-2254. DOI: 10.1109/TASLP.2022.3190717
    • (2022) CorrectSpeech: A Fully Automated System for Speech Correction and Accent Reduction. In 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), 81-85. DOI: 10.1109/ISCSLP57327.2022.10038107
    • (2022) Context-Aware Mask Prediction Network for End-to-End Text-Based Speech Editing. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6082-6086. DOI: 10.1109/ICASSP43922.2022.9746765