Abstract
Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of him- or herself performing it. In the field of speech-language pathology, VSM has been used successfully to treat language deficits in children with autism and stuttering in individuals with fluency disorders. Technical challenges remain in creating VSM content that depicts previously unseen behaviors. In this paper, we propose a novel system that synthesizes new video sequences for VSM treatment of patients with voice disorders. Starting with a video recording of a voice-disorder patient, the proposed system replaces the hoarse speech with clean, healthy speech that resembles the patient’s original voice. The replacement speech is synthesized either with a text-to-speech engine or by selecting from a database of clean speech recordings using a voice similarity metric. To realign the replacement speech with the original video, a novel audiovisual algorithm that combines audio segmentation with lip-state detection identifies corresponding time markers in the audio and video tracks. Lip synchronization is then accomplished with an adaptive video resampling scheme that minimizes motion jitter and preserves spatial sharpness. Results of both objective measurements and subjective evaluations on a dataset of 31 subjects demonstrate the effectiveness of the proposed techniques.
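The lip-synchronization step described above — remapping original video frames onto the replacement-speech timeline using corresponding time markers — can be sketched as follows. This is a minimal illustration, not the paper's exact adaptive scheme: the piecewise-linear warp via `np.interp` and the nearest-frame selection rule are assumptions, and `realign_frames` with its parameters is a hypothetical interface.

```python
import numpy as np

def realign_frames(video_markers, audio_markers, fps, duration):
    """Map each output frame time (on the replacement-audio timeline)
    to a source video frame index.

    video_markers / audio_markers: corresponding anchor times (seconds)
    in the original video and the replacement audio, respectively.
    fps: frame rate of the original video; duration: replacement-audio
    length in seconds.
    """
    n_out = int(round(duration * fps))
    out_times = np.arange(n_out) / fps
    # Piecewise-linear time warp: interpolate between matched markers.
    src_times = np.interp(out_times, audio_markers, video_markers)
    # Select the nearest original frame rather than blending frames,
    # which preserves spatial sharpness at the cost of slight jitter.
    return np.round(src_times * fps).astype(int)
```

For instance, if a marker at 1.0 s in the original video corresponds to 2.0 s in the replacement audio, the frames before that marker are stretched to play at roughly half speed, with each output frame drawn from the closest source frame.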
Notes
The script can be found at http://vis.uky.edu/nsf-autism/speaktome
Acknowledgments
Part of this material is based upon work supported by the National Science Foundation under Grant No. 1237134. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Cite this article
Shen, J., Ti, C., Raghunathan, A. et al. Automatic video self modeling for voice disorder. Multimed Tools Appl 74, 5329–5351 (2015). https://doi.org/10.1007/s11042-014-2015-1