Abstract
Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of him- or herself performing it. In the field of speech-language pathology, VSM has been used successfully to treat language deficits in children with autism and stuttering in individuals with fluency disorders. Technical challenges remain in creating VSM content that depicts previously unseen behaviors. In this paper, we propose a novel system that synthesizes new video sequences for VSM treatment of patients with voice disorders. Starting with a video recording of a voice-disorder patient, the proposed system replaces the hoarse speech with clean, healthy speech that resembles the patient’s original voice. The replacement speech is synthesized either with a text-to-speech engine or by selecting from a database of clean speech recordings using a voice similarity metric. To realign the replacement speech with the original video, a novel audiovisual algorithm that combines audio segmentation with lip-state detection identifies corresponding time markers in the audio and video tracks. Lip synchronization is then accomplished with an adaptive video resampling scheme that minimizes motion jitter and preserves spatial sharpness. Results of both objective measurements and subjective evaluations on a dataset of 31 subjects demonstrate the effectiveness of the proposed techniques.
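The lip-synchronization step described above — remapping original video frames onto the replacement-speech timeline using corresponding time markers — can be sketched as follows. This is a minimal illustration, not the paper's exact adaptive scheme: the piecewise-linear warp via `np.interp` and the nearest-frame selection rule are assumptions, and `realign_frames` with its parameters is a hypothetical interface.

```python
import numpy as np

def realign_frames(video_markers, audio_markers, fps, duration):
    """Map each output frame time (on the replacement-audio timeline)
    to a source video frame index.

    video_markers / audio_markers: corresponding anchor times (seconds)
    in the original video and the replacement audio, respectively.
    fps: frame rate of the original video; duration: replacement-audio
    length in seconds.
    """
    n_out = int(round(duration * fps))
    out_times = np.arange(n_out) / fps
    # Piecewise-linear time warp: interpolate between matched markers.
    src_times = np.interp(out_times, audio_markers, video_markers)
    # Select the nearest original frame rather than blending frames,
    # which preserves spatial sharpness at the cost of slight jitter.
    return np.round(src_times * fps).astype(int)
```

For instance, if a marker at 1.0 s in the original video corresponds to 2.0 s in the replacement audio, the frames before that marker are stretched to play at roughly half speed, with each output frame drawn from the closest source frame.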
Notes
The script can be found at http://vis.uky.edu/nsf-autism/speaktome
Acknowledgments
Part of this material is based upon work supported by the National Science Foundation under Grant No. 1237134. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Cite this article
Shen, J., Ti, C., Raghunathan, A. et al. Automatic video self modeling for voice disorder. Multimed Tools Appl 74, 5329–5351 (2015). https://doi.org/10.1007/s11042-014-2015-1