
Automatic video self modeling for voice disorder

Published in Multimedia Tools and Applications

Abstract

Video self modeling (VSM) is a behavioral intervention technique in which a learner models a target behavior by watching a video of him- or herself performing that behavior. In speech-language pathology, VSM has been used successfully for language treatment in children with autism and for individuals with the fluency disorder of stuttering. Technical challenges remain in creating VSM content that depicts previously unseen behaviors. In this paper, we propose a novel system that synthesizes new video sequences for VSM treatment of patients with voice disorders. Starting with a video recording of a voice-disorder patient, the proposed system replaces the hoarse speech with clean, healthier speech that resembles the patient’s original voice. The replacement speech is synthesized either with a text-to-speech engine or by selecting from a database of clean speech recordings based on a voice similarity metric. To realign the replacement speech with the original video, a novel audiovisual algorithm that combines audio segmentation with lip-state detection is proposed to identify corresponding time markers in the audio and video tracks. Lip synchronization is then accomplished with an adaptive video re-sampling scheme that minimizes motion jitter and preserves spatial sharpness. Results of both objective measurements and subjective evaluations on a dataset of 31 subjects demonstrate the effectiveness of the proposed techniques.
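To make the selection step concrete: one plausible instantiation of the voice similarity metric mentioned above (the paper does not prescribe this particular one) is the cosine similarity between mean MFCC vectors of two recordings. Below is a minimal sketch in Python, assuming librosa for feature extraction; the function names, the 16 kHz sampling rate, and the 13-coefficient MFCC are illustrative choices, not the authors' configuration.

```python
import numpy as np
import librosa  # assumed available; any MFCC extractor would serve


def voice_similarity(path_a, path_b, n_mfcc=13):
    """Cosine similarity between the mean MFCC vectors of two recordings."""
    def embed(path):
        y, sr = librosa.load(path, sr=16000)               # mono, resampled to 16 kHz
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
        return mfcc.mean(axis=1)                           # one vector per recording

    a, b = embed(path_a), embed(path_b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def best_match(patient_clip, database_clips):
    """Pick the clean-speech recording whose voice is closest to the patient's."""
    return max(database_clips, key=lambda clip: voice_similarity(patient_clip, clip))
```

When synthesis rather than selection is used, a text-to-speech engine adapted toward the patient's voice would take the place of best_match.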
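The alignment pipeline, finding corresponding time markers in the audio and video tracks and then re-sampling the video onto the replacement speech's timeline, can likewise be sketched. This is a simplified illustration, not the authors' algorithm: the audio markers here come from plain short-time-energy thresholding rather than the paper's segmentation, the lip markers assume a per-frame mouth-openness score supplied by some external lip tracker, and the warp is piecewise-linear with nearest-frame rounding in place of the jitter-minimizing re-sampling scheme.

```python
import numpy as np


def audio_markers(samples, sr, frame_ms=25, hop_ms=10, thresh_ratio=0.1):
    """Speech-segment boundaries (in seconds) from a short-time energy envelope."""
    frame, hop = int(sr * frame_ms / 1000), int(sr * hop_ms / 1000)
    n = 1 + max(0, (len(samples) - frame) // hop)
    energy = np.array([np.sum(samples[i * hop:i * hop + frame] ** 2)
                       for i in range(n)])
    active = energy > thresh_ratio * energy.max()
    edges = np.flatnonzero(np.diff(active.astype(int))) + 1  # rising/falling edges
    return edges * hop / sr


def lip_markers(mouth_openness, fps, thresh=0.5):
    """Open/closed lip-state transition times from per-frame openness scores."""
    state = np.asarray(mouth_openness) > thresh
    edges = np.flatnonzero(np.diff(state.astype(int))) + 1
    return edges / fps


def adaptive_resample(n_frames, fps, src_markers, dst_markers):
    """Map timestamps on the replacement-speech timeline back to source frames.

    Piecewise-linear warping between matched marker pairs. Both marker arrays
    must be increasing, equal in length, and include the clip start and end so
    the interpolation covers the whole timeline.
    """
    dst_t = np.arange(int(round(dst_markers[-1] * fps))) / fps
    src_t = np.interp(dst_t, dst_markers, src_markers)
    return np.clip(np.round(src_t * fps).astype(int), 0, n_frames - 1)
```

The returned indices drive frame selection: playing the original frames in that order yields a video whose mouth movements land on the replacement speech's segment boundaries. Nearest-frame rounding like this would visibly stutter under large tempo changes, which is precisely the motion jitter the paper's adaptive scheme is designed to suppress.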


Notes

  1. The script can be found at http://vis.uky.edu/nsf-autism/speaktome


Acknowledgments

Part of this material is based upon work supported by the National Science Foundation under Grant No. 1237134. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.

Author information


Corresponding author

Correspondence to Ju Shen.


Cite this article

Shen, J., Ti, C., Raghunathan, A. et al. Automatic video self modeling for voice disorder. Multimed Tools Appl 74, 5329–5351 (2015). https://doi.org/10.1007/s11042-014-2015-1

