
1 Introduction

People with acquired voice disorders currently communicate with others mainly in three ways. The first is sign language. The second is writing with pen and paper. The third is using a smartphone or computer as a medium.

However, all three methods have their flaws. The first requires the other party to be proficient in sign language, yet few people are. The second requires both parties to be literate, which is not always the case; moreover, it is inconvenient to set up a writing environment while on the move. The third requires the user to master basic keyboard input, which is not feasible for everyone.

To solve the above problems, we have designed a new interactive solution, LipSpeaker: a system that uses the movements of the user's lips to generate speech. The user simply faces the camera of their smartphone. LipSpeaker uses a facial landmark detector to capture images of the user's lips. With the captured sequence of lip frames as input, a deep neural network generates the text of the user's speech. With LipSpeaker, users with acquired voice disorders can communicate with other people without sign language, literacy, or keyboard input.

2 Related Work

Benefiting from the development of deep neural networks in recent years, the field of lipreading has also advanced greatly. Among the well-known contributions are LipNet [1] by Assael et al. and Lip Reading in the Wild [2] by Chung et al.

Word Error Rate (WER) is an important indicator for training and evaluating lipreading accuracy on the GRID corpus [3] dataset. Compared with the 20.4% WER reported by Wand et al. in Lipreading with Long Short-Term Memory [4] in 2016, LipNet achieves a WER of 4.8%, and Lip Reading in the Wild even reaches 3.0%. Assael et al. also report that deep-neural-network lipreading is 4.1 times more accurate than human lipreading, showing that predicting text from lip movements with deep neural networks is feasible.
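For reference, WER is the word-level edit distance between the predicted and ground-truth sentences divided by the number of ground-truth words. A minimal sketch of the metric (function name and example sentence are our own, in the style of GRID commands):

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# One wrong word out of six -> WER of about 16.7%
print(wer("place red at c zero now", "place red at c one now"))
```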

At this stage, we also use the GRID corpus dataset to train the deep neural network. Text is predicted by feeding the motion sequence frames of the user's lips into the lipreading model, and audio is then synthesized for playback by a Text-To-Speech (TTS) system. However, generating audio this way has a disadvantage: the tone of the user's speech is lost once the lip movements are reduced to plain text, whereas our ultimate goal is to generate emotional audio based on the user's lip movements.

Inspired by Tacotron2 [5] by Jonathan Shen et al., we are therefore trying to make the lipreading deep neural network predict mel-spectrograms instead of text. From the predicted mel-spectrogram, we can then generate emotional audio with WaveNet [6].

3 Implementation

The implementation of LipSpeaker consists of two major phases: a training phase and an evaluation phase. The training phase runs on Ubuntu. We use TensorFlow as the deep learning framework, train the lipreading deep neural network on the GRID corpus dataset, and obtain a well-trained model. In the evaluation phase, we convert the trained model into the format of Apple's deep learning framework Core ML and run it on the phone.
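As an illustration, such a conversion could be done with Apple's coremltools Python package. This is a minimal sketch, not our exact conversion script: the model path, the input name, and the input shape (75 mouth frames of 50x100 pixels with 3 channels) are assumptions, and the exact converter arguments depend on the coremltools version.

```python
import coremltools as ct
import tensorflow as tf

# Hypothetical path to the trained lipreading Keras model.
model = tf.keras.models.load_model("lipnet_grid.h5", compile=False)

# Convert with coremltools' unified converter; the declared input is a
# sequence of 75 cropped mouth frames (shape is an assumption).
mlmodel = ct.convert(
    model,
    inputs=[ct.TensorType(name="lip_frames", shape=(1, 75, 50, 100, 3))],
)
mlmodel.save("LipSpeaker.mlmodel")  # load this model on the phone via Core ML
```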

Since loading and running a trained model on a smartphone involves a relatively large amount of computation, it places high performance requirements on the device. To reduce the computational burden of Text-To-Speech (TTS), we use Apple's AVSpeechSynthesizer, provided by AVFoundation, to generate audio. Compared with a TTS system such as Tacotron2, AVSpeechSynthesizer improves generation speed and reduces computation at the expense of audio fluency and naturalness, which is sufficient at this stage to verify the validity of the system.

To improve lipreading accuracy, we perform mouth detection on the input image during pre-processing and crop the user's mouth region as the network input. Since training runs on a PC while the system actually runs on a smartphone, we use dlib [7] to locate and crop the lips on both, ensuring that the mouth detection results are consistent between training and inference.
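A minimal sketch of such a dlib-based mouth crop; the crop size, margin, and the use of the 68-point landmark model are our own illustrative choices:

```python
import cv2
import dlib

detector = dlib.get_frontal_face_detector()
# 68-point facial landmark model; points 48-67 outline the mouth.
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def crop_mouth(frame_bgr, size=(100, 50), margin=10):
    """Detect the face, locate the mouth landmarks, and return a fixed-size crop."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    faces = detector(gray)
    if not faces:
        return None  # no face found in this frame
    shape = predictor(gray, faces[0])
    xs = [shape.part(i).x for i in range(48, 68)]
    ys = [shape.part(i).y for i in range(48, 68)]
    x0, x1 = max(min(xs) - margin, 0), max(xs) + margin
    y0, y1 = max(min(ys) - margin, 0), max(ys) + margin
    mouth = frame_bgr[y0:y1, x0:x1]
    return cv2.resize(mouth, size)
```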

For the deep neural network model, we adopted a network structure similar to LipNet, since it has three advantages over the structure of Lip Reading in the Wild. First, its overall results are better. Second, its network structure is simpler and requires much less computation. Third, the structure is end-to-end, which is more suitable for running on smartphones.

See Fig. 1 for the specific network architecture. We adopt the Connectionist Temporal Classification (CTC) loss [8] as our loss function and the Adam optimizer for training.
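As an illustration, a LipNet-style network of this kind could be written in tf.keras as follows. This is a sketch rather than the exact network of Fig. 1: the input shape, layer sizes, and kernel shapes are assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

T, H, W, C = 75, 50, 100, 3   # frames, height, width, channels (illustrative)
VOCAB = 28                    # characters plus the CTC blank (assumption)

def build_lipreader():
    inp = layers.Input(shape=(T, H, W, C), name="lip_frames")
    x = inp
    # Spatiotemporal convolution blocks; pooling shrinks space but keeps time.
    for filters in (32, 64, 96):
        x = layers.Conv3D(filters, (3, 5, 5), padding="same", activation="relu")(x)
        x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    # Bidirectional GRUs model the temporal dynamics of the lips.
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    x = layers.Bidirectional(layers.GRU(256, return_sequences=True))(x)
    out = layers.Dense(VOCAB, activation="softmax", name="char_probs")(x)
    return tf.keras.Model(inp, out)

def ctc_loss(y_true, y_pred, input_len, label_len):
    # CTC loss over the per-frame character distributions;
    # the model is then trained with the Adam optimizer.
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)
```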

Fig. 1. Lip reading deep neural network architecture.

The accuracy of the model after training is similar to that of LipNet: the WER is about 7% for overlapped speakers and about 14% for unseen speakers.

4 Future Work

We will verify the validity of LipSpeaker using the Mean Opinion Score (MOS), the same evaluation method used for TTS systems. In this experiment, each group consists of one participant with an acquired voice disorder and one non-disabled participant. The two participants communicate in three ways: pen and paper, keyboard input, and LipSpeaker. Each participant rates each of the three methods on a scale from 1 to 5, so each group produces six scores. Across multiple groups, the MOS of each communication method is calculated and compared to assess the effect of LipSpeaker on both participants with acquired voice disorders and non-disabled participants.
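A minimal sketch of how the six scores per group could be aggregated into a per-method MOS; the score values below are placeholders for illustration only, not experimental results:

```python
from statistics import mean

# Each group contributes six scores: one per method from each of the two
# participants (disorder participant, non-disabled participant).
# These numbers are placeholders, not collected data.
groups = [
    {"pen_paper": (3, 4), "keyboard": (4, 3), "lipspeaker": (5, 4)},
    {"pen_paper": (2, 3), "keyboard": (3, 4), "lipspeaker": (4, 5)},
]

def mos(method: str) -> float:
    """Mean Opinion Score for one communication method across all groups."""
    return mean(score for g in groups for score in g[method])

for method in ("pen_paper", "keyboard", "lipspeaker"):
    print(method, round(mos(method), 2))
```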

As mentioned in Sect. 2, predicting text from lip motion sequence frames loses the features of the user's emotions. We have therefore improved the network structure, inspired by Tacotron2. Tacotron2 is a TTS system that consists of two major parts. The first part uses a deep neural network to predict mel-spectrograms from text sequences. The second part uses the obtained mel-spectrograms to generate audio through another deep neural network, WaveNet.

Since the mel-spectrogram carries richer features than text, audio generated from a mel-spectrogram can better preserve the user's emotions. Therefore, we are trying to predict the mel-spectrogram with a deep neural network using the lip motion sequence frames as input, and then use a trained WaveNet to generate audio with intonation. See Fig. 2 for the specific network architecture.

Fig. 2. The approach of predicting the mel-spectrogram with a deep neural network.
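For training such a network, mel-spectrogram targets could be extracted from the GRID audio tracks. A minimal sketch using librosa; the sample rate, FFT size, and hop length are our own illustrative choices, picked so that mel frames roughly align with the 25 fps video:

```python
import librosa
import numpy as np

def mel_target(wav_path, sr=16000, n_mels=80, hop_length=640):
    """Compute a log-mel-spectrogram target for one GRID utterance.

    hop_length=640 at 16 kHz gives 25 mel frames per second, which we assume
    lines up one-to-one with the 25 fps lip frames (an illustrative choice).
    """
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=1024, hop_length=hop_length, n_mels=n_mels)
    return librosa.power_to_db(mel).T.astype(np.float32)  # shape: (time, n_mels)
```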

Naoki Kimura et al.'s SottoVoce [9] successfully predicted the mel-scale spectrum from ultrasound image sequences of the tongue. Since lip motion sequence frames carry more features than ultrasound image sequences of the tongue, we believe the same approach is feasible for lip motion frames. So far, we have tried to generate mel-spectrograms using 3D Convolutional Neural Networks [10] (3D-CNN); further experiments are still needed to achieve the expected accuracy.
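A minimal sketch of the kind of 3D-CNN regression we are experimenting with: it maps the lip frame sequence to one mel frame per video frame and is trained with an L1 reconstruction loss. The layer sizes, the frame/mel alignment, and the loss choice are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

T, H, W, C, N_MELS = 75, 50, 100, 3, 80  # illustrative shapes

def build_mel_predictor():
    inp = layers.Input(shape=(T, H, W, C), name="lip_frames")
    x = inp
    # 3D convolutions pool over space only, keeping one feature vector per frame.
    for filters in (32, 64, 128):
        x = layers.Conv3D(filters, (3, 3, 3), padding="same", activation="relu")(x)
        x = layers.MaxPool3D(pool_size=(1, 2, 2))(x)
    x = layers.TimeDistributed(layers.Flatten())(x)
    x = layers.TimeDistributed(layers.Dense(256, activation="relu"))(x)
    # One predicted mel frame per video frame (assumes 25 fps alignment as above).
    mel = layers.TimeDistributed(layers.Dense(N_MELS), name="mel_frames")(x)
    model = tf.keras.Model(inp, mel)
    model.compile(optimizer="adam", loss="mae")  # L1 reconstruction loss
    return model
```

The predicted mel-spectrogram would then be passed to a trained WaveNet vocoder to synthesize the waveform, as described above.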

5 Conclusion

The LipSpeaker system we designed shows a new way of human-computer interaction. LipSpeaker uses a deep neural network to predict text by recognizing the lip motions of people with acquired voice disorders, and generates speech from that text in conjunction with TTS. LipSpeaker can help people with acquired voice disorders communicate more easily with others in daily life. The WER of the model reached 7% in the laboratory environment, demonstrating the effectiveness of the method.

At the same time, however, the model is greatly affected by the environment. When lighting conditions are poor or the user's lip images are unclear, accuracy drops dramatically. The likely reason is that the GRID corpus training data was recorded in an environment with consistently sufficient light and with participants always facing the camera frontally. In future work, we will try to add more training data to improve this situation.

Since the mel-spectrogram carries richer features than text, audio generated from a mel-spectrogram can better restore the user's emotions. Inspired by the network structure of Tacotron2, we propose to predict the mel-spectrogram from the lip motion sequence frames and to use WaveNet to generate smoother audio with more intonation. To achieve the expected accuracy, we will conduct further experiments based on the 3D-CNN.