
Development of Mobile Device-Based Speech Enhancement System Using Lip-Reading

Conference paper

Part of the book series: Lecture Notes in Networks and Systems (LNNS, volume 327)

Abstract

We have been developing a speech enhancement device for laryngectomees. Our approach uses lip-reading technology to recognize Japanese words from lip images and generate speech output on a mobile device. The target words are translated into sequences drawn from 36 registered visemes and converted into VAE (Variational Autoencoder) feature parameters; the corresponding words are then recognized with a CNN-based model. Previously, a PC-based experimental prototype was tested on 20 Japanese words with a single well-trained subject, confirming 65% recognition accuracy for the top candidate and 100% when the first and second candidates were included. In this paper, several methods for improving recognition performance were investigated with a larger vocabulary and multiple subjects. After adjusting the speech rate and the mouth movement, we obtained about 60% word recognition accuracy within the first through sixth candidates for inexperienced users. We also developed a mobile device-based prototype and conducted a preliminary recognition experiment on 20 words with a single well-trained subject; 95% accuracy was obtained within the first through sixth candidates, which is almost equivalent to the PC-based system.
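
The page gives no implementation details beyond this description, but the pipeline it outlines (mouth-region frames encoded into VAE latent features, then classified into one of the registered words by a CNN) can be sketched roughly as follows. This is a minimal PyTorch sketch under assumed parameters; the latent size, frame count, image resolution, and every layer shape are illustrative assumptions, not the authors' architecture.

```python
# Rough, hypothetical sketch of the pipeline described in the abstract:
# lip-image frames -> VAE latent features -> CNN word classifier.
# Latent size, frame count, image resolution, and layer shapes are all
# assumptions for illustration, not the authors' implementation.
import torch
import torch.nn as nn

LATENT_DIM = 32   # assumed VAE latent feature size
SEQ_LEN = 30      # assumed number of mouth-region frames per word
NUM_WORDS = 20    # vocabulary size used in the reported experiments


class VAEEncoder(nn.Module):
    """Encodes one 64x64 grayscale mouth-region frame into latent mean/log-variance."""

    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, 4, stride=2, padding=1), nn.ReLU(),   # 64x64 -> 32x32
            nn.Conv2d(16, 32, 4, stride=2, padding=1), nn.ReLU(),  # 32x32 -> 16x16
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(32 * 16 * 16, LATENT_DIM)
        self.fc_logvar = nn.Linear(32 * 16 * 16, LATENT_DIM)

    def forward(self, x):
        h = self.conv(x)
        return self.fc_mu(h), self.fc_logvar(h)


class WordCNN(nn.Module):
    """1D CNN over the per-frame latent features, producing word logits."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(LATENT_DIM, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool1d(2),
            nn.Conv1d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, NUM_WORDS),
        )

    def forward(self, feats):                    # feats: (batch, SEQ_LEN, LATENT_DIM)
        return self.net(feats.transpose(1, 2))   # Conv1d expects (batch, channels, time)


encoder, classifier = VAEEncoder(), WordCNN()
frames = torch.randn(1, SEQ_LEN, 1, 64, 64)      # one utterance of lip images
mu, _ = encoder(frames.flatten(0, 1))            # encode every frame independently
logits = classifier(mu.view(1, SEQ_LEN, LATENT_DIM))
top6 = logits.topk(6, dim=-1).indices            # "1st through 6th candidates"
```

The reported accuracies count a word as correct when it appears among the top candidates, which corresponds to ranking the six highest logits as in the last line.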



Acknowledgments

This study was supported by JSPS Grant-in-Aid for Scientific Research 19K12905.

Author information


Corresponding author

Correspondence to Kenji Matsui.



Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Eguchi, F., Matsui, K., Nakatoh, Y., Kato, Y.O., Rivas, A., Corchado, J.M. (2022). Development of Mobile Device-Based Speech Enhancement System Using Lip-Reading. In: Matsui, K., Omatu, S., Yigitcanlar, T., González, S.R. (eds) Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference. DCAI 2021. Lecture Notes in Networks and Systems, vol 327. Springer, Cham. https://doi.org/10.1007/978-3-030-86261-9_21

