Abstract
We have been developing a speech enhancement device for laryngectomees. Our approach uses lip-reading technology to recognize Japanese words from lip images and generate speech output on mobile devices. Target words are represented as sequences drawn from 36 registered visemes and converted into VAE (Variational Autoencoder) feature parameters; the corresponding words are then recognized by a CNN-based model. Previously, a PC-based experimental prototype was tested with 20 Japanese words and a single well-trained subject, yielding 65% recognition accuracy for the first candidate and 100% when the first and second candidates were included. In this paper, several methods for improving recognition performance were investigated with a larger vocabulary and multiple subjects. After adjusting speech rate and mouth movement, we obtained about 60% word recognition accuracy within the first through sixth candidates for inexperienced users. We also developed a mobile-device-based prototype and conducted a preliminary recognition experiment with 20 words and a single well-trained subject; 95% accuracy was obtained within the first through sixth candidates, which was almost equivalent to the PC-based system.
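The pipeline described above (lip images mapped to a registered viseme sequence, encoded into feature parameters, then matched to word candidates ranked by a classifier) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: the viseme labels, the toy lexicon, the bag-of-visemes "encoder" in place of the VAE, and the distance-based scorer in place of the CNN are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch of the word-recognition pipeline:
# viseme sequence -> feature vector -> ranked top-N word candidates.
# All names and data below are hypothetical stand-ins.

from collections import Counter

# Placeholder inventory standing in for the 36 registered visemes
VISEMES = [f"v{i:02d}" for i in range(36)]

# Hypothetical lexicon: word -> registered viseme sequence
LEXICON = {
    "konnichiwa": ["v01", "v05", "v12", "v05", "v20"],
    "arigatou":   ["v00", "v09", "v01", "v15", "v22"],
    "sayounara":  ["v18", "v00", "v30", "v00", "v09"],
}

def featurize(viseme_seq):
    """Stand-in for the VAE encoder: a bag-of-visemes count vector."""
    counts = Counter(viseme_seq)
    return [counts.get(v, 0) for v in VISEMES]

def score(a, b):
    """Stand-in for the CNN score: negative L1 distance between features."""
    return -sum(abs(x - y) for x, y in zip(a, b))

def recognize(observed_seq, top_n=6):
    """Rank lexicon words against an observed viseme sequence."""
    feat = featurize(observed_seq)
    ranked = sorted(LEXICON,
                    key=lambda w: score(featurize(LEXICON[w]), feat),
                    reverse=True)
    return ranked[:top_n]

# An observed sequence with one misrecognized viseme still ranks the
# intended word first; the top-N list is what the user selects from.
print(recognize(["v01", "v05", "v12", "v05", "v21"]))
```

Returning a ranked candidate list rather than a single word mirrors the evaluation above, where accuracy is reported over the first through sixth candidates and the user can pick the intended word from a short list.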
Acknowledgments
This study was supported by JSPS Grant-in-Aid for Scientific Research 19K12905.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Eguchi, F., Matsui, K., Nakatoh, Y., Kato, Y.O., Rivas, A., Corchado, J.M. (2022). Development of Mobile Device-Based Speech Enhancement System Using Lip-Reading. In: Matsui, K., Omatu, S., Yigitcanlar, T., González, S.R. (eds) Distributed Computing and Artificial Intelligence, Volume 1: 18th International Conference. DCAI 2021. Lecture Notes in Networks and Systems, vol 327. Springer, Cham. https://doi.org/10.1007/978-3-030-86261-9_21
Print ISBN: 978-3-030-86260-2
Online ISBN: 978-3-030-86261-9
eBook Packages: Intelligent Technologies and Robotics