Novel three-axis accelerometer-based silent speech interface using deep neural network

https://doi.org/10.1016/j.engappai.2023.105909

Abstract

Silent speech interfaces (SSIs) have been developed as new non-acoustic communication channels for people with speech impairment. Various modalities have been employed to implement SSIs, including ultrasound imaging, electromagnetic articulography, and surface electromyography. In this study, for the first time, we examined the feasibility of implementing an SSI using accelerometers, which have been widely used to acquire motion-related information in human activity recognition. Five accelerometers were attached to the facial surface of participants to measure speech-induced facial movements. A deep neural network architecture combining a one-dimensional (1D) convolutional neural network and bidirectional long short-term memory (1D CNN-bLSTM) was implemented to decode speech-related information contained in the accelerometer signals. In total, 20 healthy individuals participated in the SSI experiments, wherein they were asked to articulate 40 words consisting of 30 Korean words and 10 English numbers without vocalization. Leave-one-session-out cross-validation was employed to evaluate the classification accuracy of the proposed accelerometer-based SSI. As a result, an average classification accuracy of 95.58 ± 1.83% was achieved with only four accelerometers, which is significantly higher than that of the conventional sEMG-based SSI (89.68 ± 5.27%, p < 0.0005, Wilcoxon signed-rank test). In addition, the proposed SSI achieved an average classification accuracy of 94.65 ± 2.54% in classifying 40 English words spoken silently. These results demonstrate that accelerometers can be a promising modality for implementing SSIs. Considering that accelerometers have multiple advantages over conventional modalities, including non-invasiveness, cost-effectiveness, low power consumption, and portability, accelerometer-based SSIs are expected to provide a novel means of communication to those who cannot generate speech signals.
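
The group-level comparison summarized above (accelerometer-based SSI vs. sEMG-based SSI) is a paired, non-parametric comparison of per-participant accuracies. The snippet below is a minimal sketch of such a Wilcoxon signed-rank test using SciPy; the accuracy values are hypothetical placeholders, not the data reported in this study.

```python
from scipy.stats import wilcoxon

# Hypothetical per-participant mean accuracies (%) for the two SSIs.
# These are illustrative placeholders, not the values measured in this study.
acc_accelerometer = [96.1, 94.8, 97.3, 93.5, 95.9, 96.7, 94.2, 95.0, 97.8, 93.9,
                     96.4, 95.1, 94.6, 97.0, 95.5, 96.9, 93.2, 95.8, 96.0, 94.4]
acc_semg = [90.2, 88.5, 92.1, 85.9, 89.7, 91.3, 87.4, 88.8, 93.0, 86.2,
            90.8, 89.1, 87.9, 92.5, 89.4, 91.6, 84.8, 90.1, 90.5, 88.0]

# Paired, non-parametric test across the 20 participants.
statistic, p_value = wilcoxon(acc_accelerometer, acc_semg)
print(f"Wilcoxon statistic = {statistic}, p = {p_value:.5f}")
```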

Introduction

Speech is the most common and natural form of communication in human society. People express emotions, share information, and convey intentions using their voices. Recently, automatic speech recognition (ASR) technology has allowed people to interact with electronic devices using their voices instead of conventional input devices, such as keyboards, mice, and touchscreens (Graves et al., 2013). With recent advances in ASR, smart devices, such as smartphones and artificial intelligence-based speakers, can be easily controlled by the user's voice, which makes daily life more convenient and safer (Jou et al., 2006, Kapur et al., 2018).

Although speech-based interfaces are becoming an important communication method in modern life, a number of fundamental limitations still exist in conventional speech-based interfaces. For example, acoustic signals are easily affected by ambient noise, resulting in dramatic performance degradation of ASR systems. Additionally, it is difficult to use speech-based interfaces in silent environments, such as public libraries and classrooms, or during military operations. Moreover, conventional speech-based interfaces cannot be used in environments where acoustic signals cannot be transmitted, such as underwater or in space. Most importantly, patients who have trouble vocalizing cannot use conventional speech-based interfaces. Millions of people worldwide lose the ability to produce acoustic signals due to traumatic injuries, laryngectomy, and neurodegeneration (Meltzner et al., 2017).

Owing to the limitations of conventional speech interfaces, silent speech interfaces (SSIs) have received considerable attention as promising non-acoustic communication channels for people with speech impairment (Denby et al., 2010). Speech-related activities generated while a user is silently speaking words are generally used to implement SSIs, and various modalities have been employed to capture these non-acoustic activities. For instance, lip-reading-based SSIs (Wand et al., 2016, Sun et al., 2018) utilize images or videos of mouth movements to estimate the content of the utterance. Electromagnetic articulography (EMA) (Fagan et al., 2008) and ultrasound imaging (Hueber et al., 2010) can extract sufficient information to implement basic speech recognition systems by tracking the movements of the articulatory organs and vocal tract. Recently, surface electromyography (sEMG)-based SSIs have also been actively investigated. Moreover, a series of studies have demonstrated the potential of using brain activities, such as electroencephalography and near-infrared spectroscopy, to implement SSIs (Herff and Schultz, 2016, Rezazadeh Sereshkeh et al., 2019).

Although SSIs have been developed using various modalities, most conventional modalities have limitations for practical use. For example, EMA, permanent magnetic articulography (PMA), and palatography are invasive modalities; therefore, they are unsuitable for general use (Birkholz et al., 2018). Camera-based lip-reading systems require a camera fixed in a position where it can capture the user's face without any obstruction, causing inconvenience to users (Schultz et al., 2017). Moreover, ultrasound imaging systems tend to be bulky and stationary (Sobhani et al., 2016). Although portable ultrasound imaging systems are being developed, they are generally expensive and produce relatively low-quality images (Wang et al., 2019). sEMG is difficult to apply in long-term use cases because the performance of sEMG-based SSI systems tends to deteriorate over time owing to muscle fatigue or changes in the characteristics of the sEMG signals (e.g., impedance changes due to sweat) (He et al., 2015). In addition, many sEMG sensors are required to implement high-performance SSIs. For example, the average accuracy of classifying the ten English numbers from 0 to 9 was only 86% even when 10 optimally placed sEMG sensors were used (Zhu et al., 2021).

To address these issues, we employed accelerometers attached to the face to capture speech-related facial motions. Three-axis accelerometers have been widely used to assess physical body movements, such as acceleration, deceleration, and changes of direction along the x-, y-, and z-axes of the Cartesian coordinate system (de Almeida Mendes et al., 2018). Accelerometers have several advantages over conventional modalities, such as non-invasiveness, cost-effectiveness, energy efficiency, low weight, and high sensitivity to motion-induced signals (Dehzangi and Sahu, 2018, Varanis et al., 2018). Owing to these advantages, accelerometers have been widely employed in various interfacing applications, such as human activity monitoring and driving pattern analysis (Johnson and Trivedi, 2011, Patel et al., 2012, Tong et al., 2020). To the best of our knowledge, no previous study has implemented SSIs using three-axis accelerometers attached to the facial surface.
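
As a concrete illustration of how signals from several three-axis accelerometers can be arranged for classification, the sketch below stacks the x-, y-, and z-axis streams of five facial sensors into one 15-channel trial and applies the simple per-channel z-score normalization described later in this paper. The window length and sampling rate are assumptions made only for this example.

```python
import numpy as np

def stack_and_normalize(sensor_signals):
    """Combine five 3-axis accelerometer recordings into one trial array.

    sensor_signals: list of five arrays, each shaped (timesteps, 3) and holding
    the x-, y-, and z-axis samples of one facial sensor.
    Returns an array shaped (timesteps, 15), z-score normalized per channel.
    """
    trial = np.concatenate(sensor_signals, axis=1)       # (timesteps, 15)
    mean = trial.mean(axis=0, keepdims=True)
    std = trial.std(axis=0, keepdims=True) + 1e-8        # guard against zero variance
    return (trial - mean) / std

# Example with random data standing in for a 2-s word window at an assumed 250 Hz rate.
rng = np.random.default_rng(0)
trial = stack_and_normalize([rng.normal(size=(500, 3)) for _ in range(5)])
print(trial.shape)  # (500, 15)
```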

Recent studies have demonstrated that deep learning-based signal classification approaches can enhance the overall performance of SSIs (Kim et al., 2017, Ji et al., 2018). Convolutional neural networks (CNNs) and long short-term memory (LSTM) networks are the most widely used deep learning architectures and have exhibited excellent performance in automatic feature extraction and time-series signal processing, respectively (Asgher et al., 2020). Recently, modified network architectures have also been proposed to further enhance performance, such as bidirectional LSTM (bLSTM) (Schuster and Paliwal, 1997, Bin et al., 2019), an extension of the standard LSTM that adds a backward layer so that information from both the past and the future can be exploited simultaneously, and CNN-LSTM, which combines a CNN and an LSTM to effectively extract spatial and temporal features (Swapna et al., 2018, Xu et al., 2020, Cai et al., 2019). Based on these studies, we utilized deep learning-based approaches to investigate the feasibility of implementing an accelerometer-based SSI.
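
As a rough illustration of such a combined architecture, the TensorFlow/Keras sketch below places 1D convolutional layers for local feature extraction in front of a bidirectional LSTM for temporal modeling. The layer counts, filter sizes, and hidden units are illustrative assumptions and do not reproduce the exact network used in this study.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_1d_cnn_blstm(n_timesteps=500, n_channels=15, n_classes=40):
    """Illustrative 1D CNN-bLSTM word classifier for multichannel accelerometer windows."""
    inputs = keras.Input(shape=(n_timesteps, n_channels))
    # Convolutional front end: local feature extraction along the time axis.
    x = layers.Conv1D(64, kernel_size=5, padding="same", activation="relu")(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    x = layers.Conv1D(128, kernel_size=5, padding="same", activation="relu")(x)
    x = layers.BatchNormalization()(x)
    x = layers.MaxPooling1D(pool_size=2)(x)
    # Bidirectional LSTM: temporal context from both past and future samples.
    x = layers.Bidirectional(layers.LSTM(128))(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(n_classes, activation="softmax")(x)
    return keras.Model(inputs, outputs)
```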

In this study, we implemented a novel three-axis accelerometer-based SSI based on CNNs and bLSTM for the first time (patent pending; Im et al., 2020). Five accelerometers were attached to the facial surface to record speech-related facial movements while participants silently spoke 40 designated words. A new one-dimensional CNN-bLSTM (1D CNN-bLSTM) deep neural network was proposed for automatic feature extraction and silent speech recognition (SSR). Subsequently, a leave-one-session-out cross-validation (LOSO-CV) strategy was employed to evaluate the performance of the proposed 1D CNN-bLSTM applied to the novel accelerometer-based SSI. The performance of the proposed accelerometer-based SSI was compared to that of an SSI implemented using sEMG, which is one of the most widely used modalities for SSIs.
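
A minimal sketch of a leave-one-session-out evaluation loop is given below. It assumes the trials are grouped by recording session and that a model factory, such as the hypothetical build_1d_cnn_blstm above, is supplied; it is not the authors' evaluation code. The mean and standard deviation returned by the loop correspond to the kind of per-participant accuracies reported later.

```python
import numpy as np

def leave_one_session_out_cv(sessions, build_model, epochs=50, batch_size=32):
    """sessions: list of (X, y) tuples, one per recording session, where X is shaped
    (trials, timesteps, channels) and y holds integer word labels.
    Each session is held out once as the test set while the model is retrained on the rest."""
    accuracies = []
    for test_idx in range(len(sessions)):
        X_test, y_test = sessions[test_idx]
        X_train = np.concatenate([X for i, (X, _) in enumerate(sessions) if i != test_idx])
        y_train = np.concatenate([y for i, (_, y) in enumerate(sessions) if i != test_idx])
        model = build_model()
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        model.fit(X_train, y_train, epochs=epochs, batch_size=batch_size, verbose=0)
        _, test_acc = model.evaluate(X_test, y_test, verbose=0)
        accuracies.append(test_acc)
    return float(np.mean(accuracies)), float(np.std(accuracies))
```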

Section snippets

Related works

In general, the modalities used to implement SSIs can be categorized into invasive and non-invasive modalities. EMA, PMA, electropalatography, and electro-optical stomatography (EOS) are representative invasive modalities for implementing SSIs. Because these invasive modalities are mainly suited to medical applications (Birkholz et al., 2018), only non-invasive modalities are reviewed in the present study.

Among the non-invasive modalities, imaging techniques, such as video imaging and ultrasound imaging, are the

Participants

A total of 20 native Korean adults (10 males and 10 females, aged 24.6 ± 3.19 years) participated in our experiments. None of the participants reported a history of neurological, psychiatric, or other severe diseases that could have influenced the experimental results. Prior to the experiments, all participants were provided with details of the experiments, and written consent was obtained. This study and its experimental protocol were approved by the Institutional Review Board (IRB) of

Individual performance evaluation of accelerometer-based SSI

Fig. 5 illustrates the individual performance of the proposed accelerometer-based SSI system. Gray bars and error bars represent the average classification accuracies and standard deviations, respectively, evaluated using LOSO-CV. The average classification accuracies of all participants were higher than 90%, and half of the participants (i.e., ten participants) exhibited average classification accuracies greater than or equal to 95%. In the classification of 40 silently spoken words with five

Discussion

In this study, we investigated the feasibility of three-axis accelerometer-based SSIs as a novel non-acoustic communication method. Accordingly, we implemented the proposed SSI using five accelerometers attached to the face and employed a 1D CNN-bLSTM model architecture to effectively decode the speech-related information conveyed in the accelerometer signals. The accelerometer signals were recorded while the participants silently spoke 40 words consisting of 30 Korean words and 10 English

Conclusion

In this study, we employed three-axis accelerometers to implement an SSI for the first time. The proposed accelerometer-based SSI showed higher performance with a smaller amount of training data compared to conventional SSIs. Moreover, unlike conventional modalities that require various preprocessing steps based on prior knowledge, the proposed accelerometer-based SSI could be easily realized with a simple z-score normalization and no complicated preprocessing. Our next step is to

CRediT authorship contribution statement

Jinuk Kwon: Conceptualization, Methodology, Software, Validation, Formal analysis, Investigation, Data curation, Visualization, Writing – original draft, Writing – review & editing. Hyerin Nam: Software, Investigation. Younsoo Chae: Software, Investigation. Seungjae Lee: Methodology. In Young Kim: Methodology. Chang-Hwan Im: Conceptualization, Supervision, Validation, Writing – original draft, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Institute for Information & communications Technology Promotion (IITP) funded by the Korea Government, Ministry of Science and ICT (MSIT) under Grant 2020-0-01373 and in part by the Alchemist Brain to X (B2X) Project under Grant 20012355 funded by the Ministry of Trade, Industry and Energy (MOTIE), South Korea.

References (75)

  • Bin, Y., et al., 2019. Describing video with attention-based bidirectional LSTM. IEEE Trans. Cybern.
  • Birkholz, P., et al., 2018. Non-invasive silent phoneme recognition using microwave signals. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Cai, W., Cai, D., Huang, S., Li, M., 2019. Utterance-level End-to-end Language Identification Using Attention-based...
  • Dahl, G.E., Sainath, T.N., Hinton, G.E., 2013. Improving deep neural networks for LVCSR using rectified linear units...
  • Dehzangi, O., Sahu, V., 2018. IMU-Based Robust Human Activity Recognition using Feature Analysis, Extraction, and...
  • Dong, W., Zhang, H., Liu, H., Chen, T., Sun, L., 2019. A Super-Flexible and High-Sensitive Epidermal sEMG Electrode...
  • Duan, L., et al., 2020. Zero-shot learning for EEG classification in motor imagery-based BCI system. IEEE Trans. Neural Syst. Rehabil. Eng.
  • Eid, A.M., et al., 2009. Ultrawideband speech sensing. IEEE Antennas Wirel. Propag. Lett.
  • El-Bialy, R., et al., 2022. Developing phoneme-based lip-reading sentences system for silent speech recognition. CAAI Trans. Intell. Technol.
  • Elbattah, M., et al., 2021. Variational autoencoder for image-based augmentation of eye-tracking data. J. Imaging
  • Eskes, M., et al., 2017. Predicting 3D lip shapes using facial surface EMG. PLoS One
  • Ferreira, D., et al., 2022. Exploring silent speech interfaces based on frequency-modulated continuous-wave radar. Sensors
  • Gonzalez-Lopez, J.A., et al., 2020. Silent speech interfaces for speech restoration: A review. IEEE Access
  • Gosztolya, G., P, Á., Tóth, L., Grósz, T., Markó, A., Csapó, T.G., 2019. Autoencoder-Based Articulatory-to-Acoustic...
  • Graves, A., Mohamed, A., Hinton, G., 2013. Speech recognition with deep recurrent neural networks. In: IEEE...
  • Guo, Z., Liu, P., Yang, J., Hu, Y., 2020. Multivariate time series classification based on MCNN-LSTMS network. In:...
  • He, J., et al., 2015. User adaptation in long-term, open-loop myoelectric training: Implications for EMG pattern recognition in prosthesis control. J. Neural Eng.
  • Herff, C., et al., 2016. Automatic speech recognition from neural signals: A focused review. Front. Neurosci.
  • Hua, S., Wang, C., Xu, B., Zhan, W., 2021. An analysis of sEMG-based gestures classification with different influencing...
  • Hussain, I., et al., 2016. The soft-SixthFinger: A wearable EMG controlled robotic extra-finger for grasp compensation in chronic stroke patients. IEEE Robot. Autom. Lett.
  • Im, C.-H., et al., 2020. Method and apparatus for recognizing silent speech.
  • Ioffe, S., Szegedy, C., 2015. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate...
  • Janke, M., et al., 2017. EMG-to-speech: Direct generation of speech from facial electromyographic signals. IEEE/ACM Trans. Audio Speech Lang. Process.
  • Janke, M., Wand, M., Schultz, T., 2010. Impact of lack of acoustic feedback in EMG-based silent speech recognition. In:...
  • Johnson, D.A., Trivedi, M.M., 2011. Driving style recognition using a smartphone as a sensor platform. In: Proc. 14th...
  • Jong, N.S., et al., 2019. A speech recognition system based on electromyography for the rehabilitation of dysarthric patients: A Thai syllable study. Biocybern. Biomed. Eng.
  • Jose, N., Raj, R., Adithya, P., Sivanadan, K., 2017. Classification of forearm movements from sEMG time domain features...