Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition

Debnath, Saswati; Roy, Pinki

doi:10.1007/s11760-020-01717-0

Original Paper
Published: 11 June 2020

Volume 15, pages 25–32, (2021)

Signal, Image and Video Processing

Abstract

Audio–visual automatic speech recognition (AV-ASR) is an emerging field of research, but there is still a lack of suitable visual features for visual speech recognition. Visual features are mainly categorized as shape based or appearance based. Because shape and appearance features carry different information, this paper proposes a new set of hybrid visual features that leads to a better visual speech recognition system. Pseudo-Zernike Moments (PZM) are calculated as the shape-based visual feature, while Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) and the Discrete Cosine Transform (DCT) are calculated as the appearance-based features. The proposed method also gathers both global and local visual information. The objective of the proposed system is thus to embed all of this visual information into a compact feature set. For audio speech recognition, the proposed system uses Mel-frequency cepstral coefficients (MFCC). We also propose a hybrid classification method to carry out all the AV-ASR experiments. Artificial Neural Network (ANN), multiclass Support Vector Machine (SVM) and Naive Bayes (NB) classifiers are used for classifier hybridization. It is shown that the proposed AV-ASR system with a hybrid classifier significantly improves the recognition rate.
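
As a rough illustration of how appearance-based descriptors of the kind named in the abstract can be combined into one feature vector, the following Python sketch concatenates low-frequency 2D DCT coefficients with a histogram of basic Local Binary Pattern codes for a single grayscale mouth-region patch. It is a minimal sketch under stated assumptions, not the authors' pipeline: the abstract gives no parameters, so the patch, the number of retained coefficients (`keep`) and the function names `dct2`, `lbp_histogram` and `hybrid_features` are all hypothetical. It also uses single-frame LBP rather than LBP-TOP and omits the Pseudo-Zernike shape features entirely.

```python
import math


def dct2(block):
    """Orthonormal 2D DCT-II of a square grayscale block (list of rows)."""
    n = len(block)

    def alpha(k):
        # Normalisation factor of the orthonormal DCT-II.
        return math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)

    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for x in range(n):
                for y in range(n):
                    s += (block[x][y]
                          * math.cos((2 * x + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * y + 1) * v * math.pi / (2 * n)))
            out[u][v] = alpha(u) * alpha(v) * s
    return out


def lbp_histogram(img):
    """Normalised 256-bin histogram of basic 8-neighbour LBP codes.

    Single-frame LBP only; LBP-TOP would repeat this on the XY, XT and YT
    planes of a frame sequence and concatenate the three histograms.
    """
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    hist = [0] * 256
    for r in range(1, len(img) - 1):
        for c in range(1, len(img[0]) - 1):
            code = 0
            for bit, (dr, dc) in enumerate(offsets):
                # Set the bit when the neighbour is at least as bright
                # as the centre pixel.
                if img[r + dr][c + dc] >= img[r][c]:
                    code |= 1 << bit
            hist[code] += 1
    total = sum(hist) or 1
    return [h / total for h in hist]


def hybrid_features(patch, keep=2):
    """Concatenate keep*keep low-frequency DCT coefficients with the LBP histogram."""
    coeffs = dct2(patch)
    low_freq = [coeffs[u][v] for u in range(keep) for v in range(keep)]
    return low_freq + lbp_histogram(patch)


# Hypothetical 4x4 patch: a bright square on a dark background.
patch = [[10, 10, 10, 10],
         [10, 50, 50, 10],
         [10, 50, 50, 10],
         [10, 10, 10, 10]]
features = hybrid_features(patch)
```

Here the DCT part summarises global intensity structure while the LBP histogram captures local texture, which mirrors the global and local information the abstract refers to. In practice the concatenated vector would normally be reduced in dimension before classification.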

Author information

Authors and Affiliations

Department of Computer Science and Engineering, National Institute of Technology, Silchar, Silchar, Assam, India
Saswati Debnath & Pinki Roy

Corresponding author

Correspondence to Saswati Debnath.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Debnath, S., Roy, P. Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition. SIViP 15, 25–32 (2021). https://doi.org/10.1007/s11760-020-01717-0

Received: 30 August 2019
Revised: 09 December 2019
Accepted: 23 May 2020
Published: 11 June 2020
Issue Date: February 2021
DOI: https://doi.org/10.1007/s11760-020-01717-0
