Abstract
Audio–visual automatic speech recognition (AV-ASR) is an emerging field of research, but there is still a lack of suitable visual features for visual speech recognition. Visual features are mainly categorized as shape-based or appearance-based. Because shape and appearance features carry different information, this paper proposes a new set of hybrid visual features that leads to a better visual speech recognition system. Pseudo-Zernike Moments (PZM) are calculated as the shape-based visual feature, while Local Binary Pattern-Three Orthogonal Planes (LBP-TOP) and the Discrete Cosine Transform (DCT) are calculated as the appearance-based features. The proposed method also gathers both global and local visual information. The objective of the proposed system is thus to embed all of this visual information into a compact feature set. For audio speech recognition, the proposed system uses Mel-frequency cepstral coefficients (MFCC). We also propose a hybrid classification method to carry out all the AV-ASR experiments. Artificial Neural Network (ANN), multiclass Support Vector Machine (SVM) and Naive Bayes (NB) classifiers are used for classifier hybridization. It is shown that the proposed AV-ASR system with a hybrid classifier significantly improves the recognition rate.
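As a rough illustration of the DCT-based appearance feature mentioned above, the following minimal sketch computes an orthonormal 2-D DCT-II of a small grayscale patch and keeps a low-frequency block of coefficients as a feature vector. It is not the authors' implementation: the function names (`dct_1d`, `dct_2d`, `low_frequency_features`), the `block` parameter, the choice of the top-left coefficient block, and the idea of applying it to a mouth-region patch are all assumptions made for illustration.

```python
import math


def dct_1d(values):
    """Orthonormal DCT-II of a 1-D sequence of numbers."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(
            v * math.cos(math.pi * (i + 0.5) * k / n)
            for i, v in enumerate(values)
        )
        # Orthonormal scaling: the DC term uses 1/sqrt(n), the rest sqrt(2/n).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * total)
    return out


def dct_2d(image):
    """Separable 2-D DCT-II: transform every row, then every column."""
    rows = [dct_1d(row) for row in image]
    cols = [dct_1d(list(col)) for col in zip(*rows)]
    # cols is indexed [column frequency][row frequency]; transpose back.
    return [list(r) for r in zip(*cols)]


def low_frequency_features(image, block=2):
    """Hypothetical feature selection: keep the top-left block x block
    coefficients, which hold the lowest spatial frequencies."""
    coeffs = dct_2d(image)
    return [coeffs[u][v] for u in range(block) for v in range(block)]


# Toy 4x4 "patch" standing in for a grayscale mouth region (assumed input).
patch = [
    [10.0, 12.0, 11.0, 9.0],
    [14.0, 20.0, 19.0, 13.0],
    [15.0, 22.0, 21.0, 12.0],
    [11.0, 13.0, 12.0, 10.0],
]
features = low_frequency_features(patch, block=2)
```

In a hybrid feature set of the kind described, a vector like `features` would be concatenated with the shape-based and spatiotemporal descriptors; how the coefficients are actually selected and combined in the paper is not stated in this abstract.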
Cite this article
Debnath, S., Roy, P. Appearance and shape-based hybrid visual feature extraction: toward audio–visual automatic speech recognition. SIViP 15, 25–32 (2021). https://doi.org/10.1007/s11760-020-01717-0