Abstract
This paper proposes a unified system for both visual speech recognition and speaker identification. The system can use both image and depth data when they are available. It consists of four consecutive steps: 3D face pose tracking, mouth region extraction, feature computation, and classification with a Support Vector Machine. The system is experimentally evaluated on three public datasets: MIRACL-VC1, OuluVS, and CUAVE. On the one hand, the visual speech recognition module achieves up to 96% and 79.2% in the speaker-dependent and speaker-independent settings, respectively. On the other hand, speaker identification reaches a recognition rate of up to 98.9%. Additionally, the results demonstrate the importance of depth data in resolving the subject dependency issue.
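The difference between the speaker-dependent and speaker-independent settings can be illustrated with a minimal sketch of how evaluation folds might be built. This is not the authors' protocol: the function names, the `(speaker_id, features, label)` sample layout, and the fold count are assumptions made for illustration. The sketch takes speaker-independent to mean that the test speaker is never seen during training, and speaker-dependent to mean that training and test data come from the same speaker.

```python
from collections import defaultdict


def _group_by_speaker(samples):
    """Group (speaker_id, features, label) tuples by speaker_id."""
    by_speaker = defaultdict(list)
    for sample in samples:
        by_speaker[sample[0]].append(sample)
    return by_speaker


def speaker_independent_splits(samples):
    """Yield (held_out_speaker, train, test), holding out one speaker per fold.

    The held-out speaker contributes no samples to the training set.
    """
    by_speaker = _group_by_speaker(samples)
    for held_out in sorted(by_speaker):
        test = by_speaker[held_out]
        train = [s for spk, group in by_speaker.items()
                 if spk != held_out for s in group]
        yield held_out, train, test


def speaker_dependent_splits(samples, n_folds=2):
    """Yield (speaker, train, test) folds built from a single speaker's data.

    n_folds is a hypothetical parameter; every fold stays within one speaker.
    """
    by_speaker = _group_by_speaker(samples)
    for speaker in sorted(by_speaker):
        group = by_speaker[speaker]
        for fold in range(n_folds):
            test = group[fold::n_folds]
            train = [s for i, s in enumerate(group) if i % n_folds != fold]
            yield speaker, train, test
```

In the speaker-independent case, the classifier has to generalize to an unseen face and speaking style, which is consistent with the lower figure the abstract reports for that setting.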
Copyright information
© 2015 Springer International Publishing Switzerland
About this paper
Cite this paper
Rekik, A., Ben-Hamadou, A., Mahdi, W. (2015). Unified System for Visual Speech Recognition and Speaker Identification. In: Battiato, S., Blanc-Talon, J., Gallo, G., Philips, W., Popescu, D., Scheunders, P. (eds) Advanced Concepts for Intelligent Vision Systems. ACIVS 2015. Lecture Notes in Computer Science, vol 9386. Springer, Cham. https://doi.org/10.1007/978-3-319-25903-1_33
DOI: https://doi.org/10.1007/978-3-319-25903-1_33
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-25902-4
Online ISBN: 978-3-319-25903-1
eBook Packages: Computer Science (R0)