Efficient Audio-Visual Speaker Recognition via Deep Heterogeneous Feature Fusion

Liu, Yu-Hang; Liu, Xin; Fan, Wentao; Zhong, Bineng; Du, Ji-Xiang

doi:10.1007/978-3-319-69923-3_62

Conference paper
First Online: 20 October 2017

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 10568)

Abstract

Audio-visual speaker recognition (AVSR) has long been an active research area, primarily because the audio and visual modalities provide complementary information for reliable access control in biometric systems; it remains a challenging problem, mainly owing to its multimodal nature. In this paper, we present an efficient audio-visual speaker recognition approach via deep heterogeneous feature fusion. First, we exploit a dual-branch deep convolutional neural network (CNN) learning framework to extract and fuse the high-level semantic features of face and audio data. Further, by considering the temporal dependency of audio-visual data, we embed the fused features into a bidirectional Long Short-Term Memory (LSTM) network to produce the recognition result, through which speakers acquired under different challenging conditions can be well identified. The experimental results demonstrate the efficiency of our proposed approach in both audio-visual feature fusion and speaker recognition.
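The data flow described in the abstract (two modality branches, feature fusion, then a bidirectional LSTM) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. Several choices are assumptions made only for illustration: each CNN branch is replaced by a single linear map with ReLU, fusion is done by concatenation (the abstract does not say how the features are fused), the LSTM cells omit biases, the per-frame states are averaged over time before a softmax classifier, and all weights are random rather than trained. The names `Recognizer`, `branch`, `fuse`, and `bilstm` are hypothetical.

```python
import math
import random

def rand_matrix(rows, cols, rng):
    """Small random weights; a trained model would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def branch(W, x):
    """Assumed stand-in for one CNN branch: a linear map followed by ReLU."""
    return [max(0.0, v) for v in matvec(W, x)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

class LSTMCell:
    """Standard LSTM cell; biases are omitted to keep the sketch short."""
    def __init__(self, n_in, n_hid, rng):
        self.n_hid = n_hid
        # One weight matrix per gate, applied to [input; previous hidden state].
        self.W = {g: rand_matrix(n_hid, n_in + n_hid, rng) for g in "ifog"}

    def step(self, x, h, c):
        z = x + h  # list concatenation
        i = [sigmoid(v) for v in matvec(self.W["i"], z)]    # input gate
        f = [sigmoid(v) for v in matvec(self.W["f"], z)]    # forget gate
        o = [sigmoid(v) for v in matvec(self.W["o"], z)]    # output gate
        g = [math.tanh(v) for v in matvec(self.W["g"], z)]  # candidate state
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        return h, c

def run_lstm(cell, seq):
    h, c = [0.0] * cell.n_hid, [0.0] * cell.n_hid
    outputs = []
    for x in seq:
        h, c = cell.step(x, h, c)
        outputs.append(h)
    return outputs

def bilstm(fwd, bwd, seq):
    """Run one cell forward in time and one backward, then join per frame."""
    forward = run_lstm(fwd, seq)
    backward = run_lstm(bwd, seq[::-1])[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]

class Recognizer:
    def __init__(self, face_dim, audio_dim, feat_dim, hid_dim, n_speakers, seed=0):
        rng = random.Random(seed)
        self.W_face = rand_matrix(feat_dim, face_dim, rng)
        self.W_audio = rand_matrix(feat_dim, audio_dim, rng)
        self.fwd = LSTMCell(2 * feat_dim, hid_dim, rng)
        self.bwd = LSTMCell(2 * feat_dim, hid_dim, rng)
        self.W_out = rand_matrix(n_speakers, 2 * hid_dim, rng)

    def fuse(self, face_seq, audio_seq):
        # Assumed fusion rule: concatenate the two branch outputs per frame.
        return [branch(self.W_face, f) + branch(self.W_audio, a)
                for f, a in zip(face_seq, audio_seq)]

    def predict(self, face_seq, audio_seq):
        states = bilstm(self.fwd, self.bwd, self.fuse(face_seq, audio_seq))
        # Assumed pooling: average the bidirectional states over time.
        pooled = [sum(col) / len(states) for col in zip(*states)]
        return softmax(matvec(self.W_out, pooled))

# Toy input: 5 time steps, 6-dim face vectors, 4-dim audio vectors (made up).
rng = random.Random(1)
faces = [[rng.random() for _ in range(6)] for _ in range(5)]
audio = [[rng.random() for _ in range(4)] for _ in range(5)]
model = Recognizer(face_dim=6, audio_dim=4, feat_dim=3, hid_dim=2, n_speakers=3)
probs = model.predict(faces, audio)  # one probability per candidate speaker
```

Running the backward cell on the reversed sequence and reversing its outputs again aligns both directions frame by frame, so each fused frame is represented with context from both earlier and later frames.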

Acknowledgment

The work described in this paper was supported by the National Science Foundation of China (No. 61673185, 61502183, 61572205, 61673186), the National Science Foundation of Fujian Province (No. 2017J01112), the Promotion Program for Young and Middle-aged Teacher in Science and Technology Research (No. ZQN-PY309), and the Promotion Program for graduate student in Scientific research and innovation ability of Huaqiao University (No. 1611314014).

Author information

Authors and Affiliations

Department of Computer Science, Huaqiao University, Xiamen, 361021, China
Yu-Hang Liu, Xin Liu, Wentao Fan, Bineng Zhong & Ji-Xiang Du
Xiamen Key Laboratory of Computer Vision and Pattern Recognition, Huaqiao University, Xiamen, 361021, China
Yu-Hang Liu, Xin Liu, Wentao Fan, Bineng Zhong & Ji-Xiang Du

Corresponding author

Correspondence to Xin Liu.

Editor information

Editors and Affiliations

Tsinghua University, Beijing, China
Jie Zhou
Beihang University, Beijing, China
Yunhong Wang
Institute of Automation, Chinese Academy of Sciences, Beijing, China
Zhenan Sun
Computing and Technology, Chinese Academy of Sciences, Beijing, China
Yong Xu
Shenzhen University, Shenzhen, China
Linlin Shen
Tsinghua University, Beijing, China
Jianjiang Feng
Chinese Academy of Sciences, Beijing, China
Shiguang Shan
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China
Yu Qiao
Graduate School at Shenzhen, Tsinghua University, Shenzhen, China
Zhenhua Guo
Shenzhen University, Shenzhen, China
Shiqi Yu

About this paper

Cite this paper

Liu, YH., Liu, X., Fan, W., Zhong, B., Du, JX. (2017). Efficient Audio-Visual Speaker Recognition via Deep Heterogeneous Feature Fusion. In: Zhou, J., et al. Biometric Recognition. CCBR 2017. Lecture Notes in Computer Science, vol 10568. Springer, Cham. https://doi.org/10.1007/978-3-319-69923-3_62

DOI: https://doi.org/10.1007/978-3-319-69923-3_62
Published: 20 October 2017
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-69922-6
Online ISBN: 978-3-319-69923-3
eBook Packages: Computer Science, Computer Science (R0)
