Abstract
Accurate detection of speaker traits has clear benefits for improving speech interfaces, finding useful information in multimedia archives, and for medical applications. Humans infer a variety of traits, robustly and effortlessly, from the available sources of information, which may include vision and gesture in addition to voice. This paper examines techniques for integrating information from multiple sources, which may be broadly categorized into those operating in feature space, model space, score space, and kernel space. Integration in feature space and model space has been studied extensively in the audio-visual literature, so here we focus on score space and kernel space. There are many potential schemes for integration in kernel space, and we examine a particular instance that can integrate both acoustic and lexical information for affect recognition. The example is taken from a widely deployed real-world application. We compare the kernel-based classifier with other competing techniques and demonstrate how it can provide a general and flexible framework for detecting speaker characteristics.
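To make the kernel-space fusion idea concrete, the sketch below combines two per-stream kernels by a weighted sum, which is itself a valid kernel. Everything here is a hypothetical stand-in, not the chapter's actual system: the toy "acoustic" and "lexical" data, the RBF and linear kernels, the weight `w`, and the kernel ridge classifier used in place of an SVM are all illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-ins for two information streams per utterance:
# continuous "acoustic" feature vectors and "lexical" word-count vectors.
rng = np.random.default_rng(0)
n = 40
acoustic = rng.normal(size=(n, 5))
lexical = rng.poisson(2.0, size=(n, 8)).astype(float)
# Toy labels that depend on both streams, so neither alone suffices.
y = np.sign(acoustic[:, 0] + 0.5 * (lexical[:, 0] - 2.0) + 1e-9)

def rbf_kernel(A, B, gamma=0.1):
    # Gaussian (RBF) kernel on the acoustic stream.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def linear_kernel(A, B):
    # Linear kernel on the lexical count vectors.
    return A @ B.T

# Kernel-space fusion: a convex combination of positive semi-definite
# per-stream kernels is again a positive semi-definite kernel.
w = 0.6  # illustrative stream weight
K = w * rbf_kernel(acoustic, acoustic) + (1 - w) * linear_kernel(lexical, lexical)

# A simple kernelized classifier (kernel ridge): solve
# (K + lam*I) alpha = y and predict with sign(K @ alpha).
lam = 1.0
alpha = np.linalg.solve(K + lam * np.eye(n), y)
train_acc = (np.sign(K @ alpha) == y).mean()
```

In practice the combined Gram matrix would be passed to an SVM with a precomputed kernel, and the stream weight could itself be tuned or learned.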
© 2007 Springer-Verlag Berlin Heidelberg
Shafran, I. (2007). Multi-stream Fusion for Speaker Classification. In: Müller, C. (ed.) Speaker Classification I. Lecture Notes in Computer Science, vol. 4343. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74200-5_17
Print ISBN: 978-3-540-74186-2
Online ISBN: 978-3-540-74200-5