Abstract
Accurate detection of speaker traits has clear benefits for improving speech interfaces, finding useful information in multimedia archives, and for medical applications. Humans infer a variety of traits, robustly and effortlessly, from the available sources of information, which may include vision and gesture in addition to voice. This paper examines techniques for integrating information from multiple sources, which may be broadly categorized into those operating in feature space, model space, score space, and kernel space. Integration in feature space and model space has been studied extensively in the audio-visual literature, so here we focus on score space and kernel space. There are many potential schemes for integration in kernel space, and we examine a particular instance that can integrate both acoustic and lexical information for affect recognition. The example is taken from a widely deployed real-world application. We compare the kernel-based classifier with other competing techniques and demonstrate how it can provide a general and flexible framework for detecting speaker characteristics.
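To make the kernel-space fusion idea concrete, the sketch below combines two per-stream kernels by a weighted sum, which is itself a valid kernel. Everything here is a hypothetical stand-in, not the chapter's actual system: the toy "acoustic" and "lexical" data, the RBF and linear kernels, the weight `w`, and the kernel ridge classifier used in place of an SVM are all illustrative assumptions.

```python
import numpy as np

# Hypothetical stand-ins for two information streams per utterance:
# continuous "acoustic" feature vectors and "lexical" word-count vectors.
rng = np.random.default_rng(0)
n = 40
acoustic = rng.normal(size=(n, 5))
lexical = rng.poisson(2.0, size=(n, 8)).astype(float)
# Toy labels that depend on both streams, so neither alone suffices.
y = np.sign(acoustic[:, 0] + 0.5 * (lexical[:, 0] - 2.0) + 1e-9)

def rbf_kernel(A, B, gamma=0.1):
    # Gaussian (RBF) kernel on the acoustic stream.
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def linear_kernel(A, B):
    # Linear kernel on the lexical count vectors.
    return A @ B.T

# Kernel-space fusion: a convex combination of positive semi-definite
# per-stream kernels is again a positive semi-definite kernel.
w = 0.6  # illustrative stream weight
K = w * rbf_kernel(acoustic, acoustic) + (1 - w) * linear_kernel(lexical, lexical)

# A simple kernelized classifier (kernel ridge): solve
# (K + lam*I) alpha = y and predict with sign(K @ alpha).
lam = 1.0
alpha = np.linalg.solve(K + lam * np.eye(n), y)
train_acc = (np.sign(K @ alpha) == y).mean()
```

In practice the combined Gram matrix would be passed to an SVM with a precomputed kernel, and the stream weight could itself be tuned or learned.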
© 2007 Springer-Verlag Berlin Heidelberg
Shafran, I. (2007). Multi-stream Fusion for Speaker Classification. In: Müller, C. (ed.) Speaker Classification I. Lecture Notes in Computer Science, vol. 4343. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74200-5_17
Print ISBN: 978-3-540-74186-2
Online ISBN: 978-3-540-74200-5