
Multi-stream Fusion for Speaker Classification

Chapter in Speaker Classification I

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 4343))

Abstract

Accurate detection of speaker traits has clear benefits for improving speech interfaces, finding useful information in multimedia archives, and for medical applications. Humans infer a variety of traits, robustly and effortlessly, from the available sources of information, which may include vision and gesture in addition to voice. This paper examines techniques for integrating information from multiple sources, which may be broadly categorized into those operating in feature space, model space, score space, and kernel space. Integration in feature space and model space has been studied extensively in the audio-visual literature, so here we focus on score space and kernel space. There are a large number of potential schemes for integration in kernel space, and we examine a particular instance that can integrate both acoustic and lexical information for affect recognition. The example is taken from a widely deployed real-world application. We compare the kernel-based classifier with other competing techniques and demonstrate how it provides a general and flexible framework for detecting speaker characteristics.
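The two fusion categories the abstract focuses on can be illustrated with a small, self-contained sketch. All numbers, weights, and kernel choices below are hypothetical, chosen only for illustration and not taken from the chapter: score-space fusion appears as a weighted sum of per-stream classifier scores, and kernel-space fusion as a convex combination of per-stream kernels.

```python
import math

# Hypothetical per-stream scores for one utterance (illustrative values):
# e.g. log-likelihood ratios from an acoustic model and a lexical model
# for a target affect class.
acoustic_score = 1.2
lexical_score = -0.4

# Score-space fusion: a weighted sum of per-stream scores; in practice
# the weights would be tuned on held-out data.
w_acoustic, w_lexical = 0.7, 0.3
fused_score = w_acoustic * acoustic_score + w_lexical * lexical_score

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel over, say, acoustic feature vectors."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def linear_kernel(x, y):
    """Linear kernel over, say, bag-of-words lexical counts."""
    return sum(a * b for a, b in zip(x, y))

def combined_kernel(u, v, alpha=0.6):
    """Kernel-space fusion: a convex combination of per-stream kernels.

    u and v are (acoustic_features, lexical_counts) pairs. A convex
    combination of positive semi-definite kernels is itself a valid
    kernel, so the result can be fed directly to any kernel classifier
    (e.g. an SVM with a precomputed kernel matrix).
    """
    (xa, xl), (ya, yl) = u, v
    return alpha * rbf_kernel(xa, ya) + (1 - alpha) * linear_kernel(xl, yl)

# Two toy utterances, each a pair of (acoustic features, lexical counts).
u = ([0.1, 0.2], [1, 0, 2])
v = ([0.0, 0.3], [0, 1, 2])
print(fused_score)            # score-space fusion of the two streams
print(combined_kernel(u, v))  # kernel-space fusion of the two streams
```

The kernel-space route has the advantage the abstract alludes to: heterogeneous streams (continuous acoustics, discrete lexical items) only need to agree at the level of a similarity function, not a shared feature representation.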




Editor information

Christian Müller


Copyright information

© 2007 Springer-Verlag Berlin Heidelberg


Cite this chapter

Shafran, I. (2007). Multi-stream Fusion for Speaker Classification. In: Müller, C. (ed.) Speaker Classification I. Lecture Notes in Computer Science, vol 4343. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-74200-5_17


  • DOI: https://doi.org/10.1007/978-3-540-74200-5_17

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-74186-2

  • Online ISBN: 978-3-540-74200-5

