Audio-visual speaker diarization using fisher linear semi-discriminant analysis

Multimedia Tools and Applications

Abstract

Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we apply the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, to both the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments show that the FLsD method boosts the performance of the face diarization task (i.e., the task of discovering faces over time given only the visual signal). The experiments also show that applying FLsD to discriminate between faces is independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, we propose a fusion method that improves performance over the best individual modality, which is the audio signal.
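
The abstract does not spell out how FLsD is computed. As a rough illustration only, the sketch below implements the classical Fisher discriminant projection in NumPy and feeds it pseudo-labels derived from temporal contiguity (neighbouring samples are assumed to come from the same speaker or face). That labeling rule, the function names `fisher_projection` and `contiguous_pseudo_labels`, the block length, and the synthetic data are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def contiguous_pseudo_labels(n_samples, block_len):
    """Assign one pseudo-class per run of `block_len` consecutive samples.

    Assumption: samples close in time share the same (unknown) identity.
    """
    return np.arange(n_samples) // block_len


def fisher_projection(X, labels, n_components):
    """Classical Fisher discriminant directions for the given labels.

    Returns a (d, n_components) matrix whose columns are the leading
    eigenvectors of inv(S_w) @ S_b.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    d = X.shape[1]
    mean_all = X.mean(axis=0)
    S_w = np.zeros((d, d))  # within-class scatter
    S_b = np.zeros((d, d))  # between-class scatter
    for c in np.unique(labels):
        Xc = X[labels == c]
        mean_c = Xc.mean(axis=0)
        centered = Xc - mean_c
        S_w += centered.T @ centered
        diff = (mean_c - mean_all)[:, None]
        S_b += len(Xc) * (diff @ diff.T)
    S_w += 1e-6 * np.eye(d)  # small ridge keeps S_w invertible
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(eigvals.real)[::-1][:n_components]
    return eigvecs[:, order].real


# Synthetic stream: two identities alternate every 100 samples.
# Only dimension 0 separates them; dimensions 1-4 are high-variance noise.
rng = np.random.default_rng(0)
turn_means = [-2.0, 2.0, -2.0, 2.0]
X = rng.normal(0.0, 5.0, size=(400, 5))
X[:, 0] = np.concatenate([rng.normal(m, 0.5, size=100) for m in turn_means])

pseudo = contiguous_pseudo_labels(len(X), block_len=20)
W = fisher_projection(X, pseudo, n_components=1)
projected = X @ W  # 1-D representation, to be clustered afterwards
print("projection weights:", W.ravel())
```

In this toy setting the learned direction concentrates on the one dimension that actually separates the two identities, even though no true labels were used. A real system would cluster the projected samples afterwards; that step is omitted here.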

Author information

Corresponding author

Correspondence to Nikolaos Sarafianos.

About this article

Cite this article

Sarafianos, N., Giannakopoulos, T. & Petridis, S. Audio-visual speaker diarization using fisher linear semi-discriminant analysis. Multimed Tools Appl 75, 115–130 (2016). https://doi.org/10.1007/s11042-014-2274-x

  • DOI: https://doi.org/10.1007/s11042-014-2274-x
