Audio-visual speaker diarization using Fisher linear semi-discriminant analysis

Sarafianos, Nikolaos; Giannakopoulos, Theodoros; Petridis, Sergios

doi:10.1007/s11042-014-2274-x

Published: 28 September 2014

Multimedia Tools and Applications, volume 75, pages 115–130 (2016)

Abstract

Speaker diarization aims to automatically answer the question “who spoke when” given a speech signal. In this work, we apply the FLsD approach, a semi-supervised version of Fisher Linear Discriminant analysis, to both the audio and the video signals to form a complete multimodal speaker diarization system. Extensive experiments show that the FLsD method boosts the performance of the face diarization task (i.e., the task of discovering faces over time given only the visual signal). The experiments also show that the FLsD method, when applied to discriminate between faces, is independent of the initial feature space and remains relatively unaffected as the number of faces increases. Finally, a fusion method is proposed that improves performance over the best individual modality, which is the audio signal.
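
For readers unfamiliar with the underlying technique, the following is a minimal sketch of the standard, fully supervised two-class Fisher linear discriminant, which the abstract names as the basis of FLsD. It is not the authors' FLsD method: the abstract does not describe how the semi-supervised variant obtains its class information, so the labeled toy points, the 2-D feature space, and all function names here are assumptions made purely for illustration.

```python
# Standard two-class Fisher linear discriminant in 2-D, pure Python.
# Illustrative only: toy data and helper names are hypothetical, and this
# is the classical supervised method, not the FLsD variant from the paper.

def mean(points):
    """Component-wise mean of a list of 2-D points."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(2)]

def scatter(points, m):
    """2x2 scatter matrix: sum of outer products of the centered points."""
    s = [[0.0, 0.0], [0.0, 0.0]]
    for p in points:
        d = [p[0] - m[0], p[1] - m[1]]
        for i in range(2):
            for j in range(2):
                s[i][j] += d[i] * d[j]
    return s

def fisher_direction(class_a, class_b):
    """Return w = S_w^{-1} (m_b - m_a), the direction that maximizes
    between-class separation relative to within-class scatter."""
    m_a, m_b = mean(class_a), mean(class_b)
    s_a, s_b = scatter(class_a, m_a), scatter(class_b, m_b)
    s_w = [[s_a[i][j] + s_b[i][j] for j in range(2)] for i in range(2)]
    det = s_w[0][0] * s_w[1][1] - s_w[0][1] * s_w[1][0]
    if det == 0:
        raise ValueError("within-class scatter matrix is singular")
    inv = [[s_w[1][1] / det, -s_w[0][1] / det],
           [-s_w[1][0] / det, s_w[0][0] / det]]
    diff = [m_b[0] - m_a[0], m_b[1] - m_a[1]]
    return [inv[0][0] * diff[0] + inv[0][1] * diff[1],
            inv[1][0] * diff[0] + inv[1][1] * diff[1]]

def project(w, p):
    """Project a 2-D point onto the discriminant direction w."""
    return w[0] * p[0] + w[1] * p[1]

# Hypothetical feature vectors for two speakers (or two faces).
class_a = [(1.0, 2.0), (2.0, 3.0), (3.0, 3.0), (2.0, 1.0)]
class_b = [(6.0, 5.0), (7.0, 8.0), (8.0, 6.0), (7.0, 6.0)]

w = fisher_direction(class_a, class_b)
proj_a = [project(w, p) for p in class_a]
proj_b = [project(w, p) for p in class_b]
```

On this toy data the two classes do not overlap after projection onto `w`, which is the property a subsequent clustering step would rely on. In a diarization setting no such labels are available in advance, which is presumably why a semi-supervised variant is needed.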

Author information

Authors and Affiliations

Computational Intelligence Laboratory (CIL), Institute of Informatics and Telecommunications, National Center for Scientific Research “Demokritos”, Athens, Greece
Nikolaos Sarafianos, Theodoros Giannakopoulos & Sergios Petridis

Corresponding author

Correspondence to Nikolaos Sarafianos.

About this article

Cite this article

Sarafianos, N., Giannakopoulos, T. & Petridis, S. Audio-visual speaker diarization using fisher linear semi-discriminant analysis. Multimed Tools Appl 75, 115–130 (2016). https://doi.org/10.1007/s11042-014-2274-x

Received: 07 April 2014
Revised: 21 July 2014
Accepted: 11 September 2014
Published: 28 September 2014
Issue Date: January 2016
DOI: https://doi.org/10.1007/s11042-014-2274-x
