
Multimodal speaker clustering in full length movies

Published in: Multimedia Tools and Applications

Abstract

Multimodal clustering/diarization addresses the question "who spoke when" by using both audio and visual information. Diarization consists of two steps: first, segmentation of the audio stream and detection of the speech segments; then, clustering of the speech segments to group them by speaker. This task has mainly been studied on audiovisual data from meetings, news broadcasts or talk shows. In this paper, we use visual information to aid speaker clustering and introduce a new video-based feature, called actor presence, that can be used to enhance audio-based speaker clustering. We tested the proposed method on three full-length stereoscopic movies, i.e., a scenario much more difficult than those considered so far, since there is no certainty that speech segments and the video appearances of actors will always overlap. The results show that visual information can improve speaker clustering accuracy and hence the diarization process.
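The two-step pipeline described above (audio-based segment similarity, enhanced by a video-based actor-presence cue, followed by clustering) can be illustrated with a minimal sketch. This is a hypothetical toy example, not the authors' exact method: the similarity matrices are random stand-ins, the fusion weight `alpha` and the use of scikit-learn's spectral clustering are assumptions made purely for demonstration.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

rng = np.random.default_rng(0)
n_segments = 6  # speech segments produced by the audio segmentation step

# Audio-based similarity between speech segments (in practice derived
# from acoustic features); a random symmetric matrix stands in here.
A = rng.random((n_segments, n_segments))
audio_sim = (A + A.T) / 2
np.fill_diagonal(audio_sim, 1.0)

# Actor-presence feature: presence[i, k] = 1 if actor k appears on
# screen during segment i. Actors need not overlap speech exactly,
# which is what makes full-length movies harder than meeting data.
presence = rng.integers(0, 2, size=(n_segments, 3)).astype(float)
visual_sim = presence @ presence.T  # segments sharing actors look alike
if visual_sim.max() > 0:
    visual_sim /= visual_sim.max()

# Weighted fusion of the two modalities (alpha is an arbitrary choice).
alpha = 0.7
fused = alpha * audio_sim + (1 - alpha) * visual_sim

# Cluster the speech segments on the fused affinity matrix.
labels = SpectralClustering(
    n_clusters=2, affinity="precomputed", random_state=0
).fit_predict(fused)
print(labels)  # one speaker-cluster label per speech segment
```

The key design point is that the visual cue enters only as an extra affinity term, so it can sharpen the audio-based grouping without requiring that every speech segment have a matching on-screen appearance.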




Acknowledgments

The research leading to these results has received funding from the European Union Seventh Framework Programme (FP7/2007-2013) under grant agreement number 287674 (3DTVS). This publication reflects only the authors' views. The European Union is not liable for any use that may be made of the information contained therein.

Author information

Corresponding author

Correspondence to I. Kapsouras.


About this article


Cite this article

Kapsouras, I., Tefas, A., Nikolaidis, N. et al. Multimodal speaker clustering in full length movies. Multimed Tools Appl 76, 2223–2242 (2017). https://doi.org/10.1007/s11042-015-3181-5
