
Incomplete-Data-Driven Speaker Segmentation for Diarization Application; A Help-Training Approach

Published in: Circuits, Systems, and Signal Processing

Abstract

This paper presents a new segmentation method for the diarization application. The method is built on a support vector regression (SVR)-based discriminative engine that carries the main burden of estimating the most probable change points. This engine is aided by a generative classifier in a help-training approach. Since no pre-labeled training samples are available in a segmentation task, the proposed model-based segmentation method offers a solution to overcome this obstacle. The introduced iterative method assumes that the initial frames of a given segment belong to the associated speaker. This hypothesis allows the SVR engine to be initialized in the first iteration. In the following iterations, the discriminative regression block, in conjunction with the generative classifier, tags the remaining frames with advantageous (positive) and disadvantageous (negative) labels. These newly labeled frames form the working set used to update the associated speaker model. In addition to the proposed segmentation method, a new strategy is introduced to estimate inserted and deleted change points. In the evaluation section, beyond the common experimental assessment, an attempt is made to provide a comprehensive insight into the statistical aspects of choosing training samples. Finally, a comparison of the proposed segmentation and diarization system with a similar method shows approximately a 22.95% improvement in performance.
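To make the iterative labeling idea concrete, the following is a minimal Python sketch of a help-training loop in the spirit of the abstract: the first frames of a segment are taken as positive seeds, a generative model of the speaker proposes confidently positive and negative frames, and an SVR retrained on the growing working set confirms later proposals. It is an illustration only, not the authors' formulation: the function name, the single-Gaussian speaker model, the percentile-based confidence thresholds, the bootstrapping of negative labels from the generative model alone in the first pass, and the synthetic data are all assumptions.

    # Illustrative help-training sketch (assumptions noted above; not the paper's exact method).
    import numpy as np
    from sklearn.mixture import GaussianMixture
    from sklearn.svm import SVR


    def help_training_labels(frames, n_seed=50, n_iter=5, margin=0.5):
        """Tag the frames of one segment with +1 (speaker) / -1 (non-speaker) labels.

        frames : (n_frames, n_features) acoustic feature matrix for the segment
        n_seed : number of initial frames assumed to belong to the segment's speaker
        """
        labels = np.zeros(len(frames))        # 0 = still unlabeled
        labels[:n_seed] = +1                  # working hypothesis: initial frames are the speaker

        scores = np.full(len(frames), np.nan)
        for _ in range(n_iter):
            pos = labels == +1
            # Generative helper: a diagonal Gaussian model of the currently positive frames.
            gmm = GaussianMixture(n_components=1, covariance_type="diag").fit(frames[pos])
            loglik = gmm.score_samples(frames)
            lo, hi = np.percentile(loglik[pos], [10, 50])   # crude confidence bands (assumption)

            labeled = labels != 0
            if (labels == -1).any():
                # Discriminative engine: SVR regresses the +/-1 labels of the working set.
                svr = SVR(kernel="rbf", C=1.0, epsilon=0.1)
                svr.fit(frames[labeled], labels[labeled])
                scores = svr.predict(frames)

            # Help-training step: tag unlabeled frames only when the generative model is
            # confident and (once available) the SVR score agrees with it.
            for i in np.flatnonzero(labels == 0):
                if loglik[i] >= hi and (np.isnan(scores[i]) or scores[i] > margin):
                    labels[i] = +1            # advantageous (positive) frame
                elif loglik[i] < lo and (np.isnan(scores[i]) or scores[i] < -margin):
                    labels[i] = -1            # disadvantageous (negative) frame

        return labels, scores


    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # Synthetic segment: 200 frames of "speaker A" followed by 200 frames of "speaker B".
        seg = np.vstack([rng.normal(0.0, 1.0, (200, 12)),
                         rng.normal(2.5, 1.0, (200, 12))])
        labels, scores = help_training_labels(seg)
        print("first frame scored as non-speaker:", int(np.argmax(scores < 0)))

In this toy run the SVR score drops below zero near the synthetic speaker change, which is the kind of signal the paper's engine uses to locate change points; the real system additionally handles inserted and deleted change points, which this sketch does not attempt.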

Author information

Correspondence to Farbod Razzazi.

About this article

Cite this article

Teimoori, F., Razzazi, F. Incomplete-Data-Driven Speaker Segmentation for Diarization Application; A Help-Training Approach. Circuits Syst Signal Process 38, 2489–2522 (2019). https://doi.org/10.1007/s00034-018-0974-6
