Abstract
This paper proposes a speaker segmentation algorithm based on a combined-feature approach using a Convolutional Neural Network (CNN). The algorithm addresses speaker segmentation of dialogue speech with partial prior knowledge in the call-center environment. For the first time, Mel-Frequency Cepstral Coefficient (MFCC) features and spectrogram features are combined as the CNN input to train the speakers' voice feature model and to estimate speaker change points. In the experiments, a real database of dialogue speech from the insurance sales and real estate sales industries is used to compare the proposed approach with the Bayesian Information Criterion (BIC) approach using different sets of acoustic features. The results show that overall performance improves and that the proposed algorithm achieves better segmentation than the BIC approach.
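For readers unfamiliar with the BIC approach used as the comparison baseline, the following is a minimal sketch of the commonly used Delta-BIC change-point test. It is not the authors' implementation: it assumes a single one-dimensional feature per frame modelled by a Gaussian (real systems use multivariate features such as MFCC vectors), and the names `log_var`, `delta_bic`, `best_change_point`, the penalty weight `lam`, and the synthetic data are all hypothetical.

```python
import math
import random


def log_var(xs):
    """Log of the maximum-likelihood (biased) variance of a 1-D sequence."""
    mean = sum(xs) / len(xs)
    var = sum((x - mean) ** 2 for x in xs) / len(xs)
    return math.log(max(var, 1e-12))  # floor avoids log(0) on constant input


def delta_bic(xs, t, lam=1.0):
    """Delta-BIC for splitting xs at index t, assuming 1-D Gaussian models.

    Positive values favour the two-model hypothesis, i.e. a change at t.
    """
    n, n1, n2 = len(xs), t, len(xs) - t
    # Likelihood gain of modelling the two halves separately.
    gain = 0.5 * (n * log_var(xs) - n1 * log_var(xs[:t]) - n2 * log_var(xs[t:]))
    # Complexity penalty: for dimension d = 1 the split adds 2 parameters
    # (one extra mean and one extra variance).
    penalty = 0.5 * lam * 2 * math.log(n)
    return gain - penalty


def best_change_point(xs, margin=10, lam=1.0):
    """Return (index, score) of the candidate split with the highest Delta-BIC."""
    candidates = range(margin, len(xs) - margin)
    t = max(candidates, key=lambda i: delta_bic(xs, i, lam))
    return t, delta_bic(xs, t, lam)


# Synthetic stand-in for frame-level features of two speakers (assumption).
rng = random.Random(0)
two_speakers = ([rng.gauss(0.0, 1.0) for _ in range(200)]
                + [rng.gauss(5.0, 1.0) for _ in range(200)])

t, score = best_change_point(two_speakers)
print("estimated change point:", t, "Delta-BIC:", round(score, 2))
```

In this sketch a change point is accepted when the best score is positive; `lam` trades missed changes against false alarms and would need tuning on real data.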
Acknowledgement
This work was supported in part by the National High-tech R&D Program of China (No. 2015AA015308), Social Undertakings and Livelihood Security Science and Technology Innovation Funds of CQ CSTC (No. cstc2017shmsA20013), Frontier and Application Foundation Research Program of CQ CSTC (No. cstc2017jcyjAX0340), National Natural Science Foundation of China (No. 61402020) and Ph.D. Programs Foundation of Ministry of Education of China (No. 20130001120021).
Copyright information
© 2018 Springer International Publishing AG, part of Springer Nature
About this paper
Cite this paper
Zhong, J., Zhang, P., Li, X. (2018). A Combined Feature Approach for Speaker Segmentation Using Convolution Neural Network. In: Zeng, B., Huang, Q., El Saddik, A., Li, H., Jiang, S., Fan, X. (eds) Advances in Multimedia Information Processing – PCM 2017. PCM 2017. Lecture Notes in Computer Science, vol 10736. Springer, Cham. https://doi.org/10.1007/978-3-319-77383-4_54
DOI: https://doi.org/10.1007/978-3-319-77383-4_54
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-77382-7
Online ISBN: 978-3-319-77383-4
eBook Packages: Computer Science (R0)