Abstract
In this paper, we propose speech/music pitch classification based on recurrent neural networks (RNNs) for monaural speech segregation from music interference. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. For speech signals mixed with music, however, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. To remove the music interference effectively, we propose an RNN-based speech/music pitch classification that models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among the various types of RNNs, we chose the simple recurrent network (SRN), long short-term memory (LSTM), and bidirectional LSTM for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods on speech–music mixtures without loss of segregation performance on speech–noise mixtures.
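The idea of a segregation mask modulated by the estimated speech pitch can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: the function name `binary_mask`, the relative tolerance `tol`, and the assumption that each sub-band channel already has a per-frame dominant autocorrelation period are all illustrative choices. A time–frequency unit is kept when its dominant period agrees with the estimated speech pitch period.

```python
# Hypothetical sketch of pitch-modulated sub-band masking
# (not the paper's actual method): keep a time-frequency unit
# when the sub-band's dominant period matches the speech pitch.

def binary_mask(pitch_track_hz, subband_periods_s, tol=0.05):
    """pitch_track_hz: per-frame speech pitch estimates in Hz (0 = unvoiced).
    subband_periods_s[c][t]: dominant autocorrelation period (seconds)
    of channel c at frame t. Returns mask[c][t] in {0, 1}."""
    mask = []
    for channel in subband_periods_s:
        row = []
        for t, period in enumerate(channel):
            f0 = pitch_track_hz[t]
            if f0 <= 0:            # unvoiced frame: discard the unit
                row.append(0)
                continue
            target = 1.0 / f0      # pitch period in seconds
            # keep the unit if the periods agree within a relative tolerance
            row.append(1 if abs(period - target) <= tol * target else 0)
        mask.append(row)
    return mask

pitch = [200.0, 210.0, 0.0]                      # Hz per frame
periods = [[1/200, 1/150, 1/180], [1/210, 1/209, 1/100]]
print(binary_mask(pitch, periods))               # [[1, 0, 0], [1, 1, 0]]
```

When the pitch track is contaminated by music (the failure mode the paper addresses), this mask passes music-dominant units as well, which is why the pitch sequence must first be classified as speech or music before it is used for masking.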
Additional information
This research was supported by Korea Electric Power Corporation (Grant no.:R18XA05) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (Grant no.:NRF-2017M3C1B6071400).
Cite this article
Kim, HG., Jang, GJ., Oh, YH. et al. Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation. J Supercomput 76, 8193–8213 (2020). https://doi.org/10.1007/s11227-019-02785-x