
Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation

The Journal of Supercomputing

Abstract

In this paper, we propose speech/music pitch classification based on recurrent neural networks (RNNs) for monaural speech segregation from music interference. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. For speech signals mixed with music, however, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. To remove the music interference effectively, we propose an RNN-based speech/music pitch classifier that models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or to music. Among the various types of RNNs, we chose the simple recurrent network (SRN), long short-term memory (LSTM), and bidirectional LSTM (BLSTM) for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods on speech–music mixtures without loss of segregation performance on speech–noise mixtures.
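
To make the classification setup concrete, the sketch below shows a bidirectional-LSTM binary classifier over a pitch trajectory in TensorFlow/Keras. This is a minimal illustration, not the authors' implementation: the sequence length, layer sizes, padding convention, and training details are assumptions for the example only.

```python
# Minimal sketch of a BLSTM pitch-trajectory classifier (speech vs. music).
# All hyperparameters, input shapes, and names here are illustrative
# assumptions, not the configuration reported in the paper.
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

SEQ_LEN = 100  # pitch frames per trajectory (assumed)
N_FEATS = 1    # one pitch value (e.g., log-Hz) per frame (assumed)

model = models.Sequential([
    layers.Input(shape=(SEQ_LEN, N_FEATS)),
    # Masking lets variable-length trajectories be zero-padded.
    layers.Masking(mask_value=0.0),
    # Bidirectional LSTM reads the pitch contour in both directions.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.5),                    # regularization
    layers.Dense(1, activation="sigmoid"),  # P(trajectory is speech)
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])

# Toy usage: random pitch trajectories with binary speech/music labels.
x = np.random.rand(32, SEQ_LEN, N_FEATS).astype("float32")
y = np.random.randint(0, 2, size=(32, 1))
model.fit(x, y, epochs=1, verbose=0)
```

Swapping the `Bidirectional(LSTM(64))` layer for `layers.SimpleRNN(64)` or a plain `layers.LSTM(64)` yields the SRN and unidirectional LSTM variants compared in the paper.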





Author information


Corresponding author

Correspondence to Ho-Jin Choi.

Additional information

This research was supported by the Korea Electric Power Corporation (Grant No. R18XA05) and by a National Research Foundation of Korea (NRF) grant funded by the Korean government (MSIP) (Grant No. NRF-2017M3C1B6071400).


About this article


Cite this article

Kim, HG., Jang, GJ., Oh, YH. et al. Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation. J Supercomput 76, 8193–8213 (2020). https://doi.org/10.1007/s11227-019-02785-x
