Abstract
In this paper, we propose speech/music pitch classification based on recurrent neural networks (RNNs) for monaural speech segregation from music interference. The speech segregation methods in this paper exploit sub-band masking to construct segregation masks modulated by the estimated speech pitch. For speech signals mixed with music, however, speech pitch estimation becomes unreliable because speech and music have similar harmonic structures. To remove the music interference effectively, we propose an RNN-based speech/music pitch classification that models the temporal trajectories of speech and music pitch values and determines whether an unknown continuous pitch sequence belongs to speech or music. Among the various types of RNNs, we chose the simple recurrent network (SRN), long short-term memory (LSTM), and bidirectional LSTM for pitch classification. The experimental results show that our proposed method significantly outperforms the baseline methods on speech–music mixtures without loss of segregation performance on speech–noise mixtures.
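The idea of a segregation mask modulated by the estimated speech pitch can be illustrated with a minimal sketch. This is a hypothetical simplification, not the authors' implementation: the function name `binary_mask`, the relative tolerance `tol`, and the assumption that each sub-band channel already has a per-frame dominant autocorrelation period are all illustrative choices. A time–frequency unit is kept when its dominant period agrees with the estimated speech pitch period.

```python
# Hypothetical sketch of pitch-modulated sub-band masking
# (not the paper's actual method): keep a time-frequency unit
# when the sub-band's dominant period matches the speech pitch.

def binary_mask(pitch_track_hz, subband_periods_s, tol=0.05):
    """pitch_track_hz: per-frame speech pitch estimates in Hz (0 = unvoiced).
    subband_periods_s[c][t]: dominant autocorrelation period (seconds)
    of channel c at frame t. Returns mask[c][t] in {0, 1}."""
    mask = []
    for channel in subband_periods_s:
        row = []
        for t, period in enumerate(channel):
            f0 = pitch_track_hz[t]
            if f0 <= 0:            # unvoiced frame: discard the unit
                row.append(0)
                continue
            target = 1.0 / f0      # pitch period in seconds
            # keep the unit if the periods agree within a relative tolerance
            row.append(1 if abs(period - target) <= tol * target else 0)
        mask.append(row)
    return mask

pitch = [200.0, 210.0, 0.0]                      # Hz per frame
periods = [[1/200, 1/150, 1/180], [1/210, 1/209, 1/100]]
print(binary_mask(pitch, periods))               # [[1, 0, 0], [1, 1, 0]]
```

When the pitch track is contaminated by music (the failure mode the paper addresses), this mask passes music-dominant units as well, which is why the pitch sequence must first be classified as speech or music before it is used for masking.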
Additional information
This research was supported by Korea Electric Power Corporation (Grant no.:R18XA05) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIP) (Grant no.:NRF-2017M3C1B6071400).
Cite this article
Kim, HG., Jang, GJ., Oh, YH. et al. Speech and music pitch trajectory classification using recurrent neural networks for monaural speech segregation. J Supercomput 76, 8193–8213 (2020). https://doi.org/10.1007/s11227-019-02785-x