Abstract
This paper presents a new monaural singing voice separation algorithm. This area of signal processing provides important information for applications such as voice recognition, music information retrieval, and singer identification. The proposed approach is based on a sparse and low-rank decomposition model applied to the spectrogram of the singing voice signal: the vocal and non-vocal parts are modeled as the sparse and low-rank components, respectively. An alternating optimization algorithm decomposes the singing voice frames using the sparse representation technique over vocal and non-vocal dictionaries. In addition, a novel voice activity detector based on the energy of the sparse coefficients is introduced to learn atoms related to the non-vocal data in the training step. In the test phase, the learned non-vocal atoms of the instrumental part are updated according to the non-vocal components captured from the test signal using a domain adaptation technique. The proposed dictionary learning process incorporates two coherence measures, atom–data coherence and mutual coherence, to achieve low reconstruction error in learning along with proper separation in the test step. Simulation results using several evaluation measures show that the proposed method yields significantly better results than earlier methods in this context as well as the traditional procedures.
Acknowledgements
The author wishes to thank Professor P. Loizou for making the source code of the fwSegSNR and PESQ objective quality measures publicly available, and Christian D. Sigg for publishing the MATLAB implementation of the LARC algorithm.
Cite this article
Mavaddati, S. A Novel Singing Voice Separation Method Based on a Learnable Decomposition Technique. Circuits Syst Signal Process 39, 3652–3681 (2020). https://doi.org/10.1007/s00034-019-01338-0