Abstract
Recurrent neural networks (RNNs) have been widely used in speech signal processing because of their power in modeling sequential information. While most RNN-based networks operate only at the frame level, we propose a three-way RNN, called TeeRNN, that processes the input along both the time and the feature dimensions. As a result, TeeRNN better explores the relationships between the features within each frame of the encoded speech. As an additional contribution, we also generate a mixture dataset based on LibriSpeech in which the recording devices are mismatched and different noises are included, making the separation task harder.
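The abstract does not specify TeeRNN's exact formulation, but the core idea it states, running recurrences over both the time axis and the feature axis of the encoded speech, can be illustrated with a minimal NumPy sketch. Everything below (the `simple_rnn` helper, the toy tensor shapes, and the random weights) is a hypothetical illustration of that general idea, not the authors' architecture.

```python
import numpy as np

def simple_rnn(x, w_in, w_rec):
    """Run a plain tanh RNN over the first axis of x.
    x: (steps, batch, in_dim) -> returns (steps, batch, hid_dim)."""
    steps, batch, _ = x.shape
    h = np.zeros((batch, w_rec.shape[0]))
    outs = []
    for t in range(steps):
        h = np.tanh(x[t] @ w_in + h @ w_rec)
        outs.append(h)
    return np.stack(outs)

rng = np.random.default_rng(0)
T, F, B, H = 5, 8, 2, 4  # time frames, feature bins, batch, hidden size

spec = rng.standard_normal((T, F, B))  # toy "encoded speech": time x feature x batch

# Pass 1: recur along the time axis (each step sees the F features of one frame).
time_in = spec.transpose(0, 2, 1)                      # (T, B, F)
h_time = simple_rnn(time_in,
                    0.1 * rng.standard_normal((F, H)),
                    0.1 * rng.standard_normal((H, H)))  # (T, B, H)

# Pass 2: recur along the feature axis (each step sees one feature's trajectory
# over all T frames), modeling dependencies between features within frames.
feat_in = spec.transpose(1, 2, 0)                      # (F, B, T)
h_feat = simple_rnn(feat_in,
                    0.1 * rng.standard_normal((T, H)),
                    0.1 * rng.standard_normal((H, H)))  # (F, B, H)

print(h_time.shape, h_feat.shape)
```

A full model would combine (e.g. concatenate or sum) the two passes' outputs before the separation mask is estimated; how TeeRNN's third "way" enters is not recoverable from the abstract alone.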
R. Ma is a student.
© 2020 Springer Nature Switzerland AG
Cite this paper
Ma, R., Xu, S. (2020). TeeRNN: A Three-Way RNN Through Both Time and Feature for Speech Separation. In: Peng, Y., et al. Pattern Recognition and Computer Vision. PRCV 2020. Lecture Notes in Computer Science(), vol 12307. Springer, Cham. https://doi.org/10.1007/978-3-030-60636-7_40
Print ISBN: 978-3-030-60635-0
Online ISBN: 978-3-030-60636-7