Research article · AIPR Conference Proceedings · DOI: 10.1145/3573942.3573975

Solving Size and Performance Dilemma by Reversible and Invertible Recurrent Network for Speech Enhancement

Published: 16 May 2023

Abstract

Reducing the number of parameters while improving system performance is widely regarded as a dilemma: shrinking a model typically degrades its performance, while improving performance usually requires more parameters. To resolve this dilemma, we propose a reversible and invertible recurrent (RAIR) network. First, we construct a reversible dual-path architecture that avoids information loss for two arbitrary functions, F and G: regardless of the choice of F and G, and no matter how small the model is, feature maps pass through the network without any loss of information. Second, we adopt an invertible 1x1 convolution to improve the remixing of channel information. Third, within this reversible architecture we employ a dual-path recurrence (DPR) block, which operates separately along the frequency and time dimensions, as the F function, and a 3x3 convolution as the G function; this reduces the number of parameters dramatically. Although the model is tiny, experiments on Voice Bank + DEMAND show that our reversible and invertible recurrent architecture improves all performance metrics: COVL from 3.57 to 3.78, wideband PESQ from 2.94 to 3.15, and STOI from 0.947 to 0.951. The proposed model achieves state-of-the-art results with only 190K parameters; to the best of our knowledge, it is the state-of-the-art model with the smallest size.
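The reversible coupling and invertible 1x1 convolution described in the abstract can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's implementation: the weight matrices `WF` and `WG` and the placeholder functions `F` and `G` below are hypothetical stand-ins for the DPR block and the 3x3 convolution, chosen only to show that, for the additive-coupling structure, reconstruction is exact no matter what F and G are.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fixed weights for two arbitrary (even non-invertible) functions F and G.
# The additive-coupling structure alone guarantees exact invertibility.
WF = rng.standard_normal((4, 4))
WG = rng.standard_normal((4, 4))
F = lambda x: np.tanh(x @ WF)            # stand-in for the paper's DPR block
G = lambda x: np.maximum(0.0, x @ WG)    # stand-in for the 3x3 convolution

def couple_forward(x1, x2):
    y1 = x1 + F(x2)
    y2 = x2 + G(y1)
    return y1, y2

def couple_inverse(y1, y2):
    x2 = y2 - G(y1)          # undo the second addition first
    x1 = y1 - F(x2)          # then the first
    return x1, x2

# Invertible "1x1 convolution": per-position channel remixing with an
# invertible matrix W, undone exactly with W^{-1}.
W = rng.standard_normal((8, 8)) + 8.0 * np.eye(8)   # well-conditioned
mix = lambda z: z @ W
unmix = lambda z: z @ np.linalg.inv(W)

x = rng.standard_normal((5, 8))          # (time, channels) feature map
x1, x2 = x[:, :4], x[:, 4:]              # split along the channel axis
z = mix(np.concatenate(couple_forward(x1, x2), axis=1))        # forward
x1_rec, x2_rec = couple_inverse(*np.split(unmix(z), 2, axis=1))  # inverse
assert np.allclose(np.concatenate([x1_rec, x2_rec], axis=1), x)
```

Because activations can be reconstructed exactly from the output, such a network loses no information in the forward pass, which is what lets the paper keep performance high while shrinking the parameter count.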



    Published In

    AIPR '22: Proceedings of the 2022 5th International Conference on Artificial Intelligence and Pattern Recognition
    September 2022
    1221 pages
    ISBN:9781450396899
    DOI:10.1145/3573942

Publisher: Association for Computing Machinery, New York, NY, United States


    Author Tags

    1. Invertible Network
    2. Noise Reduction
    3. Reversible Network
    4. Speech Enhancement

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    AIPR 2022
