Hypernetworks Build Implicit Neural Representations of Sounds

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals in various real-life applications, including image super-resolution, image compression, and 3D rendering. Existing methods that leverage INRs predominantly focus on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases built into the architectures of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach that uses hypernetworks to produce INRs for audio samples and generalizes beyond the samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms. Our code is publicly available at https://github.com/WUT-AI/hypersound.

F. Szatkowski and K. J. Piczak—Equal contribution.
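Although the full details are in the paper and the linked repository, the core mechanism named in the abstract can be sketched in a few lines: a hypernetwork consumes a raw waveform and emits the weights of a small coordinate network (the INR), which maps a time coordinate t to an amplitude x(t). The PyTorch sketch below is a minimal illustration of this weight-generation pattern only; the layer sizes, the single-hidden-layer sine INR, and the names HyperNetwork and inr_forward are illustrative assumptions, not the architecture used in the paper or in the WUT-AI/hypersound code.

```python
# Minimal sketch (PyTorch) of a hypernetwork generating an audio INR.
# All sizes and names are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64  # assumed width of the generated INR's hidden layer

class HyperNetwork(nn.Module):
    """Maps a raw waveform to the weights of a tiny INR: t -> sin(...) -> x(t)."""
    def __init__(self, wave_len: int):
        super().__init__()
        # Waveform encoder; a stand-in for a real (e.g. convolutional) encoder.
        self.encoder = nn.Sequential(
            nn.Linear(wave_len, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Heads that emit the generated INR's parameters.
        self.w1 = nn.Linear(256, HIDDEN)  # first-layer weights (scalar input t)
        self.b1 = nn.Linear(256, HIDDEN)  # first-layer biases
        self.w2 = nn.Linear(256, HIDDEN)  # output-layer weights
        self.b2 = nn.Linear(256, 1)       # output-layer bias

    def forward(self, wave: torch.Tensor):
        h = self.encoder(wave)  # (batch, 256)
        return self.w1(h), self.b1(h), self.w2(h), self.b2(h)

def inr_forward(t, w1, b1, w2, b2):
    """Evaluate the generated INRs at time coordinates t of shape (batch, n, 1);
    the sine activation mirrors SIREN-style INRs commonly used for signals."""
    hidden = torch.sin(t * w1.unsqueeze(1) + b1.unsqueeze(1))  # (batch, n, HIDDEN)
    return (hidden * w2.unsqueeze(1)).sum(-1) + b2             # (batch, n)

# Usage: one forward pass yields a dedicated INR per clip; training
# backpropagates a reconstruction loss (plain MSE here; real audio models
# typically add spectral terms) through INR and hypernetwork alike.
batch, n = 8, 16384
wave = torch.randn(batch, n)  # stand-in for a batch of raw audio clips
t = torch.linspace(-1.0, 1.0, n).view(1, n, 1).expand(batch, n, 1)
hyper = HyperNetwork(wave_len=n)
recon = inr_forward(t, *hyper(wave))  # (batch, n)
loss = F.mse_loss(recon, wave)
loss.backward()
```

Because the hypernetwork is trained across many clips, a single trained model can emit an INR for an unseen waveform in one forward pass, which is the meta-learning aspect highlighted in the abstract.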



Acknowledgements

This work was supported by the Foundation for Polish Science under Grant No. POIR.04.04.00-00-14DE/18-00, carried out within the Team-Net program co-financed by the European Union under the European Regional Development Fund, and by the National Centre of Science (Poland) under Grant No. 2020/39/B/ST6/01511. Filip Szatkowski and Tomasz Trzcinski are supported by the National Centre of Science (Poland) under Grant No. 2022/45/B/ST6/02817. Przemysław Spurek is supported by the National Centre of Science (Poland) under Grant No. 2021/43/B/ST6/01456. Jacek Tabor's research has been supported by a grant from the Priority Research Area DigiWorld under the Strategic Programme Excellence Initiative at Jagiellonian University.

Author information

Corresponding author

Correspondence to Filip Szatkowski.


Ethics declarations

Ethical Statement

This paper presents fundamental research, and all experiments were performed using publicly available data. However, we acknowledge that there are potential ethical implications of our work that need to be considered.

As with all machine learning models, biases from the training data can be encoded into the model, leading to inaccurate or discriminatory behavior towards underrepresented groups.

Furthermore, the data reconstructed with Implicit Neural Representations always contains some degree of error, and models trained with different hyperparameters can produce varying representations. These properties make INRs a potential tool to evade copyright detection, and current detection algorithms are not equipped to work on data stored as INR weights, further compounding the issue.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Szatkowski, F., Piczak, K.J., Spurek, P., Tabor, J., Trzciński, T. (2023). Hypernetworks Build Implicit Neural Representations of Sounds. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_39


  • DOI: https://doi.org/10.1007/978-3-031-43421-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
