Hypernetworks Build Implicit Neural Representations of Sounds

  • Conference paper
  • In: Machine Learning and Knowledge Discovery in Databases: Research Track (ECML PKDD 2023)

Abstract

Implicit Neural Representations (INRs) are nowadays used to represent multimedia signals in various real-life applications, including image super-resolution, image compression, and 3D rendering. Existing methods that leverage INRs predominantly focus on visual data, as their application to other modalities, such as audio, is nontrivial due to the inductive biases built into the architectures of image-based INR models. To address this limitation, we introduce HyperSound, the first meta-learning approach that uses hypernetworks to produce INRs for audio samples and generalizes beyond the samples observed in training. Our approach reconstructs audio samples with quality comparable to other state-of-the-art models and provides a viable alternative to contemporary sound representations used in deep neural networks for audio processing, such as spectrograms. Our code is publicly available at https://github.com/WUT-AI/hypersound.

F. Szatkowski and K. J. Piczak—Equal contribution.
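Although the full details are in the paper and the linked repository, the core mechanism named in the abstract can be sketched in a few lines: a hypernetwork consumes a raw waveform and emits the weights of a small coordinate network (the INR), which maps a time coordinate t to an amplitude x(t). The PyTorch sketch below is a minimal illustration of this weight-generation pattern only; the layer sizes, the single-hidden-layer sine INR, and the names HyperNetwork and inr_forward are illustrative assumptions, not the architecture used in the paper or in the WUT-AI/hypersound code.

```python
# Minimal sketch (PyTorch) of a hypernetwork generating an audio INR.
# All sizes and names are illustrative assumptions, not the paper's model.
import torch
import torch.nn as nn
import torch.nn.functional as F

HIDDEN = 64  # assumed width of the generated INR's hidden layer

class HyperNetwork(nn.Module):
    """Maps a raw waveform to the weights of a tiny INR: t -> sin(...) -> x(t)."""
    def __init__(self, wave_len: int):
        super().__init__()
        # Waveform encoder; a stand-in for a real (e.g. convolutional) encoder.
        self.encoder = nn.Sequential(
            nn.Linear(wave_len, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
        )
        # Heads that emit the generated INR's parameters.
        self.w1 = nn.Linear(256, HIDDEN)  # first-layer weights (scalar input t)
        self.b1 = nn.Linear(256, HIDDEN)  # first-layer biases
        self.w2 = nn.Linear(256, HIDDEN)  # output-layer weights
        self.b2 = nn.Linear(256, 1)       # output-layer bias

    def forward(self, wave: torch.Tensor):
        h = self.encoder(wave)  # (batch, 256)
        return self.w1(h), self.b1(h), self.w2(h), self.b2(h)

def inr_forward(t, w1, b1, w2, b2):
    """Evaluate the generated INRs at time coordinates t of shape (batch, n, 1);
    the sine activation mirrors SIREN-style INRs commonly used for signals."""
    hidden = torch.sin(t * w1.unsqueeze(1) + b1.unsqueeze(1))  # (batch, n, HIDDEN)
    return (hidden * w2.unsqueeze(1)).sum(-1) + b2             # (batch, n)

# Usage: one forward pass yields a dedicated INR per clip; training
# backpropagates a reconstruction loss (plain MSE here; real audio models
# typically add spectral terms) through INR and hypernetwork alike.
batch, n = 8, 16384
wave = torch.randn(batch, n)  # stand-in for a batch of raw audio clips
t = torch.linspace(-1.0, 1.0, n).view(1, n, 1).expand(batch, n, 1)
hyper = HyperNetwork(wave_len=n)
recon = inr_forward(t, *hyper(wave))  # (batch, n)
loss = F.mse_loss(recon, wave)
loss.backward()
```

Because the hypernetwork is trained across many clips, a single trained model can emit an INR for an unseen waveform in one forward pass, which is the meta-learning aspect highlighted in the abstract.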



Acknowledgements

This work was supported by the Foundation for Polish Science under Grant No. POIR.04.04.00-00-14DE/18-00, carried out within the Team-Net program co-financed by the European Union under the European Regional Development Fund, and by the National Centre of Science (Poland) under Grant No. 2020/39/B/ST6/01511. Filip Szatkowski and Tomasz Trzcinski are supported by the National Centre of Science (Poland) under Grant No. 2022/45/B/ST6/02817. Przemysław Spurek is supported by the National Centre of Science (Poland) under Grant No. 2021/43/B/ST6/01456. Jacek Tabor's research has been supported by a grant from the Priority Research Area DigiWorld under the Strategic Programme Excellence Initiative at Jagiellonian University.

Author information

Corresponding author

Correspondence to Filip Szatkowski.


Ethics declarations

Ethical Statement

This paper presents fundamental research, and all experiments were performed using publicly available data. However, we acknowledge that there are potential ethical implications of our work that need to be considered.

As with all machine learning models, biases from the training data can be encoded into the model, leading to inaccurate or discriminatory behavior towards underrepresented groups.

Furthermore, the data reconstructed with Implicit Neural Representations always contains some degree of error, and models trained with different hyperparameters can produce varying representations. These properties make INRs a potential tool to evade copyright detection, and current detection algorithms are not equipped to work on data stored as INR weights, further compounding the issue.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Szatkowski, F., Piczak, K.J., Spurek, P., Tabor, J., Trzciński, T. (2023). Hypernetworks Build Implicit Neural Representations of Sounds. In: Koutra, D., Plant, C., Gomez Rodriguez, M., Baralis, E., Bonchi, F. (eds) Machine Learning and Knowledge Discovery in Databases: Research Track. ECML PKDD 2023. Lecture Notes in Computer Science, vol 14172. Springer, Cham. https://doi.org/10.1007/978-3-031-43421-1_39


  • DOI: https://doi.org/10.1007/978-3-031-43421-1_39

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-43420-4

  • Online ISBN: 978-3-031-43421-1

  • eBook Packages: Computer Science, Computer Science (R0)
