Audio style transfer using shallow convolutional networks and random filters

Multimedia Tools and Applications

Abstract

Recently, with the advent of the Convolutional Neural Network (CNN) era, neural style transfer on images has become a very active research topic: the style of one image can be transferred to another through a CNN, so that the resulting image retains its own content while adopting the style of the other image. In this work, we propose an algorithm for audio style transfer that uses the power of CNNs to generate new audio from a style audio. We use the Continuous Wavelet Transform (CWT) to convert the audio into a spectrogram, treat the spectrogram as an image representation of the audio, apply an image style transfer method to obtain a new image, and finally generate audio using iterative phase reconstruction with the Griffin-Lim algorithm. We succeed in transferring audio such as light music, but had difficulty transferring audio that has lyrics or high-level characteristics such as emotion or tone. We propose several measures to improve audio quality, and extensive experimental results show that our method is better than other methods in terms of sound quality.
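The style-matching step in such a pipeline is commonly expressed as a distance between Gram matrices of convolutional features. The sketch below is an illustration only, not the authors' implementation: it assumes a magnitude spectrogram stored as a NumPy array of shape (frequency bins, time frames), a single layer of untrained random 1-D filters applied along time, and hypothetical names and sizes (`random_conv_features`, `gram_matrix`, `style_loss`, 64 filters of width 11) chosen for the example.

```python
import numpy as np


def random_conv_features(spec, n_filters=64, width=11, seed=0):
    """One layer of random 1-D filters along time, followed by ReLU.

    spec: array of shape (n_freq, n_frames); frequency bins act as channels.
    Returns features of shape (n_filters, n_frames - width + 1).
    """
    n_freq, _ = spec.shape
    rng = np.random.default_rng(seed)
    # Untrained random filters (an assumption of this sketch); the scale
    # keeps activations at a moderate magnitude.
    scale = np.sqrt(2.0 / (n_freq * width))
    filters = rng.standard_normal((n_filters, n_freq, width)) * scale
    # Sliding windows over time: shape (n_freq, n_out, width).
    windows = np.lib.stride_tricks.sliding_window_view(spec, width, axis=1)
    # Valid cross-correlation: contract over frequency bins and filter taps.
    feats = np.einsum("kfw,ftw->kt", filters, windows)
    return np.maximum(feats, 0.0)


def gram_matrix(feats):
    """Filter-by-filter correlations, averaged over time."""
    return feats @ feats.T / feats.shape[1]


def style_loss(spec_a, spec_b, **kwargs):
    """Squared Frobenius distance between the two Gram matrices."""
    g_a = gram_matrix(random_conv_features(spec_a, **kwargs))
    g_b = gram_matrix(random_conv_features(spec_b, **kwargs))
    return float(np.sum((g_a - g_b) ** 2))


# Synthetic stand-ins for real spectrograms (random data, illustration only).
rng = np.random.default_rng(1)
style = np.abs(rng.standard_normal((128, 200)))
other = np.abs(rng.standard_normal((128, 200)))

print(style_loss(style, style))      # identical inputs give zero loss
print(style_loss(style, other) > 0)  # different inputs give a positive loss
```

Because both spectrograms pass through the same seeded filters, the loss depends only on the inputs. In a complete system, a loss of this kind would be minimized with respect to the generated spectrogram, and the result would then be converted back to a waveform by a phase-reconstruction method such as Griffin-Lim.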

Acknowledgements

This work was supported by National Natural Science Foundation of China (61503128, 61772179), Science and Technology Plan Project of Hunan Province (2016TP1020), Scientific Research Fund of Hunan Provincial Education Department (16C0226, 17C0223, and 18A333), Scientific Research Fund of Hunan Provincial Key Laboratory of Intelligent Information Processing and Application (IIPA19K05). We would like to thank NVIDIA for the GPU donation.

Author information

Corresponding author

Correspondence to Huihuang Zhao.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Chen, J., Yang, G., Zhao, H. et al. Audio style transfer using shallow convolutional networks and random filters. Multimed Tools Appl 79, 15043–15057 (2020). https://doi.org/10.1007/s11042-020-08798-6

  • DOI: https://doi.org/10.1007/s11042-020-08798-6
