research-article

Music Upscaling Using Convolutional Neural Networks

Authors:

Violet Johnson,

Ian Parberry

SSIP '20: Proceedings of the 2020 3rd International Conference on Sensors, Signal and Image Processing

Pages 58 - 62

https://doi.org/10.1145/3441233.3441240

Published: 06 March 2021

Abstract

Audio upscaling with generative neural networks has been studied in the fields of super-resolution and speech bandwidth expansion. Previous approaches have worked well for speech, but not for music. We propose a convolutional neural network approach for music that combines a novel dilated and residual architecture with an additional refinement method. According to a spectral distance error metric, this approach outperforms the cubic spline baseline when upscaling music.
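The abstract names two architectural ingredients, dilation and residual connections, without giving their details on this page. As a rough illustration only, and not the authors' network, the sketch below shows what a single dilated 1-D convolution with a residual (skip) connection computes, in plain Python. The function names, kernel weights, and dilation rates are hypothetical and chosen purely for the example.

```python
from typing import List, Sequence


def dilated_conv1d(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Same-length 1-D dilated convolution with zero padding.

    Kernel taps are spaced `dilation` samples apart, so a small kernel
    covers a wider span of the input without adding weights.
    """
    center = len(kernel) // 2
    n = len(signal)
    out = []
    for i in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + (j - center) * dilation
            if 0 <= idx < n:  # taps outside the signal count as zero
                acc += w * signal[idx]
        out.append(acc)
    return out


def residual_block(signal: Sequence[float], kernel: Sequence[float],
                   dilation: int) -> List[float]:
    """Residual connection: the block's input is added back to its output."""
    conv = dilated_conv1d(signal, kernel, dilation)
    return [x + c for x, c in zip(signal, conv)]


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 dilated convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)


# Hypothetical toy input: a single impulse in a 7-sample signal.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0]
weights = [0.5, 0.0, 0.5]  # illustrative weights, not learned values

conv_out = dilated_conv1d(impulse, weights, dilation=2)
block_out = residual_block(impulse, weights, dilation=2)
field = receptive_field(kernel_size=3, dilations=[1, 2, 4])
```

With a dilation of 2, the three-tap kernel reaches two samples to either side of each output position, and the residual sum keeps the original impulse in the output alongside the filtered copy. Stacking layers with growing dilation rates widens the receptive field quickly, which is the usual motivation for dilation in raw-audio models.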



Published In


SSIP '20: Proceedings of the 2020 3rd International Conference on Sensors, Signal and Image Processing

October 2020

95 pages

ISBN:9781450388283

DOI:10.1145/3441233

Copyright © 2020 ACM.


Publisher

Association for Computing Machinery

New York, NY, United States


Qualifiers

Research-article
Research
Refereed limited

Conference

SSIP 2020

SSIP 2020: 2020 3rd International Conference on Sensors, Signal and Image Processing

October 9 - 11, 2020

Prague, Czech Republic
