DOI: 10.1145/3664647.3681109

AutoSFX: Automatic Sound Effect Generation for Videos

Published: 28 October 2024

Abstract

Sound effect (SFX) generation aims to automatically produce sound waves for the sounding visual objects in images or videos. Rather than merely learning an automatic solution to this task, we propose a much broader system, AutoSFX, designed to automate sound design for videos in a more efficient and applicable manner. AutoSFX aggregates multimodal representations via cross-attention and leverages a diffusion model to generate sound conditioned on the embedded visual information. It further optimizes the generated sounds to render the entire soundtrack for the input video, performing seamless transitions between sound clips and harmoniously mixing sounds that play simultaneously, which leads to a more immersive and engaging multimedia experience. We have also developed a user-friendly interface for AutoSFX that lets users interactively engage in SFX generation for videos according to their particular needs. To validate the capability of our vision-to-sound generation, we conducted comprehensive experiments and analyses on the widely recognized VEGAS and VGGSound test sets, yielding promising results. We also conducted a user study to evaluate the quality of the optimized soundtrack and the usability of the interface. Overall, the results show that AutoSFX provides a viable soundscape solution for making attractive videos.
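As an illustration of the soundtrack-rendering step described in the abstract, the following Python sketch is our own minimal example, not the authors' implementation: the sample rate, fade length, and all function names are assumptions. It joins two generated clips with an equal-power crossfade and mixes clips that play simultaneously, with simple peak normalization to avoid clipping.

import numpy as np

SR = 16_000  # assumed sample rate of the generated clips


def crossfade(a, b, fade_s=0.25):
    """Join two mono clips with an equal-power crossfade of fade_s seconds."""
    n = min(int(fade_s * SR), len(a), len(b))
    t = np.linspace(0.0, np.pi / 2, n)
    overlap = a[-n:] * np.cos(t) + b[:n] * np.sin(t)  # equal-power fade curves
    return np.concatenate([a[:-n], overlap, b[n:]])


def mix(placed_clips, total_s):
    """Sum clips placed at their start times (in seconds), then peak-limit."""
    out = np.zeros(int(total_s * SR))
    for clip, start in placed_clips:
        i = int(start * SR)
        j = min(i + len(clip), len(out))
        out[i:j] += clip[: j - i]
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # normalize only if the sum would clip


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rain = 0.1 * rng.standard_normal(3 * SR)      # stand-ins for generated SFX clips
    wind = 0.1 * rng.standard_normal(3 * SR)
    ambience = crossfade(rain, wind)              # seamless transition between clips
    footsteps = 0.2 * rng.standard_normal(2 * SR)
    soundtrack = mix([(ambience, 0.0), (footsteps, 1.5)], total_s=6.0)
    print(soundtrack.shape)                       # 6 s of audio at 16 kHz

An equal-power (sine/cosine) crossfade keeps perceived loudness roughly constant across the transition, which is why it is a common default for seamless joins between clips.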



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audiovisual consistency
  2. sound design
  3. sound effect generation

Qualifiers

  • Research-article

Funding Sources

  • China Postdoctoral Science Foundation Fellowship

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

  • MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
  • Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 112
  • Downloads (last 12 months): 112
  • Downloads (last 6 weeks): 65

Reflects downloads up to 28 Feb 2025
