DOI: 10.1145/3664647.3681109

AutoSFX: Automatic Sound Effect Generation for Videos

Published: 28 October 2024

Abstract

Sound effect (SFX) generation aims to automatically produce sound waves for the sounding visual objects in images or videos. Rather than merely learning an automatic solution to this task, we propose a much broader system, AutoSFX, designed to automate sound design for videos in a more efficient and applicable manner. AutoSFX aggregates multimodal representations via cross-attention and leverages a diffusion model to generate sound conditioned on the embedded visual information. It further optimizes the generated sounds to render the entire soundtrack for the input video, performing seamless transitions between sound clips and harmoniously mixing sounds that play simultaneously, which leads to a more immersive and engaging multimedia experience. We have also developed a user-friendly interface for AutoSFX that lets users interactively engage in SFX generation for videos according to their particular needs. To validate the capability of our vision-to-sound generation, we conducted comprehensive experiments and analyses on the widely recognized VEGAS and VGGSound test sets, yielding promising results. We also conducted a user study to evaluate the quality of the optimized soundtrack and the usability of the interface. Overall, the results show that AutoSFX provides a viable soundscape solution for making attractive videos.
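As an illustration of the soundtrack-rendering step described in the abstract, the following Python sketch is our own minimal example, not the authors' implementation: the sample rate, fade length, and all function names are assumptions. It joins two generated clips with an equal-power crossfade and mixes clips that play simultaneously, with simple peak normalization to avoid clipping.

import numpy as np

SR = 16_000  # assumed sample rate of the generated clips


def crossfade(a, b, fade_s=0.25):
    """Join two mono clips with an equal-power crossfade of fade_s seconds."""
    n = min(int(fade_s * SR), len(a), len(b))
    t = np.linspace(0.0, np.pi / 2, n)
    overlap = a[-n:] * np.cos(t) + b[:n] * np.sin(t)  # equal-power fade curves
    return np.concatenate([a[:-n], overlap, b[n:]])


def mix(placed_clips, total_s):
    """Sum clips placed at their start times (in seconds), then peak-limit."""
    out = np.zeros(int(total_s * SR))
    for clip, start in placed_clips:
        i = int(start * SR)
        j = min(i + len(clip), len(out))
        out[i:j] += clip[: j - i]
    peak = np.max(np.abs(out))
    return out / peak if peak > 1.0 else out  # normalize only if the sum would clip


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rain = 0.1 * rng.standard_normal(3 * SR)      # stand-ins for generated SFX clips
    wind = 0.1 * rng.standard_normal(3 * SR)
    ambience = crossfade(rain, wind)              # seamless transition between clips
    footsteps = 0.2 * rng.standard_normal(2 * SR)
    soundtrack = mix([(ambience, 0.0), (footsteps, 1.5)], total_s=6.0)
    print(soundtrack.shape)                       # 6 s of audio at 16 kHz

An equal-power (sine/cosine) crossfade keeps perceived loudness roughly constant across the transition, which is why it is a common default for seamless joins between clips.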



Published In

MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
October 2024
11719 pages
ISBN: 9798400706868
DOI: 10.1145/3664647
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. audiovisual consistency
  2. sound design
  3. sound effect generation

Qualifiers

  • Research-article

Funding Sources

  • China Postdoctoral Science Foundation Fellowship

Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

Acceptance Rates

  • MM '24 paper acceptance rate: 1,150 of 4,385 submissions (26%)
  • Overall acceptance rate: 2,145 of 8,556 submissions (25%)

Article Metrics

  • Total Citations: 0
  • Total Downloads: 112
  • Downloads (last 12 months): 112
  • Downloads (last 6 weeks): 65

Reflects downloads up to 28 Feb 2025
