DOI: 10.1145/3394171.3413572

Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes

Published: 12 October 2020

Abstract

Modal sound synthesis is a physically based sound synthesis method used to generate audio content in games and virtual worlds. We present Deep-Modal, a novel learning-based impact sound synthesis algorithm. Our approach handles sound synthesis for arbitrary common objects, especially dynamically generated objects, in real time. We present a new compact strategy that represents mode data, i.e., frequencies and amplitudes, as fixed-length vectors, combined with a new network architecture that converts shape features of 3D objects into mode data. The network uses an encoder-decoder architecture in which the contact positions of objects and the external forces are embedded. Our method can synthesize interactive sounds for objects of various shapes, materials, and sizes at any contact position. The synthesis process takes only ~0.01 s on a GTX 1080 Ti GPU. We show the effectiveness of Deep-Modal through extensive evaluation using different metrics, including prediction recall and precision, sound spectrograms, and a user study.
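For readers outside graphics audio, the rendering step that consumes the network's output is classical modal synthesis: an impact sound is a sum of exponentially damped sinusoids, one per vibration mode, whose frequencies and initial amplitudes are exactly the mode data the network predicts. Below is a minimal Python sketch of that final step; the function render_impact and the mode values are hypothetical illustrations, and the damping coefficients are assumed given (the abstract specifies only frequency and amplitude).

```python
import numpy as np

# Minimal sketch of modal synthesis, assuming per-mode frequencies (Hz) and
# excitation amplitudes have already been predicted for a given contact
# position and force. `render_impact` and all values below are illustrative,
# not the paper's actual code; damping coefficients are assumed given.
def render_impact(freqs, amps, dampings, duration=1.0, sr=44100):
    """Sum one exponentially damped sinusoid per predicted mode."""
    t = np.arange(int(duration * sr)) / sr              # time axis in seconds
    sound = np.zeros_like(t)
    for f, a, d in zip(freqs, amps, dampings):
        sound += a * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
    peak = np.max(np.abs(sound))
    return sound / peak if peak > 0 else sound          # normalize to [-1, 1]

# Hypothetical mode data for a small struck metal object.
clip = render_impact(freqs=[523.0, 1310.0, 2750.0],
                     amps=[1.0, 0.6, 0.3],
                     dampings=[8.0, 15.0, 30.0])
```

Deep-Modal's contribution sits upstream of this loop: it replaces the expensive generalized eigenvalue decomposition of classical modal analysis with a network that maps shape, contact position, and force directly to the mode vectors, which is what makes the ~0.01 s synthesis time possible.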

Supplementary Material

MP4 File (3394171.3413572.mp4)
Presentation Video for the paper "Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes"




    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. amplitude
    2. dynamic object
    3. frequency
    4. impact
    5. impact sound
    6. neural networks
    7. shape feature
    8. sound synthesis

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development
    • National Natural Science Foundation of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2025) AudioGest: Gesture-Based Interaction for Virtual Reality Using Audio Devices. IEEE Transactions on Visualization and Computer Graphics 31, 2 (Feb 2025), 1569-1581. DOI: 10.1109/TVCG.2024.3397868
    • (2024) SonifyAR: Context-Aware Sound Generation in Augmented Reality. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-13. DOI: 10.1145/3654777.3676406
    • (2024) DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks. ACM SIGGRAPH 2024 Conference Papers, 1-12. DOI: 10.1145/3641519.3657493
    • (2023) ModalNeRF: Neural Modal Analysis and Synthesis for Free-Viewpoint Navigation in Dynamically Vibrating Scenes. Computer Graphics Forum 42, 4 (Jul 2023). DOI: 10.1111/cgf.14888
    • (2023) Rigid-Body Sound Synthesis with Differentiable Modal Resonators. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1-5. DOI: 10.1109/ICASSP49357.2023.10095139
    • (2023) REALIMPACT: A Dataset of Impact Sound Fields for Real Objects. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1516-1525. DOI: 10.1109/CVPR52729.2023.00152
    • (2022) GWA: A Large High-Quality Acoustic Dataset for Audio Processing. ACM SIGGRAPH 2022 Conference Proceedings, 1-9. DOI: 10.1145/3528233.3530731
    • (2022) NeuralSound. ACM Transactions on Graphics 41, 4 (Jul 2022), 1-15. DOI: 10.1145/3528223.3530184
    • (2022) ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10588-10598. DOI: 10.1109/CVPR52688.2022.01034
    • (2021) One-to-Many Conversion for Percussive Samples. 2021 24th International Conference on Digital Audio Effects (DAFx), 129-135. DOI: 10.23919/DAFx51585.2021.9768256
