DOI: 10.1145/3394171.3413572

Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes

Published: 12 October 2020

Abstract

Modal sound synthesis is a physically based sound synthesis method used to generate audio content in games and virtual worlds. We present Deep-Modal, a novel learning-based impact sound synthesis algorithm. Our approach handles sound synthesis for arbitrary common objects, especially dynamically generated objects, in real time. We present a new compact strategy that represents mode data, i.e., frequencies and amplitudes, as fixed-length vectors, combined with a new network architecture that converts shape features of 3D objects into mode data. The network uses an encoder-decoder architecture in which the contact positions of objects and the external forces are embedded. Our method can synthesize interactive sounds for objects of various shapes, materials, and sizes at any contact position. The synthesis process takes only ~0.01 s on a GTX 1080 Ti GPU. We show the effectiveness of Deep-Modal through extensive evaluation using different metrics, including prediction recall and precision, sound spectrograms, and a user study.
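For readers outside graphics audio, the rendering step that consumes the network's output is classical modal synthesis: an impact sound is a sum of exponentially damped sinusoids, one per vibration mode, whose frequencies and initial amplitudes are exactly the mode data the network predicts. Below is a minimal Python sketch of that final step; the function render_impact and the mode values are hypothetical illustrations, and the damping coefficients are assumed given (the abstract specifies only frequency and amplitude).

```python
import numpy as np

# Minimal sketch of modal synthesis, assuming per-mode frequencies (Hz) and
# excitation amplitudes have already been predicted for a given contact
# position and force. `render_impact` and all values below are illustrative,
# not the paper's actual code; damping coefficients are assumed given.
def render_impact(freqs, amps, dampings, duration=1.0, sr=44100):
    """Sum one exponentially damped sinusoid per predicted mode."""
    t = np.arange(int(duration * sr)) / sr              # time axis in seconds
    sound = np.zeros_like(t)
    for f, a, d in zip(freqs, amps, dampings):
        sound += a * np.exp(-d * t) * np.sin(2.0 * np.pi * f * t)
    peak = np.max(np.abs(sound))
    return sound / peak if peak > 0 else sound          # normalize to [-1, 1]

# Hypothetical mode data for a small struck metal object.
clip = render_impact(freqs=[523.0, 1310.0, 2750.0],
                     amps=[1.0, 0.6, 0.3],
                     dampings=[8.0, 15.0, 30.0])
```

Deep-Modal's contribution sits upstream of this loop: it replaces the expensive generalized eigenvalue decomposition of classical modal analysis with a network that maps shape, contact position, and force directly to the mode vectors, which is what makes the ~0.01 s synthesis time possible.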

Supplementary Material

MP4 File (3394171.3413572.mp4)
Presentation Video for the paper "Deep-Modal: Real-Time Impact Sound Synthesis for Arbitrary Shapes"




    Published In

    MM '20: Proceedings of the 28th ACM International Conference on Multimedia
    October 2020
    4889 pages
    ISBN: 9781450379885
    DOI: 10.1145/3394171

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. amplitude
    2. dynamic object
    3. frequency
    4. impact
    5. impact sound
    6. neural networks
    7. shape feature
    8. sound synthesis

    Qualifiers

    • Research-article

    Funding Sources

    • National Key Research and Development
    • National Natural Science Foundation of China

    Conference

    MM '20

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

    Cited By

    • (2025) AudioGest: Gesture-Based Interaction for Virtual Reality Using Audio Devices. IEEE Transactions on Visualization and Computer Graphics 31, 2 (Feb 2025), 1569-1581. DOI: 10.1109/TVCG.2024.3397868
    • (2024) SonifyAR: Context-Aware Sound Generation in Augmented Reality. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology, 1-13. DOI: 10.1145/3654777.3676406
    • (2024) DiffSound: Differentiable Modal Sound Rendering and Inverse Rendering for Diverse Inference Tasks. ACM SIGGRAPH 2024 Conference Papers, 1-12. DOI: 10.1145/3641519.3657493
    • (2023) ModalNeRF: Neural Modal Analysis and Synthesis for Free-Viewpoint Navigation in Dynamically Vibrating Scenes. Computer Graphics Forum 42, 4 (Jul 2023). DOI: 10.1111/cgf.14888
    • (2023) Rigid-Body Sound Synthesis with Differentiable Modal Resonators. ICASSP 2023 - IEEE International Conference on Acoustics, Speech and Signal Processing, 1-5. DOI: 10.1109/ICASSP49357.2023.10095139
    • (2023) REALIMPACT: A Dataset of Impact Sound Fields for Real Objects. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 1516-1525. DOI: 10.1109/CVPR52729.2023.00152
    • (2022) GWA: A Large High-Quality Acoustic Dataset for Audio Processing. ACM SIGGRAPH 2022 Conference Proceedings, 1-9. DOI: 10.1145/3528233.3530731
    • (2022) NeuralSound. ACM Transactions on Graphics 41, 4 (Jul 2022), 1-15. DOI: 10.1145/3528223.3530184
    • (2022) ObjectFolder 2.0: A Multisensory Object Dataset for Sim2Real Transfer. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10588-10598. DOI: 10.1109/CVPR52688.2022.01034
    • (2021) One-to-Many Conversion for Percussive Samples. 2021 24th International Conference on Digital Audio Effects (DAFx), 129-135. DOI: 10.23919/DAFx51585.2021.9768256
