DOI: 10.1145/3206025.3206067

Class-aware Self-Attention for Audio Event Recognition

Published: 05 June 2018

Abstract

Audio event recognition (AER) has been an important research problem with a wide range of applications. However, it is very challenging to develop large-scale audio event recognition models. On the one hand, usually only "weakly" labeled audio training data are available, which contain labels of audio events without temporal boundaries. On the other hand, the distribution of audio events is generally long-tailed, with only a few positive samples for a large number of audio events. These two issues make it hard to learn discriminative acoustic features to recognize audio events, especially long-tailed ones. In this paper, we propose a novel class-aware self-attention mechanism with attention factor sharing to generate discriminative clip-level features for audio event recognition. Since a target audio event occurs in only part of an entire audio clip and its temporal interval varies, the proposed class-aware self-attention approach learns to highlight relevant temporal intervals while suppressing irrelevant noise. To learn attention patterns effectively for long-tailed events, we combine domain knowledge and data-driven strategies to share attention factors in the proposed attention mechanism, which transfers common knowledge learned from similar events to rare events. The proposed attention mechanism is a pluggable component and can be trained end-to-end within the overall AER model. We evaluate our model on the large-scale audio event corpus "Audio Set" with both short-term and long-term acoustic features. The experimental results demonstrate the effectiveness of our model, which improves overall audio event recognition performance with different acoustic features, especially for low-resource events. Moreover, the experiments also show that our model can learn new audio events effectively and efficiently from only a few training examples, without disturbing previously learned audio events.
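The abstract describes per-class attention over the temporal axis of a clip, with attention factors shared across groups of similar events. As a rough illustration only (not the authors' implementation), the sketch below shows one way class-aware attention pooling with shared attention factors could be wired up in PyTorch; the module name, the fixed class-to-group mapping, and all dimensions are assumptions made for the example.

```python
# Hedged sketch of class-aware attention pooling with shared attention factors.
# Assumes frame-level features of shape (batch, time, feat_dim) and a fixed
# mapping from event classes to groups that share an attention scorer.
import torch
import torch.nn as nn


class ClassAwareAttentionPooling(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, feat_dim, num_classes, class_to_group):
        super().__init__()
        num_groups = max(class_to_group) + 1
        # One attention scorer per group of similar classes (the shared factors).
        self.attn = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_groups))
        # Classifier applied to the per-class attended clip-level feature.
        self.cls = nn.Linear(feat_dim, num_classes)
        self.class_to_group = class_to_group
        self.num_classes = num_classes

    def forward(self, frames):
        # frames: (batch, time, feat_dim)
        logits = []
        for c in range(self.num_classes):
            scorer = self.attn[self.class_to_group[c]]
            # Class-specific attention weights over time highlight relevant intervals.
            weights = torch.softmax(scorer(frames).squeeze(-1), dim=1)   # (batch, time)
            clip_feat = torch.einsum("bt,btd->bd", weights, frames)      # attended clip feature
            logits.append(self.cls(clip_feat)[:, c])
        return torch.stack(logits, dim=1)  # (batch, num_classes) event logits


# Example: 8 classes; classes {0,1,2} and {3,4} share attention factors, the rest are separate.
model = ClassAwareAttentionPooling(feat_dim=128, num_classes=8,
                                   class_to_group=[0, 0, 0, 1, 1, 2, 3, 4])
scores = torch.sigmoid(model(torch.randn(4, 100, 128)))  # multi-label clip-level event scores
```

In this sketch the sharing is a hard, predefined grouping; the paper combines domain knowledge with data-driven strategies to decide how factors are shared, so the grouping above stands in only for the general idea.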


Published In

ICMR '18: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
June 2018
550 pages
ISBN:9781450350464
DOI:10.1145/3206025
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2018

Badges

  • Honorable Mention

Author Tags

  1. attention factor sharing
  2. audio event recognition
  3. class-aware self-attention

Qualifiers

  • Research-article

Conference

ICMR '18

Acceptance Rates

ICMR '18 paper acceptance rate: 44 of 136 submissions (32%)
Overall acceptance rate: 254 of 830 submissions (31%)

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 1

Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) A fusion analytic framework for investigating functional brain connectivity differences using resting-state fMRI. Frontiers in Neuroscience, 18. DOI: 10.3389/fnins.2024.1402657. Online publication date: 11-Dec-2024.
  • (2021) PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 29, 3292-3306. DOI: 10.1109/TASLP.2021.3120633. Online publication date: 15-Oct-2021.
  • (2020) A sequential self teaching approach for improving generalization in sound event recognition. Proceedings of the 37th International Conference on Machine Learning, 5447-5457. DOI: 10.5555/3524938.3525443. Online publication date: 13-Jul-2020.
  • (2020) At the Speed of Sound: Efficient Audio Scene Classification. Proceedings of the 2020 International Conference on Multimedia Retrieval, 301-305. DOI: 10.1145/3372278.3390730. Online publication date: 8-Jun-2020.
  • (2019) Self-supervised Attention Model for Weakly Labeled Audio Event Classification. 2019 27th European Signal Processing Conference (EUSIPCO), 1-5. DOI: 10.23919/EUSIPCO.2019.8902567. Online publication date: Sep-2019.
  • (2019) Visual Relation Detection with Multi-Level Attention. Proceedings of the 27th ACM International Conference on Multimedia, 121-129. DOI: 10.1145/3343031.3350962. Online publication date: 15-Oct-2019.
  • (2019) Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection. 2019 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN.2019.8852143. Online publication date: Jul-2019.
  • (2019) A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31-35. DOI: 10.1109/ICASSP.2019.8682847. Online publication date: May-2019.
