DOI: 10.1145/3206025.3206067

Class-aware Self-Attention for Audio Event Recognition

Published: 05 June 2018

Abstract

Audio event recognition (AER) has been an important research problem with a wide range of applications. However, it is very challenging to develop large-scale audio event recognition models. On the one hand, usually only "weakly" labeled audio training data are available, which contain labels of audio events without temporal boundaries. On the other hand, the distribution of audio events is generally long-tailed, with only a few positive samples for a large number of audio events. These two issues make it hard to learn discriminative acoustic features to recognize audio events, especially long-tailed ones. In this paper, we propose a novel class-aware self-attention mechanism with attention factor sharing to generate discriminative clip-level features for audio event recognition. Since a target audio event occurs in only part of an entire audio clip and its temporal interval varies, the proposed class-aware self-attention approach learns to highlight relevant temporal intervals while suppressing irrelevant noise. To learn attention patterns effectively for long-tailed events, we combine domain knowledge and data-driven strategies to share attention factors in the proposed attention mechanism, which transfers common knowledge learned from similar events to rare events. The proposed attention mechanism is a pluggable component and can be trained end-to-end within the overall AER model. We evaluate our model on the large-scale audio event corpus "Audio Set" with both short-term and long-term acoustic features. The experimental results demonstrate the effectiveness of our model, which improves overall audio event recognition performance with different acoustic features, especially for low-resource events. Moreover, the experiments also show that our model can learn new audio events effectively and efficiently from only a few training examples, without disturbing previously learned audio events.
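The abstract describes per-class attention over the temporal axis of a clip, with attention factors shared across groups of similar events. As a rough illustration only (not the authors' implementation), the sketch below shows one way class-aware attention pooling with shared attention factors could be wired up in PyTorch; the module name, the fixed class-to-group mapping, and all dimensions are assumptions made for the example.

```python
# Hedged sketch of class-aware attention pooling with shared attention factors.
# Assumes frame-level features of shape (batch, time, feat_dim) and a fixed
# mapping from event classes to groups that share an attention scorer.
import torch
import torch.nn as nn


class ClassAwareAttentionPooling(nn.Module):  # hypothetical name, not from the paper
    def __init__(self, feat_dim, num_classes, class_to_group):
        super().__init__()
        num_groups = max(class_to_group) + 1
        # One attention scorer per group of similar classes (the shared factors).
        self.attn = nn.ModuleList(nn.Linear(feat_dim, 1) for _ in range(num_groups))
        # Classifier applied to the per-class attended clip-level feature.
        self.cls = nn.Linear(feat_dim, num_classes)
        self.class_to_group = class_to_group
        self.num_classes = num_classes

    def forward(self, frames):
        # frames: (batch, time, feat_dim)
        logits = []
        for c in range(self.num_classes):
            scorer = self.attn[self.class_to_group[c]]
            # Class-specific attention weights over time highlight relevant intervals.
            weights = torch.softmax(scorer(frames).squeeze(-1), dim=1)   # (batch, time)
            clip_feat = torch.einsum("bt,btd->bd", weights, frames)      # attended clip feature
            logits.append(self.cls(clip_feat)[:, c])
        return torch.stack(logits, dim=1)  # (batch, num_classes) event logits


# Example: 8 classes; classes {0,1,2} and {3,4} share attention factors, the rest are separate.
model = ClassAwareAttentionPooling(feat_dim=128, num_classes=8,
                                   class_to_group=[0, 0, 0, 1, 1, 2, 3, 4])
scores = torch.sigmoid(model(torch.randn(4, 100, 128)))  # multi-label clip-level event scores
```

In this sketch the sharing is a hard, predefined grouping; the paper combines domain knowledge with data-driven strategies to decide how factors are shared, so the grouping above stands in only for the general idea.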


Published In

ICMR '18: Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval
June 2018
550 pages
ISBN:9781450350464
DOI:10.1145/3206025
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 05 June 2018

Badges

  • Honorable Mention

Author Tags

  1. attention factor sharing
  2. audio event recognition
  3. class-aware self-attention

Qualifiers

  • Research-article

Conference

ICMR '18

Acceptance Rates

ICMR '18 paper acceptance rate: 44 of 136 submissions (32%)
Overall acceptance rate: 254 of 830 submissions (31%)

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 13
  • Downloads (last 6 weeks): 1

Reflects downloads up to 13 Feb 2025

Cited By

  • (2024) A fusion analytic framework for investigating functional brain connectivity differences using resting-state fMRI. Frontiers in Neuroscience, 18. DOI: 10.3389/fnins.2024.1402657. Online publication date: 11-Dec-2024.
  • (2021) PSLA: Improving Audio Tagging With Pretraining, Sampling, Labeling, and Aggregation. IEEE/ACM Transactions on Audio, Speech and Language Processing, 29, 3292-3306. DOI: 10.1109/TASLP.2021.3120633. Online publication date: 15-Oct-2021.
  • (2020) A sequential self teaching approach for improving generalization in sound event recognition. Proceedings of the 37th International Conference on Machine Learning, 5447-5457. DOI: 10.5555/3524938.3525443. Online publication date: 13-Jul-2020.
  • (2020) At the Speed of Sound: Efficient Audio Scene Classification. Proceedings of the 2020 International Conference on Multimedia Retrieval, 301-305. DOI: 10.1145/3372278.3390730. Online publication date: 8-Jun-2020.
  • (2019) Self-supervised Attention Model for Weakly Labeled Audio Event Classification. 2019 27th European Signal Processing Conference (EUSIPCO), 1-5. DOI: 10.23919/EUSIPCO.2019.8902567. Online publication date: Sep-2019.
  • (2019) Visual Relation Detection with Multi-Level Attention. Proceedings of the 27th ACM International Conference on Multimedia, 121-129. DOI: 10.1145/3343031.3350962. Online publication date: 15-Oct-2019.
  • (2019) Cosine-similarity penalty to discriminate sound classes in weakly-supervised sound event detection. 2019 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN.2019.8852143. Online publication date: Jul-2019.
  • (2019) A Comparison of Five Multiple Instance Learning Pooling Functions for Sound Event Detection with Weak Labeling. ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 31-35. DOI: 10.1109/ICASSP.2019.8682847. Online publication date: May-2019.
