Abstract:
Audio event detection aims at discovering the elements inside an audio clip. In addition to labeling clips with their audio events, we want to find the temporal locations of these events. However, creating clearly annotated training data is time-consuming. We therefore propose a model based on convolutional neural networks that relies only on weakly supervised data for training. Such data can be obtained directly from online platforms, such as Freesound, with clip-level labels assigned by the uploaders. The model is extended to a fully convolutional network, and an event-specific Gaussian filter layer is designed to improve its learning ability. Moreover, the model can detect frame-level information, e.g., the temporal position of sounds, even though it is trained merely with clip-level labels.
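The abstract does not give implementation details, but the core idea can be illustrated: smooth per-frame event scores with an event-specific Gaussian kernel (whose width could be learned per event class), then pool over time to obtain a clip-level score, so that only clip-level labels are needed for training. The following is a minimal sketch of that pipeline, not the authors' implementation; the function names, the replicate padding, and the use of max pooling are assumptions for illustration.

```python
import math

def gaussian_kernel(sigma, radius):
    # Discrete Gaussian kernel, normalized to sum to 1.
    # In the paper's setting, sigma would be an event-specific
    # (possibly learnable) parameter; here it is just an argument.
    vals = [math.exp(-0.5 * (i / sigma) ** 2) for i in range(-radius, radius + 1)]
    total = sum(vals)
    return [v / total for v in vals]

def smooth_frame_scores(scores, sigma, radius=2):
    # Temporal smoothing of frame-level scores for one event class,
    # i.e. a 1-D convolution with the Gaussian kernel above.
    kernel = gaussian_kernel(sigma, radius)
    n = len(scores)
    out = []
    for t in range(n):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = min(max(t + j - radius, 0), n - 1)  # replicate padding at edges
            acc += w * scores[idx]
        out.append(acc)
    return out

def clip_score(frame_scores):
    # Pooling the smoothed frame scores over time yields a clip-level
    # prediction, so the loss can be computed from clip-level labels
    # alone; the frame scores still localize the event in time.
    return max(frame_scores)
```

For example, an isolated spike of frame activity is spread over its neighbors by the Gaussian, while the clip-level score stays driven by the strongest frames.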
Published in: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Date of Conference: 05-09 March 2017
Date Added to IEEE Xplore: 19 June 2017
Electronic ISSN: 2379-190X