Abstract:
AudioSet, comprising over 2 million human-labeled sound clips, remains one of the largest and most versatile publicly available audio event datasets. Deep neural networks trained on this data are able to detect 527 types of sounds organized in a hierarchical (tree-like) structure called an ontology. Beyond this detection task, these models are also often used as feature extractors or as a basis for knowledge transfer to other sound detection and classification tasks. When describing the AudioSet recordings, raters were asked to choose one or more labels from the ontology. Analysis of the dataset reveals that raters were inconsistent and imprecise when dealing with the hierarchy of sounds: for example, some raters selected only the most precise labels, while others selected all relevant labels (i.e., all parents of the selected child labels). Additionally, a large fraction of sound clips are labeled only with general labels, without any fine-grained labels. These issues harm the quality of the features learned by models trained on AudioSet. As a remedy, we propose two ways in which the dataset can be automatically re-labeled to achieve specific, consistent, and complete label definitions at all levels of the ontology tree. Experimental results show a significant improvement in the performance of new models trained on features extracted from, or initialized with weights transferred from, base models trained on the re-labeled AudioSet data. More generally, this work highlights the importance of careful data labeling as a way to improve model accuracy.
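To illustrate the kind of inconsistency the abstract describes, a minimal sketch is shown below: one simple way to make labels "complete" at all levels of the ontology is to propagate every selected child label to all of its ancestors. This is not necessarily the authors' exact re-labeling procedure, and the parent map and label names here are hypothetical.

    # Illustrative sketch, not the paper's exact method: hierarchical label
    # completion by closing each clip's label set under the parent relation.
    from typing import Dict, Set

    # Hypothetical parent map: child label -> parent label (root labels absent).
    PARENTS: Dict[str, str] = {
        "Acoustic guitar": "Guitar",
        "Guitar": "Musical instrument",
        "Musical instrument": "Music",
    }

    def ancestor_closure(labels: Set[str], parents: Dict[str, str]) -> Set[str]:
        """Return the label set extended with every ancestor in the ontology."""
        closed = set(labels)
        for label in labels:
            node = label
            while node in parents:      # walk up the tree until a root is reached
                node = parents[node]
                closed.add(node)
        return closed

    # A clip labeled only with the most specific class gains all its parents.
    print(ancestor_closure({"Acoustic guitar"}, PARENTS))
    # -> {'Acoustic guitar', 'Guitar', 'Musical instrument', 'Music'}

Under this assumption, a clip labeled only "Acoustic guitar" and a clip labeled with the full chain of parents end up with identical label sets, removing the rater-dependent inconsistency.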
Published in: 2024 Signal Processing: Algorithms, Architectures, Arrangements, and Applications (SPA)
Date of Conference: 25-27 September 2024
Date Added to IEEE Xplore: 17 October 2024