Abstract
Audio tagging aims to predict one or several labels in an audio clip. Many previous works use weakly labelled data (WLD) for audio tagging, where only the presence or absence of sound events is known and their order is not. To exploit the order information of sound events, we propose sequentially labelled data (SLD), where both the presence or absence and the order of sound events are known. To utilize SLD in audio tagging, we propose a convolutional recurrent neural network trained with a connectionist temporal classification objective function (CRNN-CTC) to map from an audio clip spectrogram to SLD. Experiments show that CRNN-CTC obtains an area under curve (AUC) score of 0.986 in audio tagging, outperforming the baseline CRNN, which scores 0.908 with max pooling and 0.815 with average pooling. In addition, we show that CRNN-CTC is able to predict the order of sound events in an audio clip.
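To make the relationship between SLD and clip-level tags concrete, the following is a minimal sketch of greedy CTC decoding, not the authors' implementation. It assumes a model has already produced per-frame class probabilities, that index 0 is the CTC blank symbol, and that the class names and probability values are purely hypothetical. The helper names `ctc_greedy_decode` and `sequence_to_tags` are introduced here for illustration only.

```python
from itertools import groupby

BLANK = 0  # assumption of this sketch: index 0 is the CTC blank symbol


def ctc_greedy_decode(frame_probs, blank=BLANK):
    """Collapse per-frame class probabilities into an ordered label sequence."""
    # Pick the most probable class at each frame (the "best path").
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]
    # Standard CTC collapse: merge consecutive repeats, then drop blanks.
    return [k for k, _ in groupby(best_path) if k != blank]


def sequence_to_tags(label_sequence, num_classes, blank=BLANK):
    """Reduce an ordered sequence to presence/absence tags (order discarded)."""
    present = set(label_sequence)
    return [1 if c in present else 0 for c in range(num_classes) if c != blank]


# Hypothetical posteriors over {blank, "speech", "dog bark"} for six frames.
frame_probs = [
    [0.10, 0.80, 0.10],
    [0.20, 0.70, 0.10],
    [0.90, 0.05, 0.05],
    [0.10, 0.10, 0.80],
    [0.10, 0.10, 0.80],
    [0.70, 0.20, 0.10],
]

sequence = ctc_greedy_decode(frame_probs)         # ordered events, as in SLD
tags = sequence_to_tags(sequence, num_classes=3)  # presence only, as in WLD
print(sequence, tags)
```

The decoded sequence keeps the order of events, which is the information SLD adds; reducing it to a set of present classes recovers the weaker clip-level tags used for audio tagging.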
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Hou, Y., Kong, Q., Li, S. (2020). Audio Tagging With Connectionist Temporal Classification Model Using Sequentially Labelled Data. In: Liang, Q., Liu, X., Na, Z., Wang, W., Mu, J., Zhang, B. (eds) Communications, Signal Processing, and Systems. CSPS 2018. Lecture Notes in Electrical Engineering, vol 516. Springer, Singapore. https://doi.org/10.1007/978-981-13-6504-1_114
DOI: https://doi.org/10.1007/978-981-13-6504-1_114
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6503-4
Online ISBN: 978-981-13-6504-1
eBook Packages: Engineering, Engineering (R0)