Audio Tagging With Connectionist Temporal Classification Model Using Sequentially Labelled Data

Hou, Yuanbo; Kong, Qiuqiang; Li, Shengchen

doi:10.1007/978-981-13-6504-1_114

Part of the book series: Lecture Notes in Electrical Engineering (LNEE, volume 516)

Included in the following conference series:

International Conference in Communications, Signal Processing, and Systems

Abstract

Audio tagging aims to predict one or several labels in an audio clip. Many previous works use weakly labelled data (WLD) for audio tagging, where only the presence or absence of sound events is known, but the order of sound events is unknown. To use the order information of sound events, we propose sequentially labelled data (SLD), where both the presence or absence and the order of sound events are known. To utilize SLD in audio tagging, we propose a convolutional recurrent neural network followed by a connectionist temporal classification (CRNN-CTC) objective function to map from an audio clip spectrogram to SLD. Experiments show that CRNN-CTC obtains an area under curve (AUC) score of 0.986 in audio tagging, outperforming the baseline CRNN, which scores 0.908 with max pooling and 0.815 with average pooling. In addition, we show that CRNN-CTC can predict the order of sound events in an audio clip.
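To make the distinction between SLD and WLD concrete, the following is a minimal sketch of the standard CTC collapsing rule (merge consecutive repeats, then drop blanks), applied to a made-up frame-level label path. The abstract does not describe the authors' decoding procedure, so the greedy collapse, the function names `ctc_collapse` and `weak_labels`, the blank symbol `"-"`, and the event names are all assumptions for illustration only.

```python
from itertools import groupby

BLANK = "-"  # hypothetical symbol standing in for the CTC blank label


def ctc_collapse(frame_labels, blank=BLANK):
    """Map a frame-level label path to an ordered event sequence.

    Standard CTC many-to-one mapping: merge consecutive repeats first,
    then remove blanks. A blank between two identical labels keeps them
    as two separate events.
    """
    merged = [label for label, _ in groupby(frame_labels)]
    return [label for label in merged if label != blank]


def weak_labels(sequence):
    """Reduce an ordered event sequence (SLD) to presence-only tags (WLD)."""
    return sorted(set(sequence))


# Illustrative per-frame predictions for one clip (not data from the paper).
path = ["-", "dog", "dog", "-", "-", "speech", "speech", "-", "dog", "-"]

events = ctc_collapse(path)  # order is preserved, as in SLD
tags = weak_labels(events)   # order is discarded, as in WLD

print(events)  # ['dog', 'speech', 'dog']
print(tags)    # ['dog', 'speech']
```

In this sketch the ordered sequence keeps the second occurrence of `dog`, while the clip-level tags only record that `dog` and `speech` are present, which is the information WLD provides.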

Author information

Authors and Affiliations

Beijing University of Posts and Telecommunications, Beijing, China
Yuanbo Hou & Shengchen Li
Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK
Qiuqiang Kong

Corresponding author

Correspondence to Yuanbo Hou.

Editor information

Editors and Affiliations

Department of Electrical Engineering, University of Texas at Arlington, Arlington, TX, USA
Qilian Liang
School of Information and Communication Engineering, Dalian University of Technology, Dalian, China
Xin Liu
School of Information Science and Technology, Dalian Maritime University, Dalian, China
Zhenyu Na
College of Electronic and Communication Engineering, Tianjin Normal University, Tianjin, China
Wei Wang
College of Electronic and Communication Engineering, Tianjin Normal University, Tianjin, China
Jiasong Mu
College of Electronic and Communication Engineering, Tianjin Normal University, Tianjin, China
Baoju Zhang

About this paper

Cite this paper

Hou, Y., Kong, Q., Li, S. (2020). Audio Tagging With Connectionist Temporal Classification Model Using Sequentially Labelled Data. In: Liang, Q., Liu, X., Na, Z., Wang, W., Mu, J., Zhang, B. (eds) Communications, Signal Processing, and Systems. CSPS 2018. Lecture Notes in Electrical Engineering, vol 516. Springer, Singapore. https://doi.org/10.1007/978-981-13-6504-1_114

DOI: https://doi.org/10.1007/978-981-13-6504-1_114
Published: 14 August 2019
Publisher Name: Springer, Singapore
Print ISBN: 978-981-13-6503-4
Online ISBN: 978-981-13-6504-1
eBook Packages: Engineering, Engineering (R0)
