Abstract
Bird species monitoring is important for preserving biological diversity because it provides fundamental information for biodiversity assessment and protection. Automatic acoustic recognition is considered an essential technology for automating this monitoring. Current deep learning-based bird sound recognition methods do not fully model long-term correlations along both the time and frequency axes of the spectrogram, nor have they thoroughly examined how features at different scales affect the final recognition result. To address these problems, we propose a Conformer-based dual-path joint modeling network (CDPNet) for bird sound recognition. To the best of our knowledge, this is the first attempt to apply the Conformer to the bird sound recognition task. The proposed CDPNet mainly consists of a dual-path time-frequency joint modeling module (DPTFM) and a multi-scale feature fusion module (MSFFM). The former simultaneously captures local time-frequency features, long-term time dependence, and long-term frequency dependence to better model bird sound characteristics. The latter improves recognition accuracy by fusing features at different scales. The proposed algorithm is implemented on an edge computing platform, the NVIDIA Jetson Nano, to build a real-time bird sound monitoring system. Ablation experiments verify the benefit of both the DPTFM and the MSFFM. In experiments on the Semibirdaudio dataset, which contains 27,155 sound clips, and on the public Birdsdata dataset, the proposed CDPNet outperforms other state-of-the-art models in terms of F1-score, precision, recall, and accuracy.
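For readers unfamiliar with the four reported metrics, the following is a minimal sketch of how accuracy, precision, recall, and F1-score can be computed for a multi-class recognition task. It is an illustration only: the abstract does not state which averaging scheme is used, so macro averaging (the unweighted mean over classes) is an assumption here, and the function name `macro_metrics` and the species labels are hypothetical.

```python
from collections import Counter


def macro_metrics(y_true, y_pred):
    """Return accuracy and macro-averaged precision, recall, and F1.

    Macro averaging is an assumption; the source does not specify it.
    """
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")

    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1   # correct prediction for this class
        else:
            fp[pred] += 1    # predicted class gains a false positive
            fn[truth] += 1   # true class gains a false negative

    precisions, recalls, f1s = [], [], []
    for c in labels:
        p = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)

    n = len(labels)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical labels for six clips from three species.
y_true = ["owl", "owl", "crane", "crane", "finch", "finch"]
y_pred = ["owl", "crane", "crane", "crane", "finch", "owl"]
scores = macro_metrics(y_true, y_pred)
```

With macro averaging every species counts equally regardless of how many clips it has, which matters when a dataset is imbalanced across species; a weighted or micro average would give different numbers.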
Data Availability
The code and data are available from the corresponding author on reasonable request.
Funding
This work is supported in part by the National Key R&D Program of China (2022ZD0116304), the Network Security and Informatization Special Project of the Chinese Academy of Sciences under Grant CAS-WX2021SF-0501, and the Yunnan Province-Kunming City Major Science and Technology Project (202202AH210006).
Author information
Contributions
Huimin Guo: Conceptualization, Methodology, Software, Writing (Review), and Writing (Original Draft). Haifang Jian: Supervision, Discussion, and Editing. Yiyu Wang: Visualization and Investigation. Hongchang Wang: Discussion and Editing. Shuaikang Zheng: Writing (Review). Qinghua Cheng: Software and Validation. Yuehao Li: Writing (Review).
Ethics declarations
Competing interest
The authors declare that they have no conflicts of interest regarding this work.
Ethics approval
The authors confirm that they have complied with the publication ethics and state that this work is original and has not been used for publication anywhere before.
Consent to participate
The authors are willing to participate in journal promotions and updates.
Consent for publication
The authors give consent to the journal regarding the publication of this work.
About this article
Cite this article
Guo, H., Jian, H., Wang, Y. et al. CDPNet: conformer-based dual path joint modeling network for bird sound recognition. Appl Intell 54, 3152–3168 (2024). https://doi.org/10.1007/s10489-024-05362-9
DOI: https://doi.org/10.1007/s10489-024-05362-9