CDPNet: conformer-based dual path joint modeling network for bird sound recognition

Guo, Huimin; Jian, Haifang; Wang, Yiyu; Wang, Hongchang; Zheng, Shuaikang; Cheng, Qinghua; Li, Yuehao

doi:10.1007/s10489-024-05362-9


Published: 02 March 2024

Applied Intelligence, Volume 54, pages 3152–3168 (2024)

Huimin Guo^1,2, Haifang Jian^1,2, Yiyu Wang^3, Hongchang Wang^1,2, Shuaikang Zheng^1,2, Qinghua Cheng^1,2 & Yuehao Li^1,2


Abstract

Bird species monitoring is important for preserving biological diversity because it provides fundamental information for biodiversity assessment and protection. Automatic acoustic recognition is considered an essential technology for automating this monitoring. Current deep learning-based bird sound recognition methods do not fully model long-term correlations along both the time and frequency axes of the spectrogram, nor have they thoroughly studied how features at different scales affect the final recognition result. To address these problems, we propose a Conformer-based dual path joint modeling network (CDPNet) for bird sound recognition. To the best of our knowledge, this is the first attempt to adopt the Conformer in the bird sound recognition task. Specifically, CDPNet mainly consists of a dual-path time-frequency joint modeling module (DPTFM) and a multi-scale feature fusion module (MSFFM). The former simultaneously captures local time-frequency features, long-term time dependence, and long-term frequency dependence to better model bird sound characteristics. The latter is designed to improve recognition accuracy by fusing features at different scales. The proposed algorithm is implemented on an edge computing platform, the NVIDIA Jetson Nano, to build a real-time bird sound recognition and monitoring system. Ablation experiments verify the benefit of the DPTFM and the MSFFM. In training and testing on the Semibirdaudio dataset, which contains 27,155 sound clips, and on the public Birdsdata dataset, the proposed CDPNet outperforms other state-of-the-art models in terms of F1-score, precision, recall, and accuracy.
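The dual-path idea in the abstract, modeling dependence along the time axis and along the frequency axis of the same spectrogram features, can be illustrated with a small sketch. This is not the authors' DPTFM: it assumes plain, weight-free scaled dot-product self-attention in place of a Conformer block, and it assumes the two paths run in parallel and are merged by a residual sum. The function names `self_attention` and `dual_path` are hypothetical.

```python
import math


def self_attention(seq):
    """Weight-free scaled dot-product self-attention over a list of vectors.

    Assumption: queries, keys and values are the inputs themselves; a real
    Conformer block adds learned projections, convolution and feed-forward
    layers.
    """
    d = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in seq]
        peak = max(scores)  # subtract the maximum for a stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        out.append([sum(w * v[i] for w, v in zip(weights, seq)) for i in range(d)])
    return out


def dual_path(feat):
    """Apply attention along time and along frequency, then sum with the input.

    feat[t][f] is the channel vector at time frame t and frequency bin f.
    The parallel layout and the residual sum are assumptions of this sketch.
    """
    n_time, n_freq = len(feat), len(feat[0])

    # Time path: for each frequency bin, attend across all time frames.
    time_out = [[None] * n_freq for _ in range(n_time)]
    for f in range(n_freq):
        attended = self_attention([feat[t][f] for t in range(n_time)])
        for t in range(n_time):
            time_out[t][f] = attended[t]

    # Frequency path: for each time frame, attend across all frequency bins.
    freq_out = [self_attention(feat[t]) for t in range(n_time)]

    return [
        [
            [x + a + b for x, a, b in zip(feat[t][f], time_out[t][f], freq_out[t][f])]
            for f in range(n_freq)
        ]
        for t in range(n_time)
    ]
```

The output has the same time-by-frequency-by-channel shape as the input, so in a design like this the block could be stacked or followed by a fusion stage without reshaping.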

Data Availability

The code and data are available from the corresponding author on reasonable request.

Funding

This work is supported in part by the National Key R&D Program of China (2022ZD0116304), the Network Security and Informatization Special Project of the Chinese Academy of Sciences under Grant CAS-WX2021SF-0501, and the Yunnan Province-Kunming City Major Science and Technology Project (202202AH210006).

Author information

Authors and Affiliations

Laboratory of Solid State Optoelectronics Information Technology, Institute of Semiconductors, Chinese Academy of Sciences, Beijing, 100083, China
Huimin Guo, Haifang Jian, Hongchang Wang, Shuaikang Zheng, Qinghua Cheng & Yuehao Li
University of Chinese Academy of Sciences, Beijing, 100049, China
Huimin Guo, Haifang Jian, Hongchang Wang, Shuaikang Zheng, Qinghua Cheng & Yuehao Li
Shandong Normal University, Shandong, 250014, China
Yiyu Wang

Contributions

Huimin Guo: Conceptualization, Methodology, Software, Writing - Reviewing, and Writing - Original Draft. Haifang Jian: Supervision, Discussion, and Editing. Yiyu Wang: Visualization, Investigation. Hongchang Wang: Discussion, Editing. Shuaikang Zheng: Writing - Reviewing. Qinghua Cheng: Software, Validation. Yuehao Li: Writing - Reviewing.

Corresponding author

Correspondence to Haifang Jian.

Ethics declarations

Competing interest

The authors declare that they have no conflicts of interest regarding this work.

Ethics approval

The authors confirm that they have complied with the publication ethics and state that this work is original and has not been used for publication anywhere before.

Consent to participate

The authors are willing to participate in journal promotions and updates.

Consent for Publication

The authors give consent to the journal regarding the publication of this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Guo, H., Jian, H., Wang, Y. et al. CDPNet: conformer-based dual path joint modeling network for bird sound recognition. Appl Intell 54, 3152–3168 (2024). https://doi.org/10.1007/s10489-024-05362-9


Accepted: 22 February 2024
Published: 02 March 2024
Issue Date: February 2024
DOI: https://doi.org/10.1007/s10489-024-05362-9
