
CDPNet: conformer-based dual path joint modeling network for bird sound recognition


Abstract

Bird species monitoring is important for the preservation of biological diversity because it provides fundamental information for biodiversity assessment and protection. Automatic acoustic recognition is considered an essential technology for realizing automatic monitoring of bird species. Current deep learning-based bird sound recognition methods do not fully model long-term correlations along both the time and frequency axes of the spectrogram, and they have not thoroughly studied the impact of features at different scales on the final recognition. To address these problems, we propose a Conformer-based dual-path joint modeling network (CDPNet) for bird sound recognition. To the best of our knowledge, this is the first attempt to adopt the Conformer in the bird sound recognition task. Specifically, the proposed CDPNet mainly consists of a dual-path time-frequency joint modeling module (DPTFM) and a multi-scale feature fusion module (MSFFM). The former simultaneously captures local time-frequency features, long-term time dependence, and long-term frequency dependence to model bird sound characteristics more effectively. The latter is designed to improve recognition accuracy by fusing features at different scales. The proposed algorithm is implemented on an edge computing platform, the NVIDIA Jetson Nano, to build a real-time bird sound recognition and monitoring system. Ablation experiments verify the benefit of using the DPTFM and the MSFFM. Through training and testing on the Semibirdaudio dataset, which contains 27,155 sound clips, and on the public Birdsdata dataset, the proposed CDPNet outperforms other state-of-the-art models in terms of F1-score, precision, recall, and accuracy.
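To make the dual-path idea concrete, the sketch below is a minimal, illustrative PyTorch module, not the authors' implementation; the class and parameter names are hypothetical. It applies self-attention along the time axis for every frequency bin and along the frequency axis for every time frame, and combines both with a depthwise convolution branch that captures local time-frequency patterns. The actual CDPNet builds these paths from full Conformer blocks and adds the MSFFM on top.

```python
# Illustrative sketch only (not the paper's code): dual-path time-frequency
# modeling of a spectrogram feature map. Names and sizes are assumptions.
import torch
import torch.nn as nn


class DualPathTFBlock(nn.Module):
    def __init__(self, channels: int = 64, num_heads: int = 4):
        super().__init__()
        # Attention over time (sequence = time frames, run per frequency bin)
        self.time_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Attention over frequency (sequence = frequency bins, run per time frame)
        self.freq_attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        # Depthwise 2-D convolution captures local time-frequency patterns
        self.local_conv = nn.Conv2d(channels, channels, kernel_size=3,
                                    padding=1, groups=channels)
        self.norm = nn.LayerNorm(channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, freq_bins, time_frames)
        b, c, f, t = x.shape
        # Long-term time dependence: attend along T for every frequency bin
        xt = x.permute(0, 2, 3, 1).reshape(b * f, t, c)      # (B*F, T, C)
        xt, _ = self.time_attn(xt, xt, xt)
        xt = xt.reshape(b, f, t, c)
        # Long-term frequency dependence: attend along F for every time frame
        xf = x.permute(0, 3, 2, 1).reshape(b * t, f, c)      # (B*T, F, C)
        xf, _ = self.freq_attn(xf, xf, xf)
        xf = xf.reshape(b, t, f, c).permute(0, 2, 1, 3)      # (B, F, T, C)
        # Local time-frequency features from the convolution branch
        xl = self.local_conv(x).permute(0, 2, 3, 1)          # (B, F, T, C)
        # Fuse the branches with a residual connection and normalization
        out = self.norm(xt + xf + xl + x.permute(0, 2, 3, 1))
        return out.permute(0, 3, 1, 2)                       # back to (B, C, F, T)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 80, 128)    # e.g. 80-bin, 128-frame log-mel features
    print(DualPathTFBlock()(feats).shape)  # torch.Size([2, 64, 80, 128])
```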


Data Availability

The code and data are available from the corresponding author on reasonable request.



Funding

This work is supported in part by the National Key R&D Program of China (2022ZD0116304), the Network Security and Informatization Special Project of the Chinese Academy of Sciences under Grant CAS-WX2021SF-0501, and the Yunnan Province-Kunming City Major Science and Technology Project (202202AH210006).

Author information

Authors and Affiliations

Authors

Contributions

Huimin Guo: Conceptualization, Methodology, Software, Writing - Reviewing, and Writing - Original Draft. Haifang Jian: Supervision, Discussion, and Editing. Yiyu Wang: Visualization, Investigation. Hongchang Wang: Discussion, Editing. Shuaikang Zheng: Writing - Reviewing. Qinghua Cheng: Software, Validation. Yuehao Li: Writing - Reviewing.

Corresponding author

Correspondence to Haifang Jian.

Ethics declarations

Competing interests

The authors declare that they have no conflicts of interest regarding this work.

Ethics approval

The authors confirm that they have complied with publication ethics and state that this work is original and has not been published elsewhere.

Consent to participate

The authors are willing to participate in journal promotions and updates.

Consent for publication

The authors give consent to the journal regarding the publication of this work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Guo, H., Jian, H., Wang, Y. et al. CDPNet: conformer-based dual path joint modeling network for bird sound recognition. Appl Intell 54, 3152–3168 (2024). https://doi.org/10.1007/s10489-024-05362-9

