Abstract
Traditional efficient lightweight image classification algorithms generally achieve low accuracy in real-time wetland bird recognition, because wetland environments are complex and many bird species look highly similar. Moreover, a bird recognition server must run computation-intensive, multi-process parallel inference, so the recognition algorithm needs low inference latency. Traditional high-accuracy fine-grained methods cannot meet this demand because of their high computational complexity. In this study, we introduce a scalable two-stage model for real-time wetland bird recognition that combines an object detector with bilinear pooling, a fine-grained image recognition technique, to encode fine-grained features. We also design a lightweight architecture and propose a bilinear scalable module within the bilinear pooling to trade off latency against accuracy. The experimental results show that the proposed method achieves 77.6% and 97.6% accuracy on the CUB and WPB datasets, respectively, much higher than the accuracy of MobileNetV3 and ShuffleNetV2, with an inference latency of only 79.5 ms on CPU. Furthermore, parallel inference experiments in practical environments show that the proposed method reaches an inference speed of 15.3 FPS with 12 parallel video streams.
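The bilinear pooling mentioned in the abstract can be illustrated with a minimal NumPy sketch. It shows only the generic operation (a spatially averaged outer product of channel features, followed by signed square root and L2 normalization), not the paper's bilinear scalable module, whose design the abstract does not describe. The function names, the feature map size, and the random 1x1 projection used to shrink the channel count are all assumptions made for illustration; the projection is merely one plausible way that descriptor size, and hence cost, could be traded against representational capacity.

```python
import numpy as np


def bilinear_pool(features: np.ndarray) -> np.ndarray:
    """Generic bilinear pooling of a (C, H, W) feature map into a (C*C,) vector."""
    c, h, w = features.shape
    x = features.reshape(c, h * w)           # one column per spatial location
    gram = x @ x.T / (h * w)                 # averaged outer product, shape (C, C)
    v = gram.reshape(-1)
    v = np.sign(v) * np.sqrt(np.abs(v))      # signed square-root normalization
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v       # L2 normalization


def reduce_channels(features: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """Hypothetical 1x1 projection from C to K channels (illustrative assumption)."""
    return np.einsum("kc,chw->khw", projection, features)


rng = np.random.default_rng(0)
feat = rng.standard_normal((64, 7, 7))       # stand-in for a backbone feature map
proj = rng.standard_normal((16, 64))         # stand-in for learned 1x1 weights

full = bilinear_pool(feat)                           # 64 * 64 values
small = bilinear_pool(reduce_channels(feat, proj))   # 16 * 16 values
print(full.shape, small.shape)
```

Because the descriptor length grows with the square of the channel count, reducing 64 channels to 16 in this sketch shrinks the descriptor by a factor of 16, which is the kind of knob a latency-versus-accuracy trade-off could turn.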





Data availability
Data associated with this work are available from the corresponding author upon formal request.
Acknowledgements
This research was supported by the National Natural Science Foundation Project of CQ CSTC (No. cstc2020jcyj-msxmX0554).
Ethics declarations
Conflict of interest
The authors declare no conflict of interest.
About this article
Cite this article
Xia, W., Zhou, Q., Wu, D. et al. A scalable two-stage model for real-time Wetland bird recognition. J Supercomput 81, 588 (2025). https://doi.org/10.1007/s11227-025-07061-9
DOI: https://doi.org/10.1007/s11227-025-07061-9