Abstract
A growing body of research indicates that adapting large models to downstream tasks often yields remarkable performance. In ship detection, however, the potential of these large models is frequently underutilized due to domain shift. This paper introduces the Cross-Modal Ship Grounding (CSG) model, which uses an efficient Cross-Modal Adapter (CMA) to transfer the general detection capabilities of large models to ship images, addressing domain shift at minimal training cost. To mitigate complex and variable background interference, a Water-Land Separation (WLS) module is proposed that restricts attention to the water area, suppressing background targets and improving accuracy in complex scenes. Evaluations on both private and public datasets show that the CSG model outperforms state-of-the-art models.
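The paper's CMA is not reproduced on this page, but the general adapter idea it builds on can be illustrated with a minimal, dependency-free sketch: a frozen "large model" projection plus a trainable low-rank branch whose up-projection starts at zero, so the adapted model exactly reproduces the frozen one at initialization (the standard LoRA-style trick). All names here (`W_frozen`, `W_down`, `W_up`, dimensions `D`, `R`) are illustrative assumptions, not the paper's actual architecture.

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def vec_add(a, b):
    return [ai + bi for ai, bi in zip(a, b)]

random.seed(0)
D, R = 8, 2  # feature dimension and adapter bottleneck rank (R << D)

# Frozen backbone projection: left untouched during adaptation.
W_frozen = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(D)]

# Adapter: down-projection randomly initialised, up-projection zero,
# so the adapted output equals the frozen output before any training.
W_down = [[random.gauss(0, 0.1) for _ in range(D)] for _ in range(R)]
W_up = [[0.0 for _ in range(R)] for _ in range(D)]

def adapted_forward(x):
    base = matvec(W_frozen, x)                # frozen path
    delta = matvec(W_up, matvec(W_down, x))   # trainable low-rank path
    return vec_add(base, delta)

x = [random.gauss(0, 1) for _ in range(D)]
print(adapted_forward(x) == matvec(W_frozen, x))  # True at initialisation
```

The appeal of this setup is the parameter count: only the two small matrices (2·D·R values) are trained, versus D·D for full fine-tuning, which is why adapter methods keep the cost of transferring a large model low.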
Acknowledgement
This work was supported by the National Natural Science Foundation of China (Grant No. 62271359).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Hu, Q., Chen, L., Feng, Z., Chen, Y. (2025). Cross-Modal Ship Grounding: Towards Large Model for Enhanced Few-Shot Learning. In: Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, CL., Bhattacharya, S., Pal, U. (eds) Pattern Recognition. ICPR 2024. Lecture Notes in Computer Science, vol 15330. Springer, Cham. https://doi.org/10.1007/978-3-031-78113-1_2
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-78112-4
Online ISBN: 978-3-031-78113-1