GI-Grasp: Target-Oriented 6DoF Grasping Strategy with Grasp Intuition Based on Vision-Language Models

  • Conference paper
Intelligent Robotics and Applications (ICIRA 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 15202)

Abstract

Robot grasping is widely recognized as a crucial component of robotic manipulation. Several deep-learning-based grasping algorithms, both planar and 6-degree-of-freedom, have been proposed and have achieved good results in simulation and in the real world. However, when these algorithms estimate grasping poses, the predicted poses may cover the target object yet still fall on parts of it that are poorly suited for grasping. Such algorithms tend to treat the object as a whole and therefore act quite differently from humans, who instinctively reach for the most graspable part. To address this, we propose GI-Grasp, a novel strategy that lets the robot perceive the target object at a finer scale by introducing vision-language models (VLMs) to determine which part of the object is more suitable for grasping, guiding the robot to act more like a human. First, we perform instance segmentation on the RGB image of the grasping scene to detect and localize the objects to be grasped. Second, we use VLMs to supply the robot with prior knowledge of the target object, helping it understand the object's compositional details and identify the spatial constraints relevant to the grasping task. Finally, the resulting part-level position prediction is combined with a grasping algorithm to improve grasping accuracy. Our real-world experiments show that GI-Grasp helps robots grasp objects in a more human-like (and more reasonable) manner, increasing the grasp success rate.
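The paper's implementation is not available in this preview, but as a rough illustration of the final step described above (combining the VLM's part preference with a grasp detector's candidate poses), a minimal re-scoring sketch might look as follows. The function name, the pixel-space projection of grasp centers, and the multiplicative score boost are illustrative assumptions, not the authors' method.

```python
import numpy as np

def rescore_grasps_by_part(grasp_centers_px, grasp_scores, part_mask, boost=2.0):
    """Favor grasp candidates whose centers project onto the VLM-preferred part.

    grasp_centers_px : (N, 2) int array of grasp-center pixel coords (row, col),
                       e.g. 6-DoF grasp centers projected into the RGB image.
    grasp_scores     : (N,) float array of scores from the base grasp detector.
    part_mask        : (H, W) bool array marking the part the VLM judged suitable
                       (e.g. a mug's handle rather than its rim).
    boost            : multiplicative bonus for grasps on the preferred part
                       (an assumed heuristic, not taken from the paper).
    Returns candidate indices sorted best-first.
    """
    rows, cols = grasp_centers_px[:, 0], grasp_centers_px[:, 1]
    h, w = part_mask.shape
    # Only grasp centers that actually project inside the image can be checked
    # against the part mask; the rest keep their original scores.
    in_image = (rows >= 0) & (rows < h) & (cols >= 0) & (cols < w)
    on_part = np.zeros(len(grasp_scores), dtype=bool)
    on_part[in_image] = part_mask[rows[in_image], cols[in_image]]
    adjusted = np.where(on_part, grasp_scores * boost, grasp_scores)
    return np.argsort(-adjusted)

# Toy usage: three candidates; only the second lies on the preferred part.
mask = np.zeros((480, 640), dtype=bool)
mask[100:200, 300:400] = True                        # VLM-selected part region
centers = np.array([[50, 50], [150, 350], [400, 600]])
scores = np.array([0.9, 0.6, 0.8])
print(rescore_grasps_by_part(centers, scores, mask))  # -> [1 0 2]
```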

Acknowledgements

This work was supported in part by the National Key Research and Development Project of China under Grant 2022YFF0902401, in part by the National Natural Science Foundation of China under Grant U22A2063 and Grant 62173083, in part by the Major Program of National Natural Science Foundation of China under Grant 71790614, and in part by the 111 Project under Grant B16009.

Author information


Corresponding author

Correspondence to Haiyu Zhang.

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Jia, T., Zhang, H., Yang, G., Liu, Y., Wang, H., Guo, S. (2025). GI-Grasp: Target-Oriented 6DoF Grasping Strategy with Grasp Intuition Based on Vision-Language Models. In: Lan, X., Mei, X., Jiang, C., Zhao, F., Tian, Z. (eds) Intelligent Robotics and Applications. ICIRA 2024. Lecture Notes in Computer Science (LNAI), vol 15202. Springer, Singapore. https://doi.org/10.1007/978-981-96-0774-7_7

  • DOI: https://doi.org/10.1007/978-981-96-0774-7_7

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-96-0773-0

  • Online ISBN: 978-981-96-0774-7

  • eBook Packages: Computer Science, Computer Science (R0)
