Abstract
Robot grasping is widely recognized as a crucial component of robotic manipulation. Many deep-learning-based planar and 6-degree-of-freedom (6DoF) grasping algorithms have been proposed and have achieved good results both in simulation and in the real world. However, when these algorithms estimate grasp poses, the predicted poses do not always correspond to a sensible grasping site, even when they cover the target object. Such algorithms tend to treat the object as a whole and therefore act in ways that differ markedly from human behavior. To that end, we propose GI-Grasp, a novel strategy that lets the robot perceive the target object at a finer scale by introducing vision-language models (VLMs) to determine which part of the object is most suitable for grasping, guiding the robot to act more like a human. First, we perform instance segmentation on the RGB image of the grasping scene to detect and localize the objects to be grasped. Second, we provide the robot with prior knowledge of the target objects through VLMs, helping it understand their compositional details and identify the spatial constraints relevant to the grasping task. Finally, the predicted suitable grasp region is combined with the grasping algorithm to improve the robot's grasping accuracy. Our real-world experiments show that GI-Grasp's use of object-part features helps the robot grasp items in a more human-like (and reasonable) manner, increasing the grasp success rate.
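For concreteness, the three-stage flow summarized above (instance segmentation, VLM-based part reasoning, grasp filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: segment_target, query_vlm_for_part, detect_grasps, and GraspCandidate are hypothetical placeholders standing in for an instance-segmentation model, a vision-language model, and a 6DoF grasp detector, respectively.

```python
# Minimal sketch of a GI-Grasp-style pipeline. Assumptions: (a) the
# segmentation model returns a binary mask for the target object, (b) the VLM
# returns a mask of the part it judges suitable for grasping, and (c) the 6DoF
# grasp detector reports, for each candidate, the image pixel its closing
# point projects to. All interfaces here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

import numpy as np


@dataclass
class GraspCandidate:
    pose: np.ndarray        # 4x4 gripper pose in the camera frame
    pixel: Tuple[int, int]  # (u, v) image location of the closing point
    score: float            # detector confidence


def gi_grasp(
    rgb: np.ndarray,
    depth: np.ndarray,
    target: str,
    segment_target: Callable[[np.ndarray, str], np.ndarray],
    query_vlm_for_part: Callable[[np.ndarray, np.ndarray, str], np.ndarray],
    detect_grasps: Callable[[np.ndarray, np.ndarray], Sequence[GraspCandidate]],
) -> GraspCandidate:
    """Return the best grasp whose closing point lies on the VLM-preferred part."""
    # 1) Instance segmentation: localize the target object in the RGB image.
    obj_mask = segment_target(rgb, target)

    # 2) VLM prior: which part of the object is sensible to grasp
    #    (e.g. a mug's handle rather than its rim)?
    part_mask = query_vlm_for_part(rgb, obj_mask, target)

    # 3) Run the 6DoF grasp detector, then keep only candidates that close on
    #    the preferred part; fall back to the whole object if none survive.
    candidates = list(detect_grasps(rgb, depth))
    if not candidates:
        raise RuntimeError("no grasp candidates detected")
    on_part = [g for g in candidates if part_mask[g.pixel[1], g.pixel[0]]]
    on_obj = [g for g in candidates if obj_mask[g.pixel[1], g.pixel[0]]]
    pool = on_part or on_obj or candidates
    return max(pool, key=lambda g: g.score)
```

The filtering step is the key design choice implied by the abstract: grasp candidates are not re-ranked globally, but restricted to those whose closing point falls inside the VLM-preferred part mask, with the whole-object mask as a fallback.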
Acknowledgements
This work was supported in part by the National Key Research and Development Project of China under Grant 2022YFF0902401, in part by the National Natural Science Foundation of China under Grant U22A2063 and Grant 62173083, in part by the Major Program of National Natural Science Foundation of China under Grant 71790614, and in part by the 111 Project under Grant B16009.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Jia, T., Zhang, H., Yang, G., Liu, Y., Wang, H., Guo, S. (2025). GI-Grasp: Target-Oriented 6DoF Grasping Strategy with Grasp Intuition Based on Vision-Language Models. In: Lan, X., Mei, X., Jiang, C., Zhao, F., Tian, Z. (eds) Intelligent Robotics and Applications. ICIRA 2024. Lecture Notes in Computer Science, vol. 15202. Springer, Singapore. https://doi.org/10.1007/978-981-96-0774-7_7
Print ISBN: 978-981-96-0773-0
Online ISBN: 978-981-96-0774-7