Abstract
Robot grasping is widely recognized as a crucial component of robotic manipulation. Many deep-learning-based planar and 6-degree-of-freedom (6DoF) grasping algorithms have been proposed and have achieved good results both in simulation and in the real world. However, when these algorithms estimate grasp poses, the predicted poses do not always correspond to a sensible grasping site, even when they cover the target object. Such algorithms tend to treat the object as a whole and therefore act in ways that differ markedly from human behavior. To that end, we propose GI-Grasp, a novel strategy that lets the robot perceive the target object at a finer scale by introducing vision-language models (VLMs) to determine which part of the object is most suitable for grasping, guiding the robot to act more like a human. First, we perform instance segmentation on the RGB image of the grasping scene to detect and localize the objects to be grasped. Second, we provide the robot with prior knowledge of the target objects through VLMs, helping it understand their compositional details and identify the spatial constraints relevant to the grasping task. Finally, the predicted suitable grasp region is combined with the grasping algorithm to improve the robot's grasping accuracy. Our real-world experiments show that GI-Grasp's use of object-part features helps the robot grasp items in a more human-like (and reasonable) manner, increasing the grasp success rate.
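For concreteness, the three-stage flow summarized above (instance segmentation, VLM-based part reasoning, grasp filtering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: segment_target, query_vlm_for_part, detect_grasps, and GraspCandidate are hypothetical placeholders standing in for an instance-segmentation model, a vision-language model, and a 6DoF grasp detector, respectively.

```python
# Minimal sketch of a GI-Grasp-style pipeline. Assumptions: (a) the
# segmentation model returns a binary mask for the target object, (b) the VLM
# returns a mask of the part it judges suitable for grasping, and (c) the 6DoF
# grasp detector reports, for each candidate, the image pixel its closing
# point projects to. All interfaces here are hypothetical placeholders.
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

import numpy as np


@dataclass
class GraspCandidate:
    pose: np.ndarray        # 4x4 gripper pose in the camera frame
    pixel: Tuple[int, int]  # (u, v) image location of the closing point
    score: float            # detector confidence


def gi_grasp(
    rgb: np.ndarray,
    depth: np.ndarray,
    target: str,
    segment_target: Callable[[np.ndarray, str], np.ndarray],
    query_vlm_for_part: Callable[[np.ndarray, np.ndarray, str], np.ndarray],
    detect_grasps: Callable[[np.ndarray, np.ndarray], Sequence[GraspCandidate]],
) -> GraspCandidate:
    """Return the best grasp whose closing point lies on the VLM-preferred part."""
    # 1) Instance segmentation: localize the target object in the RGB image.
    obj_mask = segment_target(rgb, target)

    # 2) VLM prior: which part of the object is sensible to grasp
    #    (e.g. a mug's handle rather than its rim)?
    part_mask = query_vlm_for_part(rgb, obj_mask, target)

    # 3) Run the 6DoF grasp detector, then keep only candidates that close on
    #    the preferred part; fall back to the whole object if none survive.
    candidates = list(detect_grasps(rgb, depth))
    if not candidates:
        raise RuntimeError("no grasp candidates detected")
    on_part = [g for g in candidates if part_mask[g.pixel[1], g.pixel[0]]]
    on_obj = [g for g in candidates if obj_mask[g.pixel[1], g.pixel[0]]]
    pool = on_part or on_obj or candidates
    return max(pool, key=lambda g: g.score)
```

The filtering step is the key design choice implied by the abstract: grasp candidates are not re-ranked globally, but restricted to those whose closing point falls inside the VLM-preferred part mask, with the whole-object mask as a fallback.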
Acknowledgements
This work was supported in part by the National Key Research and Development Project of China under Grant 2022YFF0902401, in part by the National Natural Science Foundation of China under Grant U22A2063 and Grant 62173083, in part by the Major Program of National Natural Science Foundation of China under Grant 71790614, and in part by the 111 Project under Grant B16009.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Jia, T., Zhang, H., Yang, G., Liu, Y., Wang, H., Guo, S. (2025). GI-Grasp: Target-Oriented 6DoF Grasping Strategy with Grasp Intuition Based on Vision-Language Models. In: Lan, X., Mei, X., Jiang, C., Zhao, F., Tian, Z. (eds) Intelligent Robotics and Applications. ICIRA 2024. Lecture Notes in Computer Science, vol. 15202. Springer, Singapore. https://doi.org/10.1007/978-981-96-0774-7_7
Print ISBN: 978-981-96-0773-0
Online ISBN: 978-981-96-0774-7