
TRF-Net: a transformer-based RGB-D fusion network for desktop object instance segmentation

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

To perform object-specific tasks on a desktop, robots need to perceive different objects. The challenge is to compute a pixel-wise mask for each object, even in the presence of occlusions and unseen objects. We take a step toward solving this problem by proposing a metric learning-based network, TRF-Net, for desktop object instance segmentation. We design two ResNet-based branches to process the RGB and depth images separately. We then propose a Transformer-based fusion module, TranSE, to fuse the features from both branches. This module also passes the fused features to the decoder, which helps the decoder generate fine-grained features. In addition, we propose a multi-scale feature embedding loss function, the MFE loss, which reduces intra-class distances and increases inter-class distances, improving feature clustering in the embedding space. Due to the lack of large-scale real-world datasets of desktop objects, TRF-Net is trained on a synthetic dataset and tested on a small-scale real-world dataset. The target objects in the test dataset do not appear in the training dataset, ensuring the novelty of the test objects. We demonstrate that our method produces accurate instance segmentation masks, outperforming other state-of-the-art methods on desktop object instance segmentation.
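As a rough illustration of the metric-learning idea behind the MFE loss (pulling embeddings of the same instance toward their mean, pushing the means of different instances apart), here is a minimal pure-Python sketch of a generic push-pull embedding loss. The function name, margin values, and exact formulation are illustrative assumptions; the paper's actual MFE loss operates on multi-scale decoder features and is not reproduced here.

```python
import math

def embedding_loss(embeddings, labels, pull_margin=0.5, push_margin=1.5):
    """Illustrative push-pull metric loss (NOT the paper's MFE loss).

    Pull term: penalize pixels farther than pull_margin from their own
    instance's mean embedding (reduces intra-class distance).
    Push term: penalize instance means closer than push_margin to each
    other (increases inter-class distance).
    """
    # Group embedding vectors by instance label.
    groups = {}
    for emb, lab in zip(embeddings, labels):
        groups.setdefault(lab, []).append(emb)

    # Per-instance mean embeddings.
    means = {}
    for lab, vecs in groups.items():
        dim = len(vecs[0])
        means[lab] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Pull term: hinged distance of each vector to its own mean.
    pull = 0.0
    for lab, vecs in groups.items():
        pull += sum(max(0.0, dist(v, means[lab]) - pull_margin) ** 2
                    for v in vecs) / len(vecs)
    pull /= len(groups)

    # Push term: hinged distance between every pair of instance means.
    push, pairs = 0.0, 0
    labs = list(means)
    for i in range(len(labs)):
        for j in range(i + 1, len(labs)):
            push += max(0.0, push_margin - dist(means[labs[i]], means[labs[j]])) ** 2
            pairs += 1
    if pairs:
        push /= pairs
    return pull + push
```

With two well-separated, tight clusters the loss is zero; with two instances whose embeddings nearly coincide, the push term dominates, encouraging the network to map different instances to distant regions of the embedding space.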



Data availability statement

This study used data that are publicly available online at the following links: (https://drive.google.com/uc?export=download&id=1Du309Ye8J7v2c4fFGuyPGjf-C3-623vw) and (https://www.acin.tuwien.ac.at/en/vision-for-robotics/software-tools/osd/).


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61973066), the Major Science and Technology Projects of Liaoning Province (No. 2021JH1/10400049), the Foundation of the Key Laboratory of Equipment Reliability (No. WD2C20205500306) and the Foundation of the Key Laboratory of Aerospace System Simulation (No. 6142002200301).

Author information


Corresponding author

Correspondence to Yunzhou Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Cao, H., Zhang, Y., Shan, D. et al. TRF-Net: a transformer-based RGB-D fusion network for desktop object instance segmentation. Neural Comput & Applic 35, 21309–21330 (2023). https://doi.org/10.1007/s00521-023-08886-2

