
TRF-Net: a transformer-based RGB-D fusion network for desktop object instance segmentation

  • Original Article
  • Published in: Neural Computing and Applications

Abstract

To perform object-specific tasks on a desktop, robots need to perceive different objects. The challenge is to compute a pixel-wise mask for each object, even in the presence of occlusions and unseen objects. We take a step toward solving this problem by proposing a metric learning-based network, TRF-Net, for desktop object instance segmentation. We design two ResNet-based branches to process the RGB and depth images separately. We then propose a Transformer-based fusion module, TranSE, to fuse the features from both branches. This module also passes the fused features to the decoder, which helps the decoder generate fine-grained features. In addition, we propose a multi-scale feature embedding loss function, the MFE loss, which reduces intra-class distances and increases inter-class distances, improving feature clustering in the embedding space. Due to the lack of large-scale real-world datasets of desktop objects, TRF-Net is trained on a synthetic dataset and tested on a small-scale real-world dataset. The target objects in the test dataset do not appear in the training dataset, ensuring the novelty of the test objects. We demonstrate that our method produces accurate instance segmentation masks, outperforming other state-of-the-art methods on desktop object instance segmentation.
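As a rough illustration of the metric-learning idea behind the MFE loss (pulling embeddings of the same instance toward their mean, pushing the means of different instances apart), here is a minimal pure-Python sketch of a generic push-pull embedding loss. The function name, margin values, and exact formulation are illustrative assumptions; the paper's actual MFE loss operates on multi-scale decoder features and is not reproduced here.

```python
import math

def embedding_loss(embeddings, labels, pull_margin=0.5, push_margin=1.5):
    """Illustrative push-pull metric loss (NOT the paper's MFE loss).

    Pull term: penalize pixels farther than pull_margin from their own
    instance's mean embedding (reduces intra-class distance).
    Push term: penalize instance means closer than push_margin to each
    other (increases inter-class distance).
    """
    # Group embedding vectors by instance label.
    groups = {}
    for emb, lab in zip(embeddings, labels):
        groups.setdefault(lab, []).append(emb)

    # Per-instance mean embeddings.
    means = {}
    for lab, vecs in groups.items():
        dim = len(vecs[0])
        means[lab] = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Pull term: hinged distance of each vector to its own mean.
    pull = 0.0
    for lab, vecs in groups.items():
        pull += sum(max(0.0, dist(v, means[lab]) - pull_margin) ** 2
                    for v in vecs) / len(vecs)
    pull /= len(groups)

    # Push term: hinged distance between every pair of instance means.
    push, pairs = 0.0, 0
    labs = list(means)
    for i in range(len(labs)):
        for j in range(i + 1, len(labs)):
            push += max(0.0, push_margin - dist(means[labs[i]], means[labs[j]])) ** 2
            pairs += 1
    if pairs:
        push /= pairs
    return pull + push
```

With two well-separated, tight clusters the loss is zero; with two instances whose embeddings nearly coincide, the push term dominates, encouraging the network to map different instances to distant regions of the embedding space.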



Data availability statement

This study used data that are publicly available online at the following links: (https://drive.google.com/uc?export=download&id=1Du309Ye8J7v2c4fFGuyPGjf-C3-623vw) and (https://www.acin.tuwien.ac.at/en/vision-for-robotics/software-tools/osd/).


Acknowledgements

This work was supported by the National Natural Science Foundation of China (No. 61973066), the Major Science and Technology Projects of Liaoning Province (No. 2021JH1/10400049), the Foundation of the Key Laboratory of Equipment Reliability (No. WD2C20205500306) and the Foundation of the Key Laboratory of Aerospace System Simulation (No. 6142002200301).

Author information


Corresponding author

Correspondence to Yunzhou Zhang.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article

Cite this article

Cao, H., Zhang, Y., Shan, D. et al. TRF-Net: a transformer-based RGB-D fusion network for desktop object instance segmentation. Neural Comput & Applic 35, 21309–21330 (2023). https://doi.org/10.1007/s00521-023-08886-2

