MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention

Wang, Xixi; Wang, Xiao; Jiang, Bo; Tang, Jin; Luo, Bin

doi:10.1007/s11263-024-02067-x

Published: 24 April 2024

International Journal of Computer Vision, volume 132, pages 3867–3888 (2024)

Abstract

Aggregating multi-modal data to obtain reliable data representations is attracting increasing attention. Recent studies demonstrate that Transformer models usually work well for multi-modal tasks. Existing Transformers generally adopt either the cross-attention (CA) mechanism or simple concatenation to achieve information interaction among different modalities, and both generally ignore the issue of the modality gap. In this work, we rethink the Transformer and extend it to MutualFormer for multi-modal data representation. Rather than the CA used in the Transformer, MutualFormer employs our new cross-diffusion attention (CDA) design to conduct information communication among different modalities. Compared with CA, the proposed CDA has three main advantages. First, the cross-affinities in CDA are defined based on the individual modal affinities (token metrics), which naturally alleviates the modality/domain gap that exists in the traditional token-feature-based CA definition. Second, CDA provides a general scheme that can either be used for multi-modal representation or serve as post-optimization for existing CA models. Third, CDA is implemented efficiently. We successfully apply MutualFormer to several multi-modal learning tasks. Extensive experiments demonstrate the effectiveness of the proposed MutualFormer.
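
The contrast between CA and CDA described above can be illustrated with a small NumPy sketch. The abstract does not state the CDA update rule, so the form used here (propagating one modality's self-affinity through the other's, mixed by a weight `eps` and re-normalized) is an assumption for illustration only, not the authors' formulation. The function names, the weight `eps`, the random projection matrices, and the requirement that both modalities have the same number of tokens are likewise hypothetical.

```python
import numpy as np


def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def self_affinity(tokens, w_q, w_k):
    """Intra-modal token affinity (n x n), computed within one modality only."""
    q, k = tokens @ w_q, tokens @ w_k
    return softmax(q @ k.T / np.sqrt(q.shape[-1]))


def cross_diffuse(a_self, a_other, eps=0.5):
    """Hypothetical update: propagate one affinity through the other modality's.

    Only n x n affinity matrices interact here; token features from
    different modalities are never multiplied together.
    """
    diffused = eps * (a_other @ a_self @ a_other.T) + (1.0 - eps) * a_self
    # Re-normalize so that every row sums to 1 again.
    return diffused / diffused.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
n, d = 6, 8  # assumed: both modalities share the same token count n
x_rgb = rng.normal(size=(n, d))
x_depth = rng.normal(size=(n, d))
# Random stand-ins for learned query/key/value projections.
w_q_rgb, w_k_rgb, w_v_rgb = rng.normal(size=(3, d, d))
w_q_dep, w_k_dep, w_v_dep = rng.normal(size=(3, d, d))

a_rgb = self_affinity(x_rgb, w_q_rgb, w_k_rgb)
a_depth = self_affinity(x_depth, w_q_dep, w_k_dep)

d_rgb = cross_diffuse(a_rgb, a_depth)
d_depth = cross_diffuse(a_depth, a_rgb)

# Each modality aggregates its own values with the diffused affinity.
out_rgb = d_rgb @ (x_rgb @ w_v_rgb)
out_depth = d_depth @ (x_depth @ w_v_dep)
```

A conventional CA layer would instead form an affinity from the queries of one modality and the keys of the other, which compares token features across the modality gap directly. In this sketch the only cross-modal quantities are affinity matrices, which is the property the abstract attributes to CDA.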

Notes

Source Code: https://github.com/ttaalle/multi-modal-vehicle-Re-ID.
As shown in Li et al. (2020), the Baseline model adopts the CAM-based adaptive fusion module to fuse the multi-modal features.

References

Achanta, R., Hemami, S., Estrada, F., & Süsstrunk, S. (2009). Frequency-tuned salient region detection. In Proceedings of the IEEE International Conference on Computer Vision and Pattern Recognition, pp. 1597–1604
Bai, S., Bai, X., Tian, Q., & Latecki, L. J. (2018). Regularized diffusion process on bidirectional context for object retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(5), 1213–1226.
Barbato, F., Rizzoli, G., & Zanuttigh, P. (2023). Depthformer: Multimodal positional encodings and cross-input attention for transformer-based segmentation networks. In ICASSP 2023–2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5
Barbosa, I. B., Cristani, M., Del Bue, A., Bazzani, L., & Murino, V. (2012). Re-identification with rgb-d sensors. In Computer Vision–ECCV 2012. Workshops and Demonstrations: Florence, Italy, October 7–13, 2012, Proceedings, Part I 12, pp. 433–442. Springer
Borji, A., Cheng, M.-M., Jiang, H., & Li, J. (2015). Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24, 5706–5722.
Cao, J., Leng, H., Lischinski, D., Cohen-Or, D., Tu, C., & Li, Y. (2021). Shapeconv: Shape-aware convolutional layer for indoor rgb-d semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7088–7097
Cao, Y., Luo, X., Yang, J., Cao, Y., & Yang, M. Y. (2022). Locality guided cross-modal feature aggregation and pixel-level fusion for multispectral pedestrian detection. Information Fusion, 88, 1–11.
Carion, N., Massa, F., Synnaeve, G., Usunier, N., Kirillov, A., & Zagoruyko, S. (2020). End-to-end object detection with transformers. In Proceedings of European Conference on Computer Vision, pp. 213–229
Chaudhuri, A., Mancini, M., Chen, Y., Akata, Z., & Dutta, A. (2022). Cross-modal fusion distillation for fine-grained sketch-based image retrieval. In 33rd British Machine Vision Conference. BMVA Press
Chen, T., Ding, S., Xie, J., Yuan, Y., Chen, W., Yang, Y., Ren, Z., & Wang, Z. (2019). Abd-net: Attentive but diverse person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 8351–8361
Chen, Q., Liu, Z., Zhang, Y., Fu, K., Zhao, Q., & Du, H. (2021). Rgb-d salient object detection via 3d convolutional neural networks. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 35, pp. 1063–1071
Chen, X., Yan, B., Zhu, J., Wang, D., Yang, X., & Lu, H. (2021). Transformer tracking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8126–8135
Chen, H., Li, Y., Deng, Y., & Lin, G. (2021). Cnn-based rgb-d salient object detection: Learn, select, and fuse. International Journal of Computer Vision, 129(7), 2076–2096.
Chen, L.-Z., Lin, Z., Wang, Z., Yang, Y.-L., & Cheng, M.-M. (2021). Spatial information guided convolution for real-time rgbd semantic segmentation. IEEE Transactions on Image Processing, 30, 2313–2324.
Curto, D., Clapés, A., Selva, J., Smeureanu, S., Junior, J., Jacques, C., Gallardo-Pujol, D., Guilera, G., Leiva, D., & Moeslund, T. B. (2021). Dyadformer: A multi-modal transformer for long-range modeling of dyadic interactions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2177–2188
Dai, Y., Gao, Y., & Liu, F. (2021). Transmed: Transformers advance multi-modal medical image classification. Diagnostics, 11(8), 1384.
Dalmaz, O., Yurt, M., & Çukur, T. (2022). Resvit: Residual vision transformers for multimodal medical image synthesis. IEEE Transactions on Medical Imaging, 41(10), 2598–2614.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019). Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT, vol. 1, pp. 4171–4186
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., & Houlsby, N. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the IEEE International Conference on Learning Representations
Dou, Z.-Y., Kamath, A., Gan, Z., Zhang, P., Wang, J., Li, L., Liu, Z., Liu, C., LeCun, Y., & Peng, N. (2022). Coarse-to-fine vision-language pre-training with fusion in the backbone. Advances in Neural Information Processing Systems, 35, 32942–32956.
Fan, D.-P., Cheng, M.-M., Liu, Y., Li, T., & Borji, A. (2017). Structure-measure: A new way to evaluate foreground maps. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4548–4557
Fan, D.-P., Gong, C., Cao, Y., Ren, B., Cheng, M.-M., & Borji, A. (2018). Enhanced-alignment Measure for Binary Foreground Map Evaluation. In Proceedings of International Joint Conference on Artificial Intelligence, pp. 698–704
Fan, J., Zheng, P., & Lee, C. K. (2022). A multi-granularity scene segmentation network for human-robot collaboration environment perception. In 2022 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 2105–2110. IEEE
Fan, D.-P., Lin, Z., Zhang, Z., Zhu, M., & Cheng, M.-M. (2020). Rethinking rgb-d salient object detection: Models, data sets, and large-scale benchmarks. IEEE Transactions on Neural Networks and Learning Systems, 32(5), 2075–2089.
Feng, C.-M., Yan, Y., Chen, G., Fu, H., Xu, Y., & Shao, L. (2021). Accelerated multi-modal mr imaging with transformers. arXiv:2106.14248
Fu, K., Fan, D.-P., Ji, G.-P., Zhao, Q., Shen, J., & Zhu, C. (2021). Siamese network for rgb-d salient object detection and beyond. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44, 5541–5559.
Gabeur, V., Sun, C., Alahari, K., & Schmid, C. (2020). Multi-modal transformer for video retrieval. In Proceedings of European Conference on Computer Vision, pp. 214–229
George, A., & Marcel, S. (2021). Cross modal focal loss for rgbd face anti-spoofing. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7882–7891
Gu, Z., Niu, L., Zhao, H., & Zhang, L. (2021). Hard pixel mining for depth privileged semantic segmentation. IEEE Transactions on Multimedia, 23, 3738–3751.
He, S., Luo, H., Wang, P., Wang, F., Li, H., & Jiang, W. (2021). Transreid: Transformer-based object re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 15013–15022
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778
Huang, J., Tao, J., Liu, B., Lian, Z., & Niu, M. (2020). Multimodal transformer fusion for continuous emotion recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, pp. 3507–3511
Ji, W., Li, J., Zhang, M., Piao, Y., & Lu, H. (2020). Accurate rgb-d salient object detection via collaborative learning. In Proceedings of European Conference on Computer Vision, pp. 52–69
Ju, R., Ge, L., Geng, W., Ren, T., & Wu, G. (2014). Depth saliency based on anisotropic center-surround difference. In Proceedings of the IEEE International Conference on Image Processing, pp. 1115–1119
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv:1412.6980
Li, C., Cong, R., Piao, Y., Xu, Q., & Loy, C. C. (2020). Rgb-d salient object detection with cross-modality modulation and selection. In Proceedings of European Conference on Computer Vision, pp. 225–241
Li, H., Li, C., Zhu, X., Zheng, A., & Luo, B. (2020). Multi-spectral vehicle re-identification: A challenge. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 34, pp. 11345–11353
Li, D., Wei, X., Hong, X., & Gong, Y. (2020). Infrared-visible cross-modal person re-identification with an x modality. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 34, pp. 4610–4617
Li, X., Yan, H., Qiu, X., & Huang, X.-J. (2020). Flat: Chinese ner using flat-lattice transformer. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 6836–6842
Li, N., Ye, J., Ji, Y., Ling, H., & Yu, J. (2014). Saliency detection on light field. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2806–2813
Li, L. H., Zhang, P., Zhang, H., Yang, J., Li, C., Zhong, Y., Wang, L., Yuan, L., Zhang, L., & Hwang, J.-N. (2022). Grounded language-image pre-training. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10965–10975
Liao, W., Ying Yang, M., Zhan, N., & Rosenhahn, B. (2017). Triplet-based deep similarity learning for person re-identification. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 385–393
Li, C., Cong, R., Kwong, S., Hou, J., Fu, H., Zhu, G., Zhang, D., & Huang, Q. (2020). Asif-net: Attention steered interweave fusion network for rgb-d salient object detection. IEEE Transactions on Cybernetics, 51(1), 88–100.
Li, J., Ji, W., Zhang, M., Piao, Y., Lu, H., & Cheng, L. (2022). Delving into calibrated depth for accurate rgb-d salient object detection. International Journal of Computer Vision, 131, 855–876.
Li, G., Liu, Z., Chen, M., Bai, Z., Lin, W., & Ling, H. (2021). Hierarchical alternate interaction network for rgb-d salient object detection. IEEE Transactions on Image Processing, 30, 3528–3542.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., & Zitnick, C. L. (2014). Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6–12, 2014, Proceedings, Part V 13, pp. 740–755. Springer
Ling, Y., Zhong, Z., Luo, Z., Rota, P., Li, S., & Sebe, N. (2020). Class-aware modality mix and center-guided metric learning for visible-thermal person re-identification. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 889–897
Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2020). Focal loss for dense object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2), 318–327.
Lin, D., & Huang, H. (2020). Zig–zag network for semantic segmentation of rgb-d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(10), 2642–2655.
Liu, Z., Wang, Y., Tu, Z., Xiao, Y., & Tang, B. (2021). Tritransnet: Rgb-d salient object detection with a triplet transformer embedding network. In Proceedings of the ACM International Conference on Multimedia
Liu, N., Zhang, N., & Han, J. (2020). Learning selective self-mutual attention for rgb-d saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 13753–13762
Liu, Y., Zhang, J., Fang, L., Jiang, Q., & Zhou, B. (2021). Multimodal motion prediction with stacked transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
Liu, N., Zhang, N., Wan, K., Shao, L., & Han, J. (2021). Visual saliency transformer. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4722–4732
Luo, H., Gu, Y., Liao, X., Lai, S., & Jiang, W. (2019). Bag of tricks and a strong baseline for deep person re-identification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 0–0
Luo, A., Li, X., Yang, F., Jiao, Z., Cheng, H., & Lyu, S. (2020). Cascade graph neural networks for rgb-d salient object detection. In Proceedings of European Conference on Computer Vision, pp. 346–364
Mao, Y., Zhang, J., Wan, Z., Dai, Y., Li, A., Lv, Y., Tian, X., Fan, D.-P., & Barnes, N. (2021). Transformer transforms salient object detection and camouflaged object detection. arXiv:2104.10127
Mogelmose, A., Bahnsen, C., Moeslund, T., Clapés, A., & Escalera, S. (2013). Tri-modal person re-identification with rgb, depth and thermal features. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 301–307
Munaro, M., Fossati, A., Basso, A., Menegatti, E., & Van Gool, L. (2014). One-shot person re-identification with a consumer depth camera. Person Re-Identification, pp. 161–181
Nagrani, A., Yang, S., Arnab, A., Jansen, A., Schmid, C., & Sun, C. (2021). Attention bottlenecks for multimodal fusion. Advances in Neural Information Processing Systems, 34, 14200–14213.
Nguyen, D. T., Hong, H. G., Kim, K. W., & Park, K. R. (2017). Person recognition system based on a combination of body images from visible light and thermal cameras. Sensors, 17(3), 605.
Niu, Y., Geng, Y., Li, X., & Liu, F. (2012). Leveraging stereopsis for saliency analysis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 454–461
Ni, J., Zhang, Z., Shen, K., Tang, G., & Yang, S. X. (2023). An improved deep network-based rgb-d semantic segmentation method for indoor scenes. International Journal of Machine Learning and Cybernetics, 15, 589–604.
Pan, W., Wu, H., Zhu, J., Zeng, H., & Zhu, X. (2022). H-vit: Hybrid vision transformer for multi-modal vehicle re-identification. In Artificial Intelligence: Second CAAI International Conference, CICAI 2022, Beijing, China, August 27–28, 2022, Revised Selected Papers, Part I, pp. 255–267. Springer
Pang, Y., Zhang, L., Zhao, X., & Lu, H. (2020). Hierarchical dynamic filtering network for rgb-d salient object detection. In Proceedings of European Conference on Computer Vision, pp. 235–252
Peng, H., Li, B., Xiong, W., Hu, W., & Ji, R. (2014). Rgbd salient object detection: A benchmark and algorithms. In Proceedings of European Conference on Computer Vision, pp. 92–109
Perazzi, F., Krähenbühl, P., Pritch, Y., & Hornung, A. (2012). Saliency filters: Contrast based filtering for salient region detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 733–740
Piao, Y., Ji, W., Li, J., Zhang, M., & Lu, H. (2019). Depth-induced multi-scale recurrent attention network for saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 7254–7263
Piao, Y., Rong, Z., Zhang, M., Ren, W., & Lu, H. (2020). A2dele: Adaptive and attentive depth distiller for efficient rgb-d salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9060–9069
Qu, L., He, S., Zhang, J., Tian, J., Tang, Y., & Yang, Q. (2017). Rgbd salient object detection via deep fusion. IEEE Transactions on Image Processing, 26(5), 2274–2285.
Rahman, M. A., & Wang, Y. (2016). Optimizing intersection-over-union in deep neural networks for image segmentation. In International Symposium on Visual Computing, pp. 234–244
Ren, J., Gong, X., Yu, L., Zhou, W., & Ying Yang, M. (2015). Exploiting global priors for rgb-d saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pp. 25–32
Ren, L., Lu, J., Feng, J., & Zhou, J. (2019). Uniform and variational deep learning for rgb-d object recognition and person re-identification. IEEE Transactions on Image Processing, 28(10), 4970–4983.
Rizzoli, G., Shenaj, D., & Zanuttigh, P. (2023). Source-free domain adaptation for rgb-d semantic segmentation with vision transformers. arXiv:2305.14269
Rizzoli, G., Barbato, F., & Zanuttigh, P. (2022). Multimodal semantic segmentation in autonomous driving: A review of current approaches and future perspectives. Technologies, 10(4), 90.
Shen, F., Xie, Y., Zhu, J., Zhu, X., & Zeng, H. (2023). Git: Graph interactive transformer for vehicle re-identification. IEEE Transactions on Image Processing, 32, 1039–1051.
Silberman, N., Hoiem, D., Kohli, P., & Fergus, R. (2012). Indoor segmentation and support inference from rgbd images. In Computer Vision–ECCV 2012: 12th European Conference on Computer Vision, Florence, Italy, October 7–13, 2012, Proceedings, Part V 12, pp. 746–760. Springer
Sun, Y., Cheng, C., Zhang, Y., Zhang, C., Zheng, L., Wang, Z., & Wei, Y. (2020). Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6398–6407
Sun, Y., Zheng, L., Yang, Y., Tian, Q., & Wang, S. (2018). Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of European Conference on Computer Vision, pp. 480–496
Tan, H., & Bansal, M. (2019). Lxmert: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, pp. 5100–5111
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., & Jégou, H. (2021). Training data-efficient image transformers & distillation through attention. In Proceedings of International Conference on Machine Learning, pp. 10347–10357. PMLR
Truong, T.-D., Duong, C. N., Pham, H. A., Raj, B., Le, N., & Luu, K. (2021). The right to talk: An audio-visual transformer approach. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1105–1114
Van der Maaten, L., & Hinton, G. (2008). Visualizing data using t-sne. Journal of Machine Learning Research, 9(11), 2579–2605.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30, 6000–6010.
Wang, Y., Chen, X., Cao, L., Huang, W., Sun, F., & Wang, Y. (2022). Multimodal token fusion for vision transformers. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 12186–12195
Wang, G., Yuan, Y., Chen, X., Li, J., & Zhou, X. (2018). Learning discriminative features with multiple granularities for person re-identification. In Proceedings of the 26th ACM International Conference on Multimedia, pp. 274–282
Wang, G., Zhang, T., Cheng, J., Liu, S., Yang, Y., & Hou, Z. (2019). Rgb-infrared cross-modality person re-identification via joint pixel and feature alignment. In Proceedings of the IEEE International Conference on Computer Vision, pp. 3623–3632
Wei, J., Wang, S., & Huang, Q. (2020). F³net: Fusion, feedback and focus for salient object detection. In Proceedings of AAAI Conference on Artificial Intelligence, pp. 12321–12328
Wu, S., Song, X., & Feng, Z. (2021). Mect: Multi-metadata embedding based cross-transformer for Chinese named entity recognition. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 1529–1539
Wu, A., Zheng, W.-S., Yu, H.-X., Gong, S., & Lai, J. (2017). Rgb-infrared cross-modality person re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 5380–5389
Wu, X., & Li, T. (2023). Sentimental visual captioning using multimodal transformer. International Journal of Computer Vision, 131, 1073–1090.
Xu, P., Zhu, X., & Clifton, D. A. (2023). Multimodal learning with transformers: A survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(10), 12113–12132.
Ye, M., Lan, X., Wang, Z., & Yuen, P. C. (2019). Bi-directional center-constrained top-ranking for visible thermal person re-identification. IEEE Transactions on Information Forensics and Security, 15, 407–419.
Yu, W., Luo, M., Zhou, P., Si, C., Zhou, Y., Wang, X., Feng, J., & Yan, S. (2022). Metaformer is actually what you need for vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 10819–10829
Yuan, L., Chen, Y., Wang, T., Yu, W., Shi, Y., Jiang, Z.-H., Tay, F. E., Feng, J., & Yan, S. (2021). Tokens-to-token vit: Training vision transformers from scratch on imagenet. In Proceedings of the IEEE International Conference on Computer Vision, pp. 558–567
Yu, J., Li, J., Yu, Z., & Huang, Q. (2020). Multimodal transformer with multi-view visual representation for image captioning. IEEE Transactions on Circuits and Systems for Video Technology, 30(12), 4467–4480.
Zhai, Y., Zeng, Y., Cao, D., & Lu, S. (2022). Trireid: Towards multi-modal person re-identification via descriptive fusion model. In Proceedings of the 2022 International Conference on Multimedia Retrieval, pp. 63–71
Zhang, Q., Lei, Z., Zhang, Z., & Li, S. Z. (2020). Context-aware attention network for image-text retrieval. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3536–3545
Zhang, J., Liu, H., Yang, K., Hu, X., Liu, R., & Stiefelhagen, R. (2023). Cmx: Cross-modal fusion for rgb-x semantic segmentation with transformers. IEEE Transactions on Intelligent Transportation Systems, 24(12), 14679–14694.
Zhang, M., Ren, W., Piao, Y., Rong, Z., & Lu, H. (2020). Select, supplement and focus for rgb-d saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3472–3481
Zhang, M., Zhang, Y., Piao, Y., Hu, B., & Lu, H. (2020). Feature reintegration over differential treatment: A top-down and adaptive fusion network for rgb-d salient object detection. In Proceedings of the 28th ACM International Conference on Multimedia, pp. 4107–4115
Zhao, J.-X., Cao, Y., Fan, D.-P., Cheng, M.-M., Li, X.-Y., & Zhang, L. (2019). Contrast prior and fluid pyramid integration for rgbd salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3922–3931
Zhao, J., Zhao, Y., Li, J., Yan, K., & Tian, Y. (2021). Heterogeneous relational complement for vehicle re-identification. In Proceedings of the IEEE International Conference on Computer Vision, pp. 205–214
Zheng, W.-S., Hong, J., Jiao, J., Wu, A., Zhu, X., Gong, S., Qin, J., & Lai, J. (2022). Joint bilateral-resolution identity modeling for cross-resolution person re-identification. International Journal of Computer Vision, 130, 136–156.
Zheng, A., Wang, Z., Chen, Z., Li, C., & Tang, J. (2021). Robust multi-modality person re-identification. In Proceedings of AAAI Conference on Artificial Intelligence, vol. 35, pp. 3529–3537
Zheng, A., Zhu, X., Li, C., Tang, J., & Ma, J. (2022). Multi-spectral vehicle re-identification with cross-directional consistency network and a high-quality benchmark. arXiv:2208.00632
Zhou, H., Qi, L., Wan, Z., Huang, H., & Yang, X. (2021). Rgb-d co-attention network for semantic segmentation. In Proceedings of the Asian Conference on Computer Vision, pp. 519–536
Zhou, F., Lai, Y.-K., Rosin, P. L., Zhang, F., & Hu, Y. (2022). Scale-aware network with modality-awareness for rgb-d indoor semantic segmentation. Neurocomputing, 492, 464–473.
Zolfaghari, M., Zhu, Y., Gehler, P., & Brox, T. (2021). Crossclr: Cross-modal contrastive learning for multi-modal video representations. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1450–1459

Acknowledgements

This research was supported in part by the Anhui Provincial Key Research and Development Program under Grant 2022i01020014, in part by the National Natural Science Foundation of China under Grants 62076004 and 62102205, and in part by the Natural Science Foundation of Anhui Province under Grant 2108085Y23.

Author information

Authors and Affiliations

School of Computer Science and Technology, Anhui University, Hefei, 230601, China
Xixi Wang, Xiao Wang, Bo Jiang, Jin Tang & Bin Luo

Corresponding author

Correspondence to Bo Jiang.

Additional information

Communicated by Vittorio Murino.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, X., Wang, X., Jiang, B. et al. MutualFormer: Multi-modal Representation Learning via Cross-Diffusion Attention. Int J Comput Vis 132, 3867–3888 (2024). https://doi.org/10.1007/s11263-024-02067-x

Received: 03 March 2023
Accepted: 23 March 2024
Published: 24 April 2024
Issue Date: September 2024
DOI: https://doi.org/10.1007/s11263-024-02067-x

Part of a collection: Special Issue on Multimodal Learning
