GLViG: Global and Local Vision GNN May Be What You Need for Vision

Li, Tanzhe; Lin, Wei; Zheng, Xiawu; Jin, Taisong

doi:10.1007/978-981-99-8546-3_30

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14433)

Included in the following conference series:

Chinese Conference on Pattern Recognition and Computer Vision (PRCV)

Abstract

We propose GLViG, a novel vision architecture that uses graph neural networks (GNNs) to capture both local information and important global information in images. GLViG represents image patches as graph nodes and constructs two types of graphs to encode this information; GNNs then process the graphs so that image patches exchange information efficiently. To address the quadratic computational complexity incurred by high-resolution images, GLViG adaptively samples the image patches, reducing the complexity to linear. Finally, to better adapt GNNs to the 2D image structure, we generate positional encodings dynamically with depth-wise convolution, which avoids the fixed-size and static limitations of the absolute position encoding in ViG. Extensive experiments on image classification, object detection, and image segmentation demonstrate the superiority of the proposed architecture. Specifically, GLViG-B1 achieves a significant improvement on ImageNet-1K over the state-of-the-art GNN-based backbone ViG-Tiny (80.7% vs. 78.2%). GLViG also surpasses popular computer vision models such as Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Vision MLPs. We believe that our method has great potential to advance computer vision and offers a new perspective on the design of vision architectures.
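The abstract does not say how the two graphs are built. As a rough illustration of treating patches as graph nodes, the sketch below assumes a k-nearest-neighbour rule in feature space, the rule used by ViG [Han et al.]; the function name `knn_graph` and the toy patch embeddings are hypothetical and are not taken from the paper.

```python
import math

def knn_graph(features, k):
    """For each node, return the indices of its k nearest other nodes.

    Assumption: Euclidean distance between patch embeddings, brute force.
    """
    n = len(features)
    neighbours = []
    for i in range(n):
        # Distance from node i to every other node, paired with that node's index.
        dists = [(math.dist(features[i], features[j]), j) for j in range(n) if j != i]
        dists.sort()
        neighbours.append([j for _, j in dists[:k]])
    return neighbours

# Four hypothetical 2-D patch embeddings forming two tight clusters.
patches = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
edges = knn_graph(patches, k=1)
print(edges)  # [[1], [0], [3], [2]]
```

This brute-force search compares every pair of patches, so its cost grows quadratically with the number of patches. That is the cost the abstract says GLViG reduces to linear by adaptively sampling patches; the sampling step itself is not described in this preview and is not shown here.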

References

Chen, K., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Chen, Z.M., Wei, X.S., Wang, P., Guo, Y.: Multi-label image recognition with graph convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5177–5186 (2019)
Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv: Computer Vision and Pattern Recognition (2021)
Contributors, M.: Mmsegmentation: Openmmlab semantic segmentation toolbox and benchmark (2020)
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
Han, K., Wang, Y., Guo, J., Tang, Y., Wu, E.: Vision gnn: an image is worth graph of nodes. arXiv preprint arXiv:2206.00272 (2022)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., Soudry, D.: Augment your batch: Improving generalization through instance repetition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8129–8138 (2020)
Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Islam, A., Jia, S., Bruce, N.D.B.: How much position information do convolutional neural networks encode. arXiv: Computer Vision and Pattern Recognition (2020)
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408 (2019)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: can gcns go as deep as cnns? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276 (2019)
Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: Localvit: bringing locality to vision transformers. Comput. Vis. Pattern Recog. (2021)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to mlps. Adv. Neural. Inf. Process. Syst. 34, 9204–9215 (2021)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)
Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 30(4), 838–855 (1992)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
Tolstikhin, I.O., et al.: Mlp-mixer: an all-mlp architecture for vision. Adv. Neural. Inf. Process. Syst. 34, 24261–24272 (2021)
Touvron, H., et al.: Resmlp: feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Trockman, A., Kolter, J.Z.: Patches are all you need? arXiv preprint arXiv:2201.09792 (2022)
Vaswani, A., et al.: Attention is all you need. In: Advances in neural information processing systems 30 (2017)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wang, W., et al.: Pvt v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)
Wightman, R., Touvron, H., Jégou, H.: Resnet strikes back: an improved training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)
Wightman, R., et al.: Pytorch image models (2019)
Woo, S., et al.: Convnext v2: co-designing and scaling convnets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)
Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9298–9307 (2019)
Yu, W., et al.: Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829 (2022)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolutional network for semantic segmentation. arXiv preprint arXiv:1909.06121 (2019)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13001–13008 (2020)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)

Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2022ZD0118202) and the National Natural Science Foundation of China (No. 62072386, No. 62376101).

Author information

Authors and Affiliations

Key Laboratory of Multimedia Trusted Perception and Efficient Computing, Ministry of Education of China, School of Informatics, Xiamen University, Xiamen, China
Tanzhe Li, Wei Lin & Taisong Jin
Peng Cheng Laboratory, Shenzhen, China
Xiawu Zheng

Corresponding author

Correspondence to Taisong Jin.

Editor information

Editors and Affiliations

Nanjing University of Information Science and Technology, Nanjing, China
Qingshan Liu
Xiamen University, Xiamen, China
Hanzi Wang
Beijing University of Posts and Telecommunications, Beijing, China
Zhanyu Ma
Sun Yat-sen University, Guangzhou, China
Weishi Zheng
Peking University, Beijing, China
Hongbin Zha
Chinese Academy of Sciences, Beijing, China
Xilin Chen
Chinese Academy of Sciences, Beijing, China
Liang Wang
Xiamen University, Xiamen, China
Rongrong Ji

A Appendix

A.1 Semantic Segmentation

Settings: We chose the ADE20K dataset [53] to evaluate the semantic segmentation performance of GLViG. ADE20K consists of 20,000 training images, 2,000 validation images, and 3,000 test images, covering 150 semantic categories. We used MMSEG [5] as the implementation framework and Semantic FPN [18] as the segmentation head. All experiments were conducted on 4 NVIDIA 3090 GPUs. The backbone was initialized with weights pre-trained on ImageNet-1K, while the newly added layers were initialized with Xavier initialization [8]. We optimized the model using AdamW [28] with an initial learning rate of 1e-4. Following common practice [2, 18], we trained for 40,000 iterations with a batch size of 32. The learning rate was decayed using a polynomial decay schedule with a power of 0.9. During training, we randomly resized and cropped the images to \(512 \times 512\). During testing, we rescaled the images so that the shorter side was 512 pixels.
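The polynomial decay schedule described above can be sketched as follows. This is a minimal illustration assuming the common form \(\mathrm{lr}(t) = \mathrm{lr}_0 \, (1 - t/T)^{p}\) with no warm-up and no minimum learning rate; the source does not state these details, and the function name `poly_lr` is hypothetical rather than part of MMSEG.

```python
def poly_lr(step, base_lr=1e-4, max_steps=40_000, power=0.9):
    """Learning rate at a given iteration under polynomial decay.

    Defaults mirror the settings quoted in the text (initial rate 1e-4,
    40,000 iterations, power 0.9). Warm-up and a learning-rate floor are
    omitted; that omission is an assumption of this sketch.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power

# Sample the schedule at the start, midpoint, and end of training.
for step in (0, 20_000, 40_000):
    print(step, poly_lr(step))
```

With a power below 1, the rate falls slightly more slowly than a linear ramp for most of training and drops to zero only at the final iteration.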

Table 4. Results of semantic segmentation on ADE20K [53] validation set. We calculate FLOPs with input size \(512 \times 512\) for Semantic FPN.

Results: As shown in Table 4, GLViG outperforms representative backbones in semantic segmentation, including the CNN ResNet [12] and the attention-based Vision Transformer PVT [41]. For example, GLViG-B1 outperforms ResNet-18 by 7.4% mIoU (40.3% vs. 32.9%) and PVT-Tiny by 4.6% mIoU (40.3% vs. 35.7%) under comparable parameters and FLOPs. We did not compare against ViG [9] because no relevant data is available. In general, GLViG performs competently in semantic segmentation and consistently outperforms the compared backbone models (Tables 5 and 6).
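For readers unfamiliar with the mIoU metric quoted above, the following sketch computes the mean intersection-over-union from flattened per-pixel labels. It is a simplified illustration, not the evaluation code used in the paper: the function name `mean_iou` and the toy labels are hypothetical, and classes absent from both prediction and ground truth are skipped, which is an assumption of this sketch.

```python
def mean_iou(pred, target, num_classes):
    """Mean intersection-over-union over classes that occur in pred or target."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, t in zip(pred, target):
        if p == t:
            # Correct pixel: counts toward both intersection and union of its class.
            inter[t] += 1
            union[t] += 1
        else:
            # Wrong pixel: enlarges the union of the predicted and the true class.
            union[p] += 1
            union[t] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0

# Hypothetical four-pixel example with two classes.
pred = [0, 0, 1, 1]
target = [0, 1, 1, 1]
score = mean_iou(pred, target, num_classes=2)
print(round(score, 4))  # 0.5833
```

In the toy example, class 0 has IoU 1/2 and class 1 has IoU 2/3, giving a mean of 7/12. The gaps reported in the text are differences between such scores expressed in percent, for example 40.3 - 32.9 = 7.4.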

A.2 Details of the GLViG Architecture

Table 5. Detailed settings of GLViG series.

Table 6. Training hyper-parameters of GLViG for ImageNet-1K [31].

Cite this paper

Li, T., Lin, W., Zheng, X., Jin, T. (2024). GLViG: Global and Local Vision GNN May Be What You Need for Vision. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14433. Springer, Singapore. https://doi.org/10.1007/978-981-99-8546-3_30

DOI: https://doi.org/10.1007/978-981-99-8546-3_30
Published: 26 December 2023
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8545-6
Online ISBN: 978-981-99-8546-3
eBook Packages: Computer Science, Computer Science (R0)
