
GLViG: Global and Local Vision GNN May Be What You Need for Vision

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14433)


Abstract

In this article, we propose a novel vision architecture termed GLViG, which leverages graph neural networks (GNNs) to capture both local and important global information in images. To achieve this, GLViG represents image patches as graph nodes and constructs two types of graphs to encode this information; the graphs are then processed by GNNs to enable efficient information exchange between image patches, resulting in superior performance. To address the quadratic computational complexity posed by high-resolution images, GLViG adaptively samples image patches, reducing the computational complexity to linear. Finally, to better adapt GNNs to the 2D image structure, we use depth-wise convolution to dynamically generate positional encodings, overcoming the fixed-size and static limitations of the absolute positional encoding used in ViG. Extensive experiments on image classification, object detection, and image segmentation demonstrate the superiority of the proposed GLViG architecture. Specifically, the GLViG-B1 architecture achieves a significant improvement on ImageNet-1K over the state-of-the-art GNN-based backbone ViG-Tiny (80.7% vs. 78.2%). Moreover, the proposed GLViG surpasses popular computer vision models such as Convolutional Neural Networks (CNNs), Vision Transformers (ViTs), and Vision MLPs. We believe that our method has great potential to advance the capabilities of computer vision and bring a new perspective to the design of vision architectures.
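The depth-wise-convolution positional encoding mentioned above can be sketched in a few lines of PyTorch. This is a minimal illustration only, assuming a 3 \(\times \) 3 kernel and a residual addition; the module name DWConvPosEnc is ours, not the authors' released implementation.

```python
import torch
import torch.nn as nn


class DWConvPosEnc(nn.Module):
    """Positional encoding generated on the fly by a depth-wise convolution.

    Because the convolution slides over the patch grid, the encoding adapts to
    any input resolution, unlike a fixed-size absolute positional embedding.
    """

    def __init__(self, dim: int):
        super().__init__()
        # groups=dim makes the 3x3 convolution depth-wise (one filter per channel)
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) grid of patch features; the conv output is added
        # residually as a dynamically generated positional signal
        return x + self.dwconv(x)


if __name__ == "__main__":
    feats = torch.randn(2, 64, 14, 14)       # a 14 x 14 patch grid with 64 channels
    print(DWConvPosEnc(64)(feats).shape)     # torch.Size([2, 64, 14, 14])
```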


References

1. Chen, K., et al.: MMDetection: open MMLab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)

2. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: DeepLab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)

3. Chen, Z.M., Wei, X.S., Wang, P., Guo, Y.: Multi-label image recognition with graph convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5177–5186 (2019)

4. Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv preprint (2021)

5. Contributors, M.: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020)

6. Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: RandAugment: practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020)

7. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)

8. Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)

9. Han, K., Wang, Y., Guo, J., Tang, Y., Wu, E.: Vision GNN: an image is worth graph of nodes. arXiv preprint arXiv:2206.00272 (2022)

10. Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. Adv. Neural Inf. Process. Syst. 34, 15908–15919 (2021)

11. He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask R-CNN. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)

12. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)

13. Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)

14. Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., Soudry, D.: Augment your batch: improving generalization through instance repetition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8129–8138 (2020)

15. Howard, A.G., et al.: MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)

16. Huang, G., Sun, Y., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39

17. Islam, A., Jia, S., Bruce, N.D.B.: How much position information do convolutional neural networks encode? arXiv preprint (2020)

18. Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408 (2019)

19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)

20. LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)

21. Li, G., Muller, M., Thabet, A., Ghanem, B.: DeepGCNs: can GCNs go as deep as CNNs? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276 (2019)

22. Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: LocalViT: bringing locality to vision transformers. arXiv preprint (2021)

23. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)

24. Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48

25. Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to MLPs. Adv. Neural Inf. Process. Syst. 34, 9204–9215 (2021)

26. Liu, Z., et al.: Swin Transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)

27. Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A ConvNet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)

28. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

29. Paszke, A., et al.: PyTorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)

30. Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control Optim. 30(4), 838–855 (1992)

31. Russakovsky, O., et al.: ImageNet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)

32. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)

33. Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)

34. Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)

35. Tan, M., Le, Q.: EfficientNet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)

36. Tolstikhin, I.O., et al.: MLP-Mixer: an all-MLP architecture for vision. Adv. Neural Inf. Process. Syst. 34, 24261–24272 (2021)

37. Touvron, H., et al.: ResMLP: feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. (2022)

38. Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)

39. Trockman, A., Kolter, J.Z.: Patches are all you need? arXiv preprint arXiv:2201.09792 (2022)

40. Vaswani, A., et al.: Attention is all you need. In: Advances in Neural Information Processing Systems 30 (2017)

41. Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)

42. Wang, W., et al.: PVT v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)

43. Wightman, R., Touvron, H., Jégou, H.: ResNet strikes back: an improved training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)

44. Wightman, R., et al.: PyTorch image models (2019)

45. Woo, S., et al.: ConvNeXt V2: co-designing and scaling ConvNets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)

46. Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9298–9307 (2019)

47. Yu, W., et al.: MetaFormer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829 (2022)

48. Yuan, L., et al.: Tokens-to-token ViT: training vision transformers from scratch on ImageNet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021)

49. Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: CutMix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)

50. Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)

51. Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolutional network for semantic segmentation. arXiv preprint arXiv:1909.06121 (2019)

52. Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13001–13008 (2020)

53. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ADE20K dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)


Acknowledgments

This work was supported by the National Key R&D Program of China (No. 2022ZD0118202) and the National Natural Science Foundation of China (No. 62072386, No. 62376101).

Author information

Correspondence to Taisong Jin.

A Appendix

A.1 Semantic Segmentation

Settings: We chose the ADE20K dataset [53] to evaluate the semantic segmentation performance of GLViG. ADE20K consists of 20,000 training images, 2,000 validation images, and 3,000 test images, covering 150 semantic categories. We used MMSegmentation [5] as the implementation framework and Semantic FPN [18] as the segmentation head. All experiments were conducted on 4 NVIDIA 3090 GPUs. The backbone was initialized with weights pre-trained on ImageNet-1K, while the newly added layers were initialized with Xavier initialization [8]. We optimized our model using AdamW [28] with an initial learning rate of 1e-4. Following common practice [2, 18], we trained our models for 40,000 iterations with a batch size of 32, decaying the learning rate with a polynomial schedule of power 0.9 (sketched below). During training, images were randomly resized and cropped to 512 \(\times \) 512; during testing, images were rescaled so that the shorter side was 512 pixels.
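For concreteness, the polynomial decay schedule described above can be written out directly. This is a minimal sketch under the stated hyper-parameters (base learning rate 1e-4, power 0.9, 40,000 iterations); the helper name poly_lr is illustrative and not taken from our code.

```python
def poly_lr(base_lr: float, cur_iter: int, max_iter: int = 40_000, power: float = 0.9) -> float:
    """Polynomial learning-rate decay: base_lr * (1 - cur_iter / max_iter) ** power."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power


# The AdamW learning rate of 1e-4 decays smoothly toward zero over 40,000 iterations;
# at the halfway point it is roughly 5.4e-5.
for it in (0, 20_000, 39_999):
    print(it, poly_lr(1e-4, it))
```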

Table 4. Results of semantic segmentation on the ADE20K [53] validation set. FLOPs are calculated with an input size of 512 \(\times \) 512 for Semantic FPN.

Results: As shown in Table 4, GLViG outperforms representative backbones on semantic segmentation, including the CNN-based ResNet [12] and the attention-based Vision Transformer PVT [41]. Because no relevant results are available, we do not compare ViG [9] with the proposed method. For example, GLViG-B1 outperforms ResNet-18 by 7.4% mIoU (40.3% vs. 32.9%) and PVT-Tiny by 4.6% mIoU (40.3% vs. 35.7%) under comparable parameters and FLOPs. Overall, GLViG performs competitively in semantic segmentation and consistently outperforms various backbone models (Tables 5 and 6).

A.2 Details of the GLViG Architecture

Table 5. Detailed settings of GLViG series.
Table 6. Training hyper-parameters of GLViG for ImageNet-1K [31].


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Li, T., Lin, W., Zheng, X., Jin, T. (2024). GLViG: Global and Local Vision GNN May Be What You Need for Vision. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14433. Springer, Singapore. https://doi.org/10.1007/978-981-99-8546-3_30

  • DOI: https://doi.org/10.1007/978-981-99-8546-3_30

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8545-6

  • Online ISBN: 978-981-99-8546-3
