Abstract
In this article, we propose GLViG, a novel vision architecture that leverages graph neural networks (GNNs) to capture both local information and important global information in images. GLViG represents image patches as graph nodes and constructs two types of graphs to encode this information; the graphs are then processed by GNNs, which enables efficient information exchange between image patches and yields superior performance. To address the quadratic computational complexity incurred by high-resolution images, GLViG adaptively samples the image patches, reducing the complexity to linear. Finally, to better adapt GNNs to the 2D image structure, we use positional encodings generated dynamically by depth-wise convolution, which overcome the fixed-size and static limitations of the absolute positional encoding in ViG. Extensive experiments on image classification, object detection, and image segmentation demonstrate the superiority of the proposed GLViG architecture. Specifically, GLViG-B1 achieves a significant improvement on ImageNet-1K over the state-of-the-art GNN-based backbone ViG-Tiny (80.7% vs. 78.2%). In addition, GLViG surpasses popular computer vision models such as convolutional neural networks (CNNs), Vision Transformers (ViTs), and vision MLPs. We believe that our method has great potential to advance the capabilities of computer vision and offers a new perspective on the design of vision architectures.
References
Chen, K., et al.: Mmdetection: Open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155 (2019)
Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
Chen, Z.M., Wei, X.S., Wang, P., Guo, Y.: Multi-label image recognition with graph convolutional networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5177–5186 (2019)
Chu, X., et al.: Conditional positional encodings for vision transformers. arXiv: Computer Vision and Pattern Recognition (2021)
MMSegmentation Contributors: MMSegmentation: OpenMMLab semantic segmentation toolbox and benchmark (2020)
Cubuk, E.D., Zoph, B., Shlens, J., Le, Q.V.: Randaugment: Practical automated data augmentation with a reduced search space. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 702–703 (2020)
Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
Glorot, X., Bengio, Y.: Understanding the difficulty of training deep feedforward neural networks. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256. JMLR Workshop and Conference Proceedings (2010)
Han, K., Wang, Y., Guo, J., Tang, Y., Wu, E.: Vision gnn: an image is worth graph of nodes. arXiv preprint arXiv:2206.00272 (2022)
Han, K., Xiao, A., Wu, E., Guo, J., Xu, C., Wang, Y.: Transformer in transformer. Adv. Neural. Inf. Process. Syst. 34, 15908–15919 (2021)
He, K., Gkioxari, G., Dollár, P., Girshick, R.: Mask r-cnn. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2961–2969 (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778 (2016)
Hinton, G., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531 (2015)
Hoffer, E., Ben-Nun, T., Hubara, I., Giladi, N., Hoefler, T., Soudry, D.: Augment your batch: Improving generalization through instance repetition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8129–8138 (2020)
Howard, A.G., et al.: Mobilenets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861 (2017)
Huang, G., Sun, Yu., Liu, Z., Sedra, D., Weinberger, K.Q.: Deep networks with stochastic depth. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9908, pp. 646–661. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46493-0_39
Islam, A., Jia, S., Bruce, N.D.B.: How much position information do convolutional neural networks encode. arXiv: Computer Vision and Pattern Recognition (2020)
Kirillov, A., Girshick, R., He, K., Dollár, P.: Panoptic feature pyramid networks. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6399–6408 (2019)
Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Commun. ACM 60(6), 84–90 (2017)
LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to document recognition. Proc. IEEE 86(11), 2278–2324 (1998)
Li, G., Muller, M., Thabet, A., Ghanem, B.: Deepgcns: can gcns go as deep as cnns? In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9267–9276 (2019)
Li, Y., Zhang, K., Cao, J., Timofte, R., Gool, L.V.: Localvit: bringing locality to vision transformers. Comput. Vis. Pattern Recog. (2021)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Lin, T.-Y., et al.: Microsoft COCO: common objects in context. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8693, pp. 740–755. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10602-1_48
Liu, H., Dai, Z., So, D., Le, Q.V.: Pay attention to mlps. Adv. Neural. Inf. Process. Syst. 34, 9204–9215 (2021)
Liu, Z., et al.: Swin transformer: hierarchical vision transformer using shifted windows. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 10012–10022 (2021)
Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986 (2022)
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. In: Advances in Neural Information Processing Systems 32 (2019)
Polyak, B.T., Juditsky, A.B.: Acceleration of stochastic approximation by averaging. SIAM J. Control. Optim. 30(4), 838–855 (1992)
Russakovsky, O., et al.: Imagenet large scale visual recognition challenge. Int. J. Comput. Vision 115, 211–252 (2015)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Szegedy, C., et al.: Going deeper with convolutions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–9 (2015)
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Tan, M., Le, Q.: Efficientnet: rethinking model scaling for convolutional neural networks. In: International Conference on Machine Learning, pp. 6105–6114. PMLR (2019)
Tolstikhin, I.O., et al.: Mlp-mixer: an all-mlp architecture for vision. Adv. Neural. Inf. Process. Syst. 34, 24261–24272 (2021)
Touvron, H., et al.: Resmlp: feedforward networks for image classification with data-efficient training. IEEE Trans. Pattern Anal. Mach. Intell. (2022)
Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jégou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp. 10347–10357. PMLR (2021)
Trockman, A., Kolter, J.Z.: Patches are all you need? arXiv preprint arXiv:2201.09792 (2022)
Vaswani, A., et al.: Attention is all you need. In: Advances in neural information processing systems 30 (2017)
Wang, W., et al.: Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 568–578 (2021)
Wang, W., et al.: Pvt v2: improved baselines with pyramid vision transformer. Comput. Vis. Media 8(3), 415–424 (2022)
Wightman, R., Touvron, H., Jégou, H.: Resnet strikes back: an improved training procedure in timm. arXiv preprint arXiv:2110.00476 (2021)
Wightman, R., et al.: Pytorch image models (2019)
Woo, S., et al.: Convnext v2: co-designing and scaling convnets with masked autoencoders. arXiv preprint arXiv:2301.00808 (2023)
Xu, H., Jiang, C., Liang, X., Li, Z.: Spatial-aware graph relation network for large-scale object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9298–9307 (2019)
Yu, W., et al.: Metaformer is actually what you need for vision. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10819–10829 (2022)
Yuan, L., et al.: Tokens-to-token vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 558–567 (2021)
Yun, S., Han, D., Oh, S.J., Chun, S., Choe, J., Yoo, Y.: Cutmix: regularization strategy to train strong classifiers with localizable features. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 6023–6032 (2019)
Zhang, H., Cisse, M., Dauphin, Y.N., Lopez-Paz, D.: Mixup: beyond empirical risk minimization. arXiv preprint arXiv:1710.09412 (2017)
Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolutional network for semantic segmentation. arXiv preprint arXiv:1909.06121 (2019)
Zhong, Z., Zheng, L., Kang, G., Li, S., Yang, Y.: Random erasing data augmentation. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 34, pp. 13001–13008 (2020)
Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633–641 (2017)
Acknowledgments
This work was supported by the National Key R&D Program of China (No. 2022ZD0118202) and the National Natural Science Foundation of China (No. 62072386, No. 62376101).
A Appendix
A.1 Semantic Segmentation
Settings: We chose the ADE20K dataset [53] to evaluate the semantic segmentation performance of GLViG. ADE20K consists of 20,000 training images, 2,000 validation images, and 3,000 test images, covering 150 semantic categories. We used MMSEG [5] as the implementation framework and Semantic FPN [18] as the segmentation head. All experiments were conducted on 4 NVIDIA 3090 GPUs. During training, the backbone was initialized with weights pre-trained on ImageNet-1K, while the newly added layers were initialized with Xavier initialization [8]. We optimized the model using AdamW [28] with an initial learning rate of 1e-4. Following common practice [2, 18], we trained our models for 40,000 iterations with a batch size of 32. The learning rate was decayed using a polynomial schedule with a power of 0.9. During training, we randomly resized and cropped the images to \(512 \times 512\). During testing, we rescaled the images so that the shorter side was 512 pixels.
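As an illustration of the polynomial decay schedule described in the settings, the following minimal sketch computes the learning rate at a given iteration. It assumes the commonly used form lr = base_lr · (1 − iter / max_iter)^power; the function name `poly_lr` and the `min_lr` floor are hypothetical and are not taken from the paper or from any particular framework.

```python
def poly_lr(step: int, base_lr: float = 1e-4, max_steps: int = 40_000,
            power: float = 0.9, min_lr: float = 0.0) -> float:
    """Polynomial learning-rate decay.

    base_lr, max_steps and power follow the settings stated in the text;
    min_lr is an assumed floor that the text does not specify.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    progress = step / max_steps  # fraction of training completed
    return (base_lr - min_lr) * (1.0 - progress) ** power + min_lr


if __name__ == "__main__":
    # Print the learning rate at a few points of the 40,000-iteration run.
    for step in (0, 10_000, 20_000, 30_000, 40_000):
        print(f"iter {step:>6}: lr = {poly_lr(step):.3e}")
```

With a power below 1, the schedule decays slightly more slowly than a linear ramp for most of training and then falls off quickly near the final iteration.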
Results: As shown in Table 4, GLViG outperforms representative backbones in semantic segmentation, including the CNN ResNet [12] and the attention-based Vision Transformer PVT [41]. We do not compare with ViG [9] because no relevant results are available. For example, GLViG-B1 outperforms ResNet-18 by 7.4% mIoU (40.3% vs. 32.9%) and PVT-Tiny by 4.6% mIoU (40.3% vs. 35.7%) under comparable parameters and FLOPs. In general, GLViG performs competently in semantic segmentation and consistently outperforms various backbone models (Tables 5 and 6).
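For readers unfamiliar with the metric, the sketch below shows how mean intersection over union (mIoU) is typically computed from predicted and ground-truth label maps. It is a simplified illustration, not the evaluation code used in the paper: the function name `mean_iou` is hypothetical, inputs are flat lists of integer labels, classes absent from both prediction and ground truth are skipped, and ignore labels are not handled.

```python
from typing import Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean IoU over classes that appear in the prediction or the ground truth."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    inter = [0] * num_classes  # pixels where prediction and ground truth agree on class c
    union = [0] * num_classes  # pixels labeled c by the prediction or the ground truth
    for p, t in zip(pred, target):
        if p == t:
            inter[p] += 1
            union[p] += 1
        else:
            union[p] += 1
            union[t] += 1
    # Skip classes that never occur (assumption: they do not count toward the mean).
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    if not ious:
        raise ValueError("no valid classes to average over")
    return sum(ious) / len(ious)


if __name__ == "__main__":
    pred = [0, 0, 1, 1, 2, 2]
    target = [0, 1, 1, 1, 2, 0]
    # Per-class IoU: 1/3, 2/3, 1/2
    print(f"mIoU = {mean_iou(pred, target, num_classes=3):.3f}")
```

The gaps quoted above (for example 40.3% vs. 32.9%) are differences between mIoU values of this kind, computed over the 150 ADE20K categories.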
A.2 Details of the GLViG Architecture
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Li, T., Lin, W., Zheng, X., Jin, T. (2024). GLViG: Global and Local Vision GNN May Be What You Need for Vision. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14433. Springer, Singapore. https://doi.org/10.1007/978-981-99-8546-3_30
DOI: https://doi.org/10.1007/978-981-99-8546-3_30
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8545-6
Online ISBN: 978-981-99-8546-3
eBook Packages: Computer Science, Computer Science (R0)