
Local structure consistency and pixel-correlation distillation for compact semantic segmentation

Published in: Applied Intelligence

Abstract

Current state-of-the-art semantic segmentation methods usually contain millions of parameters and require high computational resources, which limits their applicability in low-resource settings. Knowledge distillation is a promising way to achieve a good trade-off between performance and efficiency. In this paper, we propose a novel local structure consistency distillation (LSCD) method to improve the segmentation accuracy of compact networks. Unlike previous works, which mainly transfer pixel-level and image-level knowledge, we propose to transfer patch-level knowledge. Specifically, we introduce the local structure consistency as the patch-level knowledge, integrating the structural similarity index measure (SSIM) into our framework to impose local structural constraints between the outputs of the teacher and the student. Furthermore, we propose pixel-correlation distillation to capture the contextual dependencies between any two pixels of the feature maps in a global view. Distilling such pixel correlations from the teacher to the student helps the student mimic the teacher's contextual dependencies more closely and thus improves segmentation accuracy. To validate the effectiveness of the proposed approach, extensive experiments have been conducted on three widely adopted benchmarks: Cityscapes, CamVid, and Pascal VOC 2012. Experimental results show that the proposed approach consistently improves state-of-the-art compact segmentation methods.
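For concreteness, the two distillation terms described above can be sketched in PyTorch as follows. This is a minimal illustration under stated assumptions, not the authors' exact formulation: the local structure term uses an 11x11 window and the standard SSIM stability constants, the pixel-correlation term uses cosine similarities of channel-normalized features, and teacher and student maps are assumed to share the same spatial size.

    # Minimal sketch of the two distillation terms described in the abstract.
    # Window size, stability constants, and the use of cosine similarity are
    # illustrative assumptions, not the authors' exact formulation.
    import torch
    import torch.nn.functional as F

    def local_structure_consistency(student_logits, teacher_logits,
                                    patch=11, C1=0.01 ** 2, C2=0.03 ** 2):
        """Patch-level (SSIM-style) consistency between teacher and student score maps."""
        s = torch.softmax(student_logits, dim=1)
        t = torch.softmax(teacher_logits, dim=1)
        pad = patch // 2  # keep spatial size; borders are zero-padded
        mu_s = F.avg_pool2d(s, patch, stride=1, padding=pad)
        mu_t = F.avg_pool2d(t, patch, stride=1, padding=pad)
        var_s = F.avg_pool2d(s * s, patch, stride=1, padding=pad) - mu_s ** 2
        var_t = F.avg_pool2d(t * t, patch, stride=1, padding=pad) - mu_t ** 2
        cov = F.avg_pool2d(s * t, patch, stride=1, padding=pad) - mu_s * mu_t
        ssim = ((2 * mu_s * mu_t + C1) * (2 * cov + C2)) / (
            (mu_s ** 2 + mu_t ** 2 + C1) * (var_s + var_t + C2))
        return (1.0 - ssim).mean()  # 0 when local structures match exactly

    def pixel_correlation_distillation(student_feat, teacher_feat):
        """Match pairwise pixel-correlation matrices of teacher and student feature maps."""
        def corr(feat):
            f = F.normalize(feat.flatten(2), dim=1)   # (B, C, HW), unit norm per pixel
            return torch.bmm(f.transpose(1, 2), f)    # (B, HW, HW) pixel-pair similarities
        return F.mse_loss(corr(student_feat), corr(teacher_feat))

In training, these terms would typically be added, with small weights, to the usual cross-entropy loss on the student; the weights and the feature layer used for the correlation term are design choices the abstract does not fix.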


Notes

  1. FLOPs are calculated with the PyTorch implementation (see the sketch below).
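The paper does not name the specific counting tool or input resolution; as one possible way to reproduce this kind of count in PyTorch, the third-party thop package can be applied to a segmentation network (the model and input size below are placeholders, not the networks used in the paper).

    # Hypothetical sketch: counting MACs/params with the third-party `thop` package
    # (pip install thop). The paper does not specify its counting tool, network,
    # or input resolution; the torchvision DeepLabV3 model is only a stand-in.
    import torch
    from thop import profile
    from torchvision.models.segmentation import deeplabv3_resnet50

    model = deeplabv3_resnet50().eval()       # placeholder for the actual networks
    dummy = torch.randn(1, 3, 512, 1024)      # assumed Cityscapes-style input
    macs, params = profile(model, inputs=(dummy,))
    print(f"MACs: {macs / 1e9:.1f} G | params: {params / 1e6:.1f} M")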


Acknowledgements

The authors would like to thank Dingding Chen, Yupin Yang, Ganghong Huang, and Yafei Qi for their help with the code and discussions. This research was partially supported by the National Natural Science Foundation of China (62176029 and 61876026), the National Key Research and Development Program of China (2017YFB1402400 and 2017YFB1402401), and the Key Research Program of Chongqing Science and Technology Bureau (cstc2020jscx-msxmX0149, cstc2019jscx-mbdxX0012, and cstc2019jscx-fxyd0142).

Author information


Corresponding author

Correspondence to Jiang Zhong.

Additional information

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Wang, C., Zhong, J., Dai, Q. et al. Local structure consistency and pixel-correlation distillation for compact semantic segmentation. Appl Intell 53, 6307–6323 (2023). https://doi.org/10.1007/s10489-022-03656-4

