CrackViT: a unified CNN-transformer model for pixel-level crack extraction

  • Original Article
  • Neural Computing and Applications

Abstract

Pixel-level crack extraction (PCE) is challenging because of complex crack topology, irregular edges, low contrast, and complex backgrounds. Transformer architectures have recently shown great potential on many vision tasks, in some cases outperforming convolutional neural networks (CNNs). Through the self-attention mechanism, Transformers capture global context and establish long-range dependencies across the detected objects. However, little work has applied Transformer architectures to PCE. This paper presents, for the first time, a systematic analysis of three well-designed Transformer architectures for the PCE task in terms of network structure and parameters, feature fusion modes, training data and strategy, and generalization ability. We propose a Crack extraction network with Vision Transformer (CrackViT), whose novel hybrid CNN-Transformer encoder jointly captures detailed structures and long-distance dependencies so that crack topology is preserved. To better suit the PCE task, we explore three feature fusion modes between the CNN and the Transformer. In addition, a novel feature aggregation block is proposed to sharpen edges during decoder upsampling and to reduce the noise introduced by shallow features. Moreover, a multi-task supervised training strategy is adopted to further improve the details of crack edges. Results on four challenging datasets (CrackForest, DeepCrack, CRKWH100, and CRACK500) show that CrackViT outperforms state-of-the-art CNN-based methods and the other two Transformer architectures. Our code is available at: https://github.com/SmilQe/CrackViT.
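
To illustrate how self-attention lets every image patch draw on every other patch, the following is a minimal sketch of single-head scaled dot-product attention in plain Python. It is not the CrackViT implementation: the identity projections (queries, keys, and values all equal to the input tokens), the function names, and the four toy patch embeddings are assumptions made only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: queries, keys, and values are the tokens themselves.
    A real Transformer layer learns three separate projection matrices.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Score the query against every token, near or far in the image.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is a weighted mix of all value vectors.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical patch embeddings: patches 0 and 3 look alike (for example,
# two distant pieces of one crack); patches 1 and 2 stand for background.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
```

In this toy input, patch 0 attends to the distant patch 3 as strongly as to itself and more strongly than to its background neighbours, which is the long-range behaviour the abstract attributes to Transformers. Because every token is scored against every other token, the cost grows quadratically with the number of patches.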

Data availability

All data generated or analyzed during this study are properly cited in this published article (see Sect. 4.1 and References). If the data links are difficult to locate, the data are available from the corresponding author on reasonable request.

References

  1. Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450

  2. Bray J, Verma B, Li X, et al (2006) A neural network based technique for automatic classification of road cracks. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, IEEE, pp 907–912

  3. Brown TB, Mann B, Ryder N, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165

  4. Cao H, Wang Y, Chen J, et al (2021) Swin-unet: unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537

  5. Carion N, Massa F, Synnaeve G, et al (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, Springer, pp 213–229

  6. Cha YJ, Choi W, Büyüköztürk O (2017) Deep learning-based crack damage detection using convolutional neural networks. Comput-Aided Civil Infrastruct Eng 32(5):361–378

  7. Chen B, Zhang H, Li Y et al (2022) Quantify pixel-level detection of dam surface crack using deep learning. Measurement Sci Technol 33(6):065402

  8. Chen W, Du X, Yang F, et al (2021) A simple single-scale vision transformer for object localization and instance segmentation. arXiv preprint arXiv:2112.09747

  9. Cheng H, Shi X, Glazier C (2003) Real-time image thresholding based on sample space reduction and interpolation approach. J Comput Civil Eng 17(4):264–272

  10. Cheng M, Zhao K, Guo X, et al (2021) Joint topology-preserving and feature-refinement network for curvilinear structure segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 7147–7156

  11. Deng J, Dong W, Socher R, et al (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255

  12. Dosovitskiy A, Beyer L, Kolesnikov A, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929

  13. Eisenbach M, Stricker R, Seichter D, et al (2017) How to get pavement distress detection ready for deep learning? a systematic approach. In: International Joint Conference on Neural Networks (IJCNN), pp 2039–2047

  14. Fang J, Qu B, Yuan Y (2021) Distribution equalization learning mechanism for road crack detection. Neurocomputing 424:193–204

  15. Forsyth D, Ponce J (2011) Computer vision: a modern approach. Prentice hall

  16. Gavilán M, Balcones D, Marcos O et al (2011) Adaptive road crack detection system by pavement classification. Sensors 11(10):9628–9657

  17. Guo JM, Markoni H, Lee JD (2021) BARNet: boundary aware refinement network for crack detection. IEEE Trans Intell Transp Syst

  18. Han C, Ma T, Huyan J, et al (2021a) CrackW-Net: a novel pavement crack image segmentation convolutional neural network. IEEE Trans Intell Transp Syst pp 1–10

  19. Han K, Wang Y, Chen H, et al (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556

  20. Han K, Xiao A, Wu E et al (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908–15919

  21. He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  22. Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415

  23. Hoang ND (2018) Detection of surface crack in building structures using image processing technique with an improved otsu method for image thresholding. Adv Civil Eng

  24. Hong Z, Yang F, Pan H et al (2022) Highway crack segmentation from unmanned aerial vehicle images using deep learning. IEEE Geosci Remote Sens Lett 19:1–5

  25. Huang H, Lin L, Tong R, et al (2020) UNet 3+: a full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 1055–1059

  26. Hutchinson TC, Chen Z (2006) Improved image analysis for evaluating concrete damage. J Comput Civil Eng 20(3):210–216

  27. Huyan J, Ma T, Li W, et al (2022) Pixelwise asphalt concrete pavement crack detection via deep learning-based semantic segmentation method. Struct Control Health Monit p e2974

  28. Kim H, Ahn E, Cho S et al (2017) Comparative analysis of image binarization methods for crack identification in concrete structures. Cement Concr Res 99:53–61

  29. Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980

  30. König J, Jenkins MD, Mannion M et al (2021) Optimized deep encoder-decoder methods for crack segmentation. Digit Signal Process 108:102907

  31. Lee BY, Kim YY, Yi ST et al (2013) Automated image processing technique for detecting and analysing concrete surface cracks. Struct Infrastruct Eng 9(6):567–577

  32. Lee D, Kim J, Lee D (2019) Robust concrete crack detection using deep learning-based semantic segmentation. Int J Aeronaut Sp Sci 20(1):287–299

  33. Li G, Xie Y, Lin L, et al (2017) Instance-level salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2386–2395

  34. Li Q, Zou Q, Zhang D et al (2011) FoSA: f* seed-growing approach for crack-line detection from pavement images. Image Vision Comput 29(12):861–872

  35. Li Z, Sun Y, Zhang L, et al (2021) CTNet: context-based tandem network for semantic segmentation. IEEE Trans Pattern Anal Mach Intell

  36. Liu H, Miao X, Mertz C, et al (2021a) CrackFormer: transformer network for fine-grained crack detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 3783–3792

  37. Liu JJ, Hou Q, Cheng MM, et al (2019a) A simple pooling-based design for real-time salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3917–3926

  38. Liu N, Zhang N, Wan K, et al (2021b) Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4722–4732

  39. Liu Y, Yao J, Lu X et al (2019) DeepCrack: a deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 338:139–153

  40. Liu Z, Lin Y, Cao Y, et al (2021c) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030

  41. Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440

  42. Luo Q, Ge B, Tian Q (2019) A fast adaptive crack detection algorithm based on a double-edge extraction operator of fsm. Constr Build Mater 204:244–254

  43. Mandal V, Uong L, Adu-Gyamfi Y (2018) Automated road crack detection using deep convolutional neural networks. In: 2018 IEEE International Conference on Big Data (Big Data), IEEE, pp 5212–5215

  44. Maninis KK, Pont-Tuset J, Arbeláez P et al (2016) Deep retinal image understanding. In: Ourselin S, Joskowicz L, Sabuncu MR et al (eds) Medical image computing and computer-assisted intervention - MICCAI 2016. Springer International Publishing, Cham, pp 140–148

  45. Mohan A, Poobal S (2018) Crack detection using image processing: a critical review and analysis. Alex Eng J 57(2):787–798

  46. Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML, pp 807–814

  47. Peng C, Yang M, Zheng Q et al (2020) A triple-thresholds pavement crack detection method leveraging random structured forest. Constr Build Mater 263:120080

  48. Peng Z, Li Z, Zhang J, et al (2019) Few-shot image recognition with knowledge transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 441–449

  49. Qu Z, Cao C, Liu L, et al (2021a) A deeply supervised convolutional neural network for pavement crack detection with multiscale feature fusion. IEEE Trans Neural Netw Learn Syst

  50. Qu Z, Chen W, Wang SY, et al (2021b) A crack detection algorithm for concrete pavement based on attention mechanism and multi-features fusion. IEEE Trans Intell Transp Syst

  51. Quan J, Ge B, Chen L (2022) Cross attention redistribution with contrastive learning for few shot object detection. Displays 72:102162

  52. Quintana M, Torres J, Menéndez JM (2016) A simplified computer vision system for road surface inspection and maintenance. IEEE Trans Intell Transp Syst 17(3):608–619

  53. Ren S, He K, Girshick R et al (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149

  54. Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer, pp 234–241

  55. Sebe N, Cohen I, Garg A et al (2005) Machine learning in computer vision, vol 29. Springer, Berlin

  56. Shi Y, Cui L, Qi Z et al (2016) Automatic road crack detection using random structured forests. IEEE Trans Intell Transp Syst 17(12):3434–3445

  57. Shi Y, Cui L, Qi Z et al (2016) Automatic road crack detection using random structured forests. IEEE Trans Intell Transp Syst 17(12):3434–3445

  58. Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556

  59. Touvron H, Cord M, Douze M, et al (2021) Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp 10347–10357

  60. Valença J, Dias-da Costa D, Júlio E et al (2013) Automatic crack monitoring using photogrammetry and image processing. Measurement 46(1):433–441

  61. Varadharajan S, Jose S, Sharma K, et al (2014) Vision for road inspection. In: IEEE Winter Conference on Applications of Computer Vision, IEEE, pp 115–122

  62. Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv neural inf process syst pp 5998–6008

  63. Wang W, Xie E, Li X, et al (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 568–578

  64. Wu Z, Zhang J, Zhang L et al (2022) Bi-hrnet: a road extraction framework from satellite imagery based on node heatmap and bidirectional connectivity. Remote Sens 14(7):1732

  65. Xie E, Wang W, Yu Z, et al (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203

  66. Yamaguchi T, Hashimoto S (2009) Practical image measurement of crack width for real concrete structure. Electron Commun Jpn 92(10):1–12

  67. Yamaguchi T, Hashimoto S (2010) Fast crack detection method for large-size concrete surface images using percolation-based image processing. Mach Vision Appl 21(5):797–809

  68. Yang F, Zhang L, Yu S et al (2019) Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans Intell Transp Syst 21(4):1525–1535

  69. Young T, Hazarika D, Poria S et al (2018) Recent trends in deep learning based natural language processing. Ieee Comput Intell Mag 13(3):55–75

  70. Yuan L, Chen Y, Wang T, et al (2021) Tokens-to-Token Vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 558–567

  71. Zhang L, Yang F, Zhang YD, et al (2016) Road crack detection using deep convolutional neural network. In: Image Processing (ICIP), 2016 IEEE International Conference on, IEEE, pp 3708–3712

  72. Zhang Y, He M, Chen Z et al (2022) Bridge-net: context-involved u-net with patch-based loss weight mapping for retinal blood vessel segmentation. Expert Syst Appl 195:116526

  73. Zhao H, Shi J, Qi X, et al (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890

  74. Zheng S, Lu J, Zhao H, et al (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6881–6890

  75. Zhou H, Li Z, Ning C, et al (2017) CAD: Scale invariant framework for real-time object detection. In: Proceedings of the IEEE international conference on computer vision workshops, pp 760–768

  76. Zhou Q, Qu Z, Cao C (2021) Mixed pooling and richer attention feature fusion for crack detection. Pattern Recognit Lett 145:96–102

  77. Zou Q, Zhang Z, Li Q et al (2018) DeepCrack: learning hierarchical convolutional features for crack detection. IEEE Trans Image Process 28(3):1498–1512


Acknowledgements

This work is supported by the National Natural Science Foundation of China (61535008).

Author information

Corresponding author

Correspondence to Baozhen Ge.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Quan, J., Ge, B. & Wang, M. CrackViT: a unified CNN-transformer model for pixel-level crack extraction. Neural Comput & Applic 35, 10957–10973 (2023). https://doi.org/10.1007/s00521-023-08277-7

  • DOI: https://doi.org/10.1007/s00521-023-08277-7
