Abstract
Pixel-level crack extraction (PCE) is challenging because of complex topology, irregular edges, low contrast, and cluttered backgrounds. Recently, Transformer architectures have shown great potential on many vision tasks and have even outperformed convolutional neural networks (CNNs). Thanks to the self-attention mechanism, Transformers can capture global context and establish long-range dependencies across the detected objects. However, little work has examined Transformer architectures for PCE. This paper presents the first systematic analysis of three Transformer architectures for the PCE task in terms of network structure and parameters, feature fusion modes, training data and strategy, and generalization ability. We propose a Crack extraction network with Vision Transformer (CrackViT), whose novel hybrid CNN-Transformer encoder jointly captures detailed structures and long-distance dependencies so that crack topology is preserved. To better suit the PCE task, we explore three feature fusion modes between the CNN and the Transformer. In addition, a novel feature aggregation block is proposed to sharpen edges during decoder upsampling and to reduce the noise introduced by shallow features. Moreover, a multi-task supervised training strategy is adopted to further refine the details of crack edges. Results on four challenging datasets, CrackForest, DeepCrack, CRKWH100, and CRACK500, show that CrackViT outperforms state-of-the-art CNN-based methods and the two other Transformer architectures. Our code is available at: https://github.com/SmilQe/CrackViT.
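To make the global-context property of self-attention concrete, the following is a minimal sketch of single-head scaled dot-product self-attention in plain Python. It is an illustration only and not the CrackViT implementation: the function names, the identity query/key/value projections, and the toy patch features are all assumptions chosen for brevity.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: identity projections are used for queries, keys, and values,
    so the sketch has no learned parameters.

    tokens: list of n feature vectors, each a list of d floats.
    Returns (outputs, weights), where weights[i][j] is how strongly
    token i attends to token j.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Every query is compared with every key, regardless of position.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output is a weighted average of all value vectors.
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy input: four patch tokens. Patches 0 and 3 sit at opposite
# ends of the sequence but have identical features.
patches = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
outputs, weights = self_attention(patches)
```

In this toy input, patch 0 attends to the distant patch 3 as strongly as to itself and more strongly than to its neighbor patch 1, because attention weights depend on feature similarity rather than spatial distance. A convolution with a small kernel would need several stacked layers before patch 0 could be influenced by patch 3.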
Data availability
All data generated or analyzed during this study are properly cited in this published article (see Sect. 4.1 and References). If the data links are difficult to find, the data are available from the corresponding author on reasonable request.
References
Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
Bray J, Verma B, Li X, et al (2006) A neural network based technique for automatic classification of road cracks. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, IEEE, pp 907–912
Brown TB, Mann B, Ryder N, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165
Cao H, Wang Y, Chen J, et al (2021) Swin-unet: unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537
Carion N, Massa F, Synnaeve G, et al (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, Springer, pp 213–229
Cha YJ, Choi W, Büyüköztürk O (2017) Deep learning-based crack damage detection using convolutional neural networks. Comput-Aided Civil Infrastruct Eng 32(5):361–378
Chen B, Zhang H, Li Y et al (2022) Quantify pixel-level detection of dam surface crack using deep learning. Measurement Sci Technol 33(6):065402
Chen W, Du X, Yang F, et al (2021) A simple single-scale vision transformer for object localization and instance segmentation. arXiv preprint arXiv:2112.09747
Cheng H, Shi X, Glazier C (2003) Real-time image thresholding based on sample space reduction and interpolation approach. J Comput Civil Eng 17(4):264–272
Cheng M, Zhao K, Guo X, et al (2021) Joint topology-preserving and feature-refinement network for curvilinear structure segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 7147–7156
Deng J, Dong W, Socher R, et al (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
Dosovitskiy A, Beyer L, Kolesnikov A, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Eisenbach M, Stricker R, Seichter D, et al (2017) How to get pavement distress detection ready for deep learning? a systematic approach. In: International Joint Conference on Neural Networks (IJCNN), pp 2039–2047
Fang J, Qu B, Yuan Y (2021) Distribution equalization learning mechanism for road crack detection. Neurocomputing 424:193–204
Forsyth D, Ponce J (2011) Computer vision: a modern approach. Prentice Hall
Gavilán M, Balcones D, Marcos O et al (2011) Adaptive road crack detection system by pavement classification. Sensors 11(10):9628–9657
Guo JM, Markoni H, Lee JD (2021) BARNet: boundary aware refinement network for crack detection. IEEE Trans Intell Transp Syst
Han C, Ma T, Huyan J, et al (2021a) CrackW-Net: a novel pavement crack image segmentation convolutional neural network. IEEE Trans Intell Transp Syst pp 1–10
Han K, Wang Y, Chen H, et al (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556
Han K, Xiao A, Wu E et al (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908–15919
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415
Hoang ND (2018) Detection of surface crack in building structures using image processing technique with an improved otsu method for image thresholding. Adv Civil Eng
Hong Z, Yang F, Pan H et al (2022) Highway crack segmentation from unmanned aerial vehicle images using deep learning. IEEE Geosci Remote Sens Lett 19:1–5
Huang H, Lin L, Tong R, et al (2020) UNet 3+: a full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 1055–1059
Hutchinson TC, Chen Z (2006) Improved image analysis for evaluating concrete damage. J Comput Civil Eng 20(3):210–216
Huyan J, Ma T, Li W, et al (2022) Pixelwise asphalt concrete pavement crack detection via deep learning-based semantic segmentation method. Struct Control Health Monit p e2974
Kim H, Ahn E, Cho S et al (2017) Comparative analysis of image binarization methods for crack identification in concrete structures. Cement Concr Res 99:53–61
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
König J, Jenkins MD, Mannion M et al (2021) Optimized deep encoder-decoder methods for crack segmentation. Digit Signal Process 108:102907
Lee BY, Kim YY, Yi ST et al (2013) Automated image processing technique for detecting and analysing concrete surface cracks. Struct Infrastruct Eng 9(6):567–577
Lee D, Kim J, Lee D (2019) Robust concrete crack detection using deep learning-based semantic segmentation. Int J Aeronaut Sp Sci 20(1):287–299
Li G, Xie Y, Lin L, et al (2017) Instance-level salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2386–2395
Li Q, Zou Q, Zhang D et al (2011) FoSA: f* seed-growing approach for crack-line detection from pavement images. Image Vision Comput 29(12):861–872
Li Z, Sun Y, Zhang L, et al (2021) CTNet: context-based tandem network for semantic segmentation. IEEE Trans Pattern Anal Mach Intell
Liu H, Miao X, Mertz C, et al (2021a) CrackFormer: transformer network for fine-grained crack detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 3783–3792
Liu JJ, Hou Q, Cheng MM, et al (2019a) A simple pooling-based design for real-time salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3917–3926
Liu N, Zhang N, Wan K, et al (2021b) Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4722–4732
Liu Y, Yao J, Lu X et al (2019) DeepCrack: a deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 338:139–153
Liu Z, Lin Y, Cao Y, et al (2021c) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Luo Q, Ge B, Tian Q (2019) A fast adaptive crack detection algorithm based on a double-edge extraction operator of fsm. Constr Build Mater 204:244–254
Mandal V, Uong L, Adu-Gyamfi Y (2018) Automated road crack detection using deep convolutional neural networks. In: 2018 IEEE International Conference on Big Data (Big Data), IEEE, pp 5212–5215
Maninis KK, Pont-Tuset J, Arbeláez P et al (2016) Deep retinal image understanding. In: Ourselin S, Joskowicz L, Sabuncu MR et al (eds) Medical image computing and computer-assisted intervention - MICCAI 2016. Springer International Publishing, Cham, pp 140–148
Mohan A, Poobal S (2018) Crack detection using image processing: a critical review and analysis. Alex Eng J 57(2):787–798
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML, pp 807–814
Peng C, Yang M, Zheng Q et al (2020) A triple-thresholds pavement crack detection method leveraging random structured forest. Constr Build Mater 263:120080
Peng Z, Li Z, Zhang J, et al (2019) Few-shot image recognition with knowledge transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 441–449
Qu Z, Cao C, Liu L, et al (2021a) A deeply supervised convolutional neural network for pavement crack detection with multiscale feature fusion. IEEE Trans Neural Netw Learn Syst
Qu Z, Chen W, Wang SY, et al (2021b) A crack detection algorithm for concrete pavement based on attention mechanism and multi-features fusion. IEEE Trans Intell Transp Syst
Quan J, Ge B, Chen L (2022) Cross attention redistribution with contrastive learning for few shot object detection. Displays 72:102162
Quintana M, Torres J, Menéndez JM (2016) A simplified computer vision system for road surface inspection and maintenance. IEEE Trans Intell Transp Syst 17(3):608–619
Ren S, He K, Girshick R et al (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer, pp 234–241
Sebe N, Cohen I, Garg A et al (2005) Machine learning in computer vision, vol 29. Springer, Berlin
Shi Y, Cui L, Qi Z et al (2016) Automatic road crack detection using random structured forests. IEEE Trans Intell Transp Syst 17(12):3434–3445
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Touvron H, Cord M, Douze M, et al (2021) Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp 10347–10357
Valença J, Dias-da Costa D, Júlio E et al (2013) Automatic crack monitoring using photogrammetry and image processing. Measurement 46(1):433–441
Varadharajan S, Jose S, Sharma K, et al (2014) Vision for road inspection. In: IEEE Winter Conference on Applications of Computer Vision, IEEE, pp 115–122
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst pp 5998–6008
Wang W, Xie E, Li X, et al (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 568–578
Wu Z, Zhang J, Zhang L et al (2022) Bi-hrnet: a road extraction framework from satellite imagery based on node heatmap and bidirectional connectivity. Remote Sens 14(7):1732
Xie E, Wang W, Yu Z, et al (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203
Yamaguchi T, Hashimoto S (2009) Practical image measurement of crack width for real concrete structure. Electron Commun Jpn 92(10):1–12
Yamaguchi T, Hashimoto S (2010) Fast crack detection method for large-size concrete surface images using percolation-based image processing. Mach Vision Appl 21(5):797–809
Yang F, Zhang L, Yu S et al (2019) Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans Intell Transp Syst 21(4):1525–1535
Young T, Hazarika D, Poria S et al (2018) Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag 13(3):55–75
Yuan L, Chen Y, Wang T, et al (2021) Tokens-to-Token Vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 558–567
Zhang L, Yang F, Zhang YD, et al (2016) Road crack detection using deep convolutional neural network. In: Image Processing (ICIP), 2016 IEEE International Conference on, IEEE, pp 3708–3712
Zhang Y, He M, Chen Z et al (2022) Bridge-net: context-involved u-net with patch-based loss weight mapping for retinal blood vessel segmentation. Expert Syst Appl 195:116526
Zhao H, Shi J, Qi X, et al (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890
Zheng S, Lu J, Zhao H, et al (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6881–6890
Zhou H, Li Z, Ning C, et al (2017) CAD: Scale invariant framework for real-time object detection. In: Proceedings of the IEEE international conference on computer vision workshops, pp 760–768
Zhou Q, Qu Z, Cao C (2021) Mixed pooling and richer attention feature fusion for crack detection. Pattern Recognit Lett 145:96–102
Zou Q, Zhang Z, Li Q et al (2018) DeepCrack: learning hierarchical convolutional features for crack detection. IEEE Trans Image Process 28(3):1498–1512
Acknowledgements
This work is supported by the National Natural Science Foundation of China (61535008).
Ethics declarations
Conflict of interest
The authors declare that they have no conflict of interest.
About this article
Cite this article
Quan, J., Ge, B. & Wang, M. CrackViT: a unified CNN-transformer model for pixel-level crack extraction. Neural Comput & Applic 35, 10957–10973 (2023). https://doi.org/10.1007/s00521-023-08277-7
DOI: https://doi.org/10.1007/s00521-023-08277-7