CrackViT: a unified CNN-transformer model for pixel-level crack extraction

Quan, Jianing; Ge, Baozhen; Wang, Min

doi:10.1007/s00521-023-08277-7

Original Article
Published: 31 January 2023

Volume 35, pages 10957–10973, (2023)

Neural Computing and Applications

Jianing Quan^1,2,
Baozhen Ge^1,2 &
Min Wang^1,2

Abstract

Pixel-level crack extraction (PCE) is challenging because of complex crack topology, irregular edges, low contrast, and cluttered backgrounds. Recently, Transformer architectures have shown great potential on many vision tasks and have even outperformed convolutional neural networks (CNNs). Thanks to the self-attention mechanism, Transformers can capture global context information and establish long-range dependencies across the detected objects. However, there has been little work on Transformer architectures for PCE. This paper presents, for the first time, a systematic analysis of three well-designed Transformer architectures for the PCE task in terms of network structure and parameters, feature fusion modes, training data and strategy, and generalization ability. We propose a Crack extraction network with Vision Transformer (CrackViT), whose novel hybrid CNN-Transformer encoder jointly captures detailed structures and long-distance dependencies so that crack topology is preserved. To better suit the PCE task, we explore three feature fusion modes between the CNN and the Transformer. In addition, a novel feature aggregation block is proposed to sharpen edges during decoder upsampling and to reduce the noise introduced by shallow features. Moreover, a multi-task supervised training strategy is adopted to further improve the details of crack edges. Results on four challenging datasets (CrackForest, DeepCrack, CRKWH100, and CRACK500) show that CrackViT outperforms state-of-the-art CNN-based methods and the other two Transformer architectures. Our code is available at: https://github.com/SmilQe/CrackViT.
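
To illustrate the abstract's point that self-attention links distant image regions, the following is a minimal sketch of scaled dot-product self-attention in plain Python. It is not the authors' CrackViT implementation: the identity query/key/value projections, the function names, and the toy patch features are all assumptions made for illustration only.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    tokens: list of N feature vectors (lists of floats), one per image patch.
    Returns (outputs, weights), where weights[i][j] is how strongly
    patch i attends to patch j, regardless of their spatial distance.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        # Every query is compared against every key: this is the global context.
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[c] for w, v in zip(row, tokens)) for c in range(d)]
        for row in weights
    ]
    return outputs, weights

# Hypothetical toy input: patches 0 and 3 are far apart in the sequence
# but share similar ("crack-like") features; patches 1 and 2 are background.
tokens = [[4.0, 0.0], [0.0, 1.0], [0.0, 1.0], [4.0, 0.0]]
outputs, weights = self_attention(tokens)
print([round(w, 3) for w in weights[0]])
```

In this toy case patch 0 attends to the distant patch 3 as strongly as to itself, while a convolution with a small kernel would only see its immediate neighbours. A real Transformer layer would add learned projections, multiple heads, and positional encodings.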

Data availability

All data generated or analyzed during this study have been properly cited in this published article (see Sect. 4.1 and References). If the data links are difficult to find, the data are available from the corresponding author on reasonable request.

References

Ba JL, Kiros JR, Hinton GE (2016) Layer normalization. arXiv preprint arXiv:1607.06450
Bray J, Verma B, Li X, et al (2006) A neural network based technique for automatic classification of road cracks. In: The 2006 IEEE International Joint Conference on Neural Network Proceedings, IEEE, pp 907–912
Brown TB, Mann B, Ryder N, et al (2020) Language models are few-shot learners. arXiv preprint arXiv:2005.14165
Cao H, Wang Y, Chen J, et al (2021) Swin-unet: unet-like pure transformer for medical image segmentation. arXiv preprint arXiv:2105.05537
Carion N, Massa F, Synnaeve G, et al (2020) End-to-end object detection with transformers. In: European Conference on Computer Vision, Springer, pp 213–229
Cha YJ, Choi W, Büyüköztürk O (2017) Deep learning-based crack damage detection using convolutional neural networks. Comput-Aided Civil Infrastruct Eng 32(5):361–378
Chen B, Zhang H, Li Y et al (2022) Quantify pixel-level detection of dam surface crack using deep learning. Measurement Sci Technol 33(6):065402
Chen W, Du X, Yang F, et al (2021) A simple single-scale vision transformer for object localization and instance segmentation. arXiv preprint arXiv:2112.09747
Cheng H, Shi X, Glazier C (2003) Real-time image thresholding based on sample space reduction and interpolation approach. J Comput Civil Eng 17(4):264–272
Cheng M, Zhao K, Guo X, et al (2021) Joint topology-preserving and feature-refinement network for curvilinear structure segmentation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 7147–7156
Deng J, Dong W, Socher R, et al (2009) Imagenet: a large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition, IEEE, pp 248–255
Dosovitskiy A, Beyer L, Kolesnikov A, et al (2020) An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929
Eisenbach M, Stricker R, Seichter D, et al (2017) How to get pavement distress detection ready for deep learning? a systematic approach. In: International Joint Conference on Neural Networks (IJCNN), pp 2039–2047
Fang J, Qu B, Yuan Y (2021) Distribution equalization learning mechanism for road crack detection. Neurocomputing 424:193–204
Forsyth D, Ponce J (2011) Computer vision: a modern approach. Prentice Hall
Gavilán M, Balcones D, Marcos O et al (2011) Adaptive road crack detection system by pavement classification. Sensors 11(10):9628–9657
Guo JM, Markoni H, Lee JD (2021) BARNet: boundary aware refinement network for crack detection. IEEE Trans Intell Transp Syst
Han C, Ma T, Huyan J, et al (2021a) CrackW-Net: a novel pavement crack image segmentation convolutional neural network. IEEE Trans Intell Transp Syst pp 1–10
Han K, Wang Y, Chen H, et al (2020) A survey on visual transformer. arXiv preprint arXiv:2012.12556
Han K, Xiao A, Wu E et al (2021) Transformer in transformer. Adv Neural Inf Process Syst 34:15908–15919
He K, Zhang X, Ren S, et al (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778
Hendrycks D, Gimpel K (2016) Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415
Hoang ND (2018) Detection of surface crack in building structures using image processing technique with an improved otsu method for image thresholding. Adv Civil Eng
Hong Z, Yang F, Pan H et al (2022) Highway crack segmentation from unmanned aerial vehicle images using deep learning. IEEE Geosci Remote Sens Lett 19:1–5
Huang H, Lin L, Tong R, et al (2020) UNet 3+: a full-scale connected unet for medical image segmentation. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, pp 1055–1059
Hutchinson TC, Chen Z (2006) Improved image analysis for evaluating concrete damage. J Comput Civil Eng 20(3):210–216
Huyan J, Ma T, Li W, et al (2022) Pixelwise asphalt concrete pavement crack detection via deep learning-based semantic segmentation method. Struct Control Health Monit p e2974
Kim H, Ahn E, Cho S et al (2017) Comparative analysis of image binarization methods for crack identification in concrete structures. Cement Concr Res 99:53–61
Kingma DP, Ba J (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980
König J, Jenkins MD, Mannion M et al (2021) Optimized deep encoder-decoder methods for crack segmentation. Digit Signal Process 108:102907
Lee BY, Kim YY, Yi ST et al (2013) Automated image processing technique for detecting and analysing concrete surface cracks. Struct Infrastruct Eng 9(6):567–577
Lee D, Kim J, Lee D (2019) Robust concrete crack detection using deep learning-based semantic segmentation. Int J Aeronaut Sp Sci 20(1):287–299
Li G, Xie Y, Lin L, et al (2017) Instance-level salient object segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp 2386–2395
Li Q, Zou Q, Zhang D et al (2011) FoSA: f* seed-growing approach for crack-line detection from pavement images. Image Vision Comput 29(12):861–872
Li Z, Sun Y, Zhang L, et al (2021) CTNet: context-based tandem network for semantic segmentation. IEEE Trans Pattern Anal Mach Intell
Liu H, Miao X, Mertz C, et al (2021a) CrackFormer: transformer network for fine-grained crack detection. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 3783–3792
Liu JJ, Hou Q, Cheng MM, et al (2019a) A simple pooling-based design for real-time salient object detection. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp 3917–3926
Liu N, Zhang N, Wan K, et al (2021b) Visual saliency transformer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 4722–4732
Liu Y, Yao J, Lu X et al (2019) DeepCrack: a deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 338:139–153
Liu Z, Lin Y, Cao Y, et al (2021c) Swin transformer: hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030
Long J, Shelhamer E, Darrell T (2015) Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 3431–3440
Luo Q, Ge B, Tian Q (2019) A fast adaptive crack detection algorithm based on a double-edge extraction operator of fsm. Constr Build Mater 204:244–254
Mandal V, Uong L, Adu-Gyamfi Y (2018) Automated road crack detection using deep convolutional neural networks. In: 2018 IEEE International Conference on Big Data (Big Data), IEEE, pp 5212–5215
Maninis KK, Pont-Tuset J, Arbeláez P et al (2016) Deep retinal image understanding. In: Ourselin S, Joskowicz L, Sabuncu MR et al (eds) Medical image computing and computer-assisted intervention - MICCAI 2016. Springer International Publishing, Cham, pp 140–148
Mohan A, Poobal S (2018) Crack detection using image processing: a critical review and analysis. Alex Eng J 57(2):787–798
Nair V, Hinton GE (2010) Rectified linear units improve restricted boltzmann machines. In: ICML, pp 807–814
Peng C, Yang M, Zheng Q et al (2020) A triple-thresholds pavement crack detection method leveraging random structured forest. Constr Build Mater 263:120080
Peng Z, Li Z, Zhang J, et al (2019) Few-shot image recognition with knowledge transfer. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp 441–449
Qu Z, Cao C, Liu L, et al (2021a) A deeply supervised convolutional neural network for pavement crack detection with multiscale feature fusion. IEEE Trans Neural Netw Learn Syst
Qu Z, Chen W, Wang SY, et al (2021b) A crack detection algorithm for concrete pavement based on attention mechanism and multi-features fusion. IEEE Trans Intell Transp Syst
Quan J, Ge B, Chen L (2022) Cross attention redistribution with contrastive learning for few shot object detection. Displays 72:102162
Quintana M, Torres J, Menéndez JM (2016) A simplified computer vision system for road surface inspection and maintenance. IEEE Trans Intell Transp Syst 17(3):608–619
Ren S, He K, Girshick R et al (2016) Faster r-cnn: towards real-time object detection with region proposal networks. IEEE Trans Pattern Anal Mach Intell 39(6):1137–1149
Ronneberger O, Fischer P, Brox T (2015) U-Net: convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention, Springer, pp 234–241
Sebe N, Cohen I, Garg A et al (2005) Machine learning in computer vision, vol 29. Springer, Berlin
Shi Y, Cui L, Qi Z et al (2016) Automatic road crack detection using random structured forests. IEEE Trans Intell Transp Syst 17(12):3434–3445
Simonyan K, Zisserman A (2014) Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556
Touvron H, Cord M, Douze M, et al (2021) Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning, pp 10347–10357
Valença J, Dias-da Costa D, Júlio E et al (2013) Automatic crack monitoring using photogrammetry and image processing. Measurement 46(1):433–441
Varadharajan S, Jose S, Sharma K, et al (2014) Vision for road inspection. In: IEEE Winter Conference on Applications of Computer Vision, IEEE, pp 115–122
Vaswani A, Shazeer N, Parmar N, et al (2017) Attention is all you need. Adv Neural Inf Process Syst pp 5998–6008
Wang W, Xie E, Li X, et al (2021) Pyramid vision transformer: a versatile backbone for dense prediction without convolutions. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 568–578
Wu Z, Zhang J, Zhang L et al (2022) Bi-hrnet: a road extraction framework from satellite imagery based on node heatmap and bidirectional connectivity. Remote Sens 14(7):1732
Xie E, Wang W, Yu Z, et al (2021) SegFormer: simple and efficient design for semantic segmentation with transformers. arXiv preprint arXiv:2105.15203
Yamaguchi T, Hashimoto S (2009) Practical image measurement of crack width for real concrete structure. Electron Commun Jpn 92(10):1–12
Yamaguchi T, Hashimoto S (2010) Fast crack detection method for large-size concrete surface images using percolation-based image processing. Mach Vision Appl 21(5):797–809
Yang F, Zhang L, Yu S et al (2019) Feature pyramid and hierarchical boosting network for pavement crack detection. IEEE Trans Intell Transp Syst 21(4):1525–1535
Young T, Hazarika D, Poria S et al (2018) Recent trends in deep learning based natural language processing. IEEE Comput Intell Mag 13(3):55–75
Yuan L, Chen Y, Wang T, et al (2021) Tokens-to-Token Vit: training vision transformers from scratch on imagenet. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp 558–567
Zhang L, Yang F, Zhang YD, et al (2016) Road crack detection using deep convolutional neural network. In: Image Processing (ICIP), 2016 IEEE International Conference on, IEEE, pp 3708–3712
Zhang Y, He M, Chen Z et al (2022) Bridge-net: context-involved u-net with patch-based loss weight mapping for retinal blood vessel segmentation. Expert Syst Appl 195:116526
Zhao H, Shi J, Qi X, et al (2017) Pyramid scene parsing network. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 2881–2890
Zheng S, Lu J, Zhao H, et al (2021) Rethinking semantic segmentation from a sequence-to-sequence perspective with transformers. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp 6881–6890
Zhou H, Li Z, Ning C, et al (2017) CAD: Scale invariant framework for real-time object detection. In: Proceedings of the IEEE international conference on computer vision workshops, pp 760–768
Zhou Q, Qu Z, Cao C (2021) Mixed pooling and richer attention feature fusion for crack detection. Pattern Recognit Lett 145:96–102
Zou Q, Zhang Z, Li Q et al (2018) DeepCrack: learning hierarchical convolutional features for crack detection. IEEE Trans Image Process 28(3):1498–1512

Acknowledgements

This work is supported by the National Natural Science Foundation of China (61535008).

Author information

Authors and Affiliations

School of Precision Instrument and Opto-Electronics Engineering, Tianjin University, 92 Weijin Road, Tianjin, 300072, China
Jianing Quan, Baozhen Ge & Min Wang
Key Laboratory of Opto-Electronics Information Technology of Ministry of Education, 92 Weijin Road, Tianjin, 300072, China
Jianing Quan, Baozhen Ge & Min Wang

Corresponding author

Correspondence to Baozhen Ge.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Quan, J., Ge, B. & Wang, M. CrackViT: a unified CNN-transformer model for pixel-level crack extraction. Neural Comput & Applic 35, 10957–10973 (2023). https://doi.org/10.1007/s00521-023-08277-7

Received: 28 June 2022
Accepted: 06 January 2023
Published: 31 January 2023
Issue Date: May 2023
DOI: https://doi.org/10.1007/s00521-023-08277-7
