Very Fast Semantic Image Segmentation Using Hierarchical Dilation and Feature Refining

Cognitive Computation

Abstract

With the rapid development of deep learning techniques, semantic image segmentation, which is viewed as a key problem of scene understanding in computer vision, has improved considerably in recent years. These advances build upon the capability of complex deep neural network architectures. In this paper, we present a novel deep neural network architecture designed for semantic image segmentation. To improve segmentation accuracy, we introduce a novel hierarchical dilation block that effectively enlarges the receptive field and enables multi-scale processing in a fully convolutional neural network. Moreover, we exploit bypass and intermediate supervision techniques to capture context information while upsampling and refining coarse features. We have conducted extensive experiments on several popular semantic segmentation testbeds, including the Cityscapes, CamVid, KITTI, and Helen facial datasets. The experimental results demonstrate that our proposed approach runs two times faster than the state-of-the-art method. Our full system achieves real-time inference on 1080p images using a PC with a single GPU. In our experiment, it executes a network forward pass at 200 fps while retaining high accuracy. Our proposed approach not only runs faster than existing real-time methods but also performs on par with them in accuracy.
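As a rough illustration of why dilation enlarges the receptive field, the sketch below applies the standard receptive-field arithmetic for stacked stride-1 convolutions, and converts the reported 200 fps into a per-frame time budget. It is not the authors' hierarchical dilation block, whose layout is not given in this preview; the function names and the dilation rates (1, 2, 4, 8) are assumptions chosen only for illustration.

```python
def receptive_field(layers):
    """Receptive field, in input pixels along one axis, of stacked
    stride-1 convolutions. `layers` is a list of (kernel_size, dilation)."""
    rf = 1
    for kernel_size, dilation in layers:
        # Each layer widens the field by (kernel_size - 1) * dilation pixels.
        rf += (kernel_size - 1) * dilation
    return rf


def frame_budget_ms(fps):
    """Time available per frame, in milliseconds, at a given frame rate."""
    return 1000.0 / fps


# Four plain 3x3 convolutions versus four 3x3 convolutions with
# growing dilation. The rates are hypothetical, not taken from the paper.
plain = [(3, 1)] * 4
dilated = [(3, 1), (3, 2), (3, 4), (3, 8)]

print(receptive_field(plain))    # 9
print(receptive_field(dilated))  # 31
print(frame_budget_ms(200))      # 5.0
```

With the same number of layers and weights, the dilated stack covers 31 pixels instead of 9, and a forward pass at 200 fps leaves 5 ms per frame.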

Notes

  1. https://pjreddie.com/darknet/tiny-darknet/

  2. https://github.com/TimoSaemann/ENet

  3. https://www.cityscapes-dataset.com/submit/

Acknowledgments

This work is supported by the National Key Research and Development Program of China (No. 2016YFB1001501).

Author information

Corresponding author

Correspondence to Jianke Zhu.

Ethics declarations

Conflict of Interests

Jianke Zhu has received research grants from Alibaba Group.

Ethical Approval

This article does not contain any studies with human participants performed by any of the authors.

Informed Consent

Informed consent was obtained from all individual participants included in the study.

About this article

Cite this article

Ning, Q., Zhu, J. & Chen, C. Very Fast Semantic Image Segmentation Using Hierarchical Dilation and Feature Refining. Cogn Comput 10, 62–72 (2018). https://doi.org/10.1007/s12559-017-9530-0

  • DOI: https://doi.org/10.1007/s12559-017-9530-0
