Abstract
Vehicle detection in video is a valuable but challenging technology in traffic monitoring. Owing to its real-time speed, the Single Shot MultiBox Detector (SSD) is often used to detect vehicles in images; however, its accuracy degradation is a significant problem in video vehicle detection. To address this problem in real time, this paper enhances detection performance by improving the SSD and exploiting the relationships among inter-frame detections. We propose a feature-fused SSD detector and a Tracking-guided Detections Optimizing (TDO) strategy for fast and effective video vehicle detection. We introduce a lightweight feature-fusion sub-network into the standard SSD network, which aggregates deeper-layer features into shallower-layer features to enrich the semantic information of the shallower layers. At the post-processing stage of the feature-fused SSD, non-maximum suppression (NMS) is replaced by the TDO strategy, which links vehicles across frames with a fast tracking algorithm. Missed detections can thus be compensated by the propagated results, and the confidence of the final results can be optimized in the temporal domain. Our approach significantly improves the temporal consistency of the detection results with low computational complexity. We evaluate the proposed method on two datasets. Experiments on our labeled highway dataset show that the mean average precision (mAP) of our method is 8.2% higher than that of the base detector. The feature-fused SSD runs at 27.1 frames per second (fps), which is suitable for real-time detection. Experiments on the ImageNet VID dataset show that the proposed method is also comparable with state-of-the-art detectors.
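The two ideas in the abstract can be illustrated concretely. The sketch below is not the authors' implementation: the real fusion sub-network is learned (convolutions before merging), and the real TDO strategy uses a fast tracker to propagate boxes. Here, a plain upsample-and-add stands in for feature fusion, and an IoU-based match between current detections and boxes propagated from the previous frame stands in for the tracking-guided linking; all function names (`fuse`, `tdo_merge`) are illustrative.

```python
import numpy as np

def upsample2x(fmap):
    # Nearest-neighbor 2x upsampling of a (C, H, W) feature map.
    return fmap.repeat(2, axis=1).repeat(2, axis=2)

def fuse(shallow, deep):
    # Upsample the deeper (more semantic) map and add it to the shallower
    # one. The paper uses a learned sub-network; this is a stand-in.
    return shallow + upsample2x(deep)

def iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def tdo_merge(detections, propagated, iou_thr=0.5):
    # detections / propagated: lists of (box, score). A propagated box
    # that overlaps a current detection refines its confidence by
    # temporal averaging; one with no match is kept as-is, compensating
    # a missed detection in the current frame.
    merged = []
    unmatched = list(propagated)
    for box, score in detections:
        match = next((p for p in unmatched if iou(box, p[0]) >= iou_thr), None)
        if match is not None:
            unmatched.remove(match)
            merged.append((box, (score + match[1]) / 2.0))
        else:
            merged.append((box, score))
    merged.extend(unmatched)  # propagated boxes with no current match
    return merged
```

For example, a current detection scored 0.9 that overlaps a propagated box scored 0.7 ends up with confidence 0.8, while a propagated box with no overlapping detection survives into the output, which is how missed detections get compensated.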
Acknowledgements
This research was funded by the National Natural Science Foundation of China (62072053), the Fundamental Research Funds for the Central Universities (300102249317), the Natural Science Foundation of Shaanxi Province (2019SF-258), and the Key R&D Project of the Shaanxi Science and Technology Department (2019YFB1600500).
Yang, Y., Song, H., Sun, S. et al. A fast and effective video vehicle detection method leveraging feature fusion and proposal temporal link. J Real-Time Image Proc 18, 1261–1274 (2021). https://doi.org/10.1007/s11554-021-01121-y