Abstract
State-of-the-art (SoTA) detection-based tracking methods mostly perform detection and identification feature learning as separate tasks. Only a few efforts learn detection and identification features jointly. This work proposes two novel one-stage trackers by introducing implicit and explicit attention into tracking. For our tracking system based on implicit attention, we further introduce a novel fusion of feature maps that combines information from different abstraction levels. For our tracking system based on explicit attention, we introduce an additional auxiliary function. These systems outperform the SoTA tracking systems in terms of MOTP (Multi-Object Tracking Precision) and IDF1 score when evaluated on public benchmark datasets including MOT15, MOT16, and MOT17. A high MOTP score indicates precise localization of object bounding boxes, while a high IDF1 score indicates accurate identity (ID) assignment, which is crucial for surveillance and security systems. The proposed systems are therefore a good choice for event detection in surveillance feeds, because they recover both the correct ID and the precise location of each object.
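To make the two metrics named in the abstract concrete, the following is a minimal sketch of how they are commonly computed. It assumes the overlap-based MOTP convention (mean IoU over matched ground-truth/prediction pairs) and the standard IDF1 formula, 2·IDTP / (2·IDTP + IDFP + IDFN). The function names and the input layout are hypothetical, chosen for illustration only, and are not taken from the paper.

```python
from typing import Sequence, Tuple

# Hypothetical box layout for this sketch: (x1, y1, x2, y2) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def motp(matched_pairs: Sequence[Tuple[Box, Box]]) -> float:
    """Mean IoU over already-matched (ground truth, prediction) pairs.

    Assumes the matching itself was done elsewhere; higher means
    tighter bounding boxes.
    """
    if not matched_pairs:
        return 0.0
    return sum(iou(gt, pred) for gt, pred in matched_pairs) / len(matched_pairs)


def idf1(idtp: int, idfp: int, idfn: int) -> float:
    """IDF1 from identity true positives, false positives, false negatives."""
    denom = 2 * idtp + idfp + idfn
    return 2 * idtp / denom if denom else 0.0
```

Under these assumptions, a tracker with tight boxes but frequent identity switches would score well on `motp` and poorly on `idf1`, which is why the abstract reports both.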
Acknowledgments
This work was supported by the Milestone Research Programme at Aalborg University (MRPA).
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Madan, N., Nasrollahi, K., Moeslund, T.B. (2022). Attention-Enabled Object Detection to Improve One-Stage Tracker. In: Arai, K. (eds) Intelligent Systems and Applications. IntelliSys 2021. Lecture Notes in Networks and Systems, vol 295. Springer, Cham. https://doi.org/10.1007/978-3-030-82196-8_55
DOI: https://doi.org/10.1007/978-3-030-82196-8_55
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82195-1
Online ISBN: 978-3-030-82196-8
eBook Packages: Intelligent Technologies and Robotics (R0)