Elsevier

Neurocomputing

Volume 447, 4 August 2021, Pages 80-91

Multi-object tracking with hard-soft attention network and group-based cost minimization

https://doi.org/10.1016/j.neucom.2021.02.084

Abstract

Multi-object tracking (MOT) has received sustained attention from researchers with the development of deep learning and person re-identification (ReID). However, tracking failure caused by occlusion is still far from solved. In this paper, we propose a Hard-Soft Attention Network (HSAN) to improve ReID performance and obtain robust appearance features for different targets. Pose information and an attention mechanism are combined to distinguish between challenging targets. In addition, unary and binary costs are constructed to ensure consistent, long-term tracking; they consider not only the appearance-motion affinity of single tracks but also the interactions between neighboring tracks. To this end, we cluster the tracks into groups and choose reliable tracks as anchors to establish the two types of costs. Our HSAN appearance model is evaluated on the Market-1501, DUKE and CUHK03 ReID datasets, and the tracking method is evaluated on MOTChallenge 15, 16 and 17. The experimental results demonstrate that our method can improve tracking accuracy and reduce fragments.

Introduction

Multi-object tracking (MOT) is a challenging computer vision task that aims to obtain the motion tracks of different targets, as shown in Fig. 1a. With the development of deep neural networks, using deep features to compute appearance and motion similarities has increased the ability to discriminate between candidates. Pedestrians are an important class of objects, and pedestrian tracking has many applications, such as intelligent visual surveillance and autonomous driving. Although pedestrian tracking has been studied in depth, its performance still needs to be improved because of complex pedestrian trajectories and frequent occlusions among different persons.

Compared with other vision tasks, MOT covers a wide range of topics and uses more indicators to measure tracking performance. Several factors may cause tracking failure. First, detection error is the main reason, and it commonly arises in the following situations: the pedestrian bounding boxes are not detected, especially when the target is too far from or too close to the camera, the camera shakes, or the illumination in the scene is insufficient; multiple bounding boxes are generated for a single pedestrian, which introduces considerable noise into the tracking procedure; or occlusion occurs among different targets, so that the blocked pedestrian cannot be detected. Second, existing tracking algorithms are not robust enough to deal with the complexity of videos. For example, when pedestrians occlude each other because their tracks overlap for a long time, the appearance and motion similarities of the occluded pairs are high, and the IDs of the blocked tracks may switch. Moreover, when a target disappears and then re-enters the field of view, the algorithm may recognize it as a new target. All of these effects make the generated tracks unstable and can even cause fragmentation.

In order to focus on the tracking algorithm rather than on detection performance, we use the public detections of the open-source MOTChallenge dataset as the initial input for our tracking experiments. Given the detection results of each frame, mutual occlusion among targets is an important cause of tracking interruption and failure. Therefore, this paper tries to address the problems of missing detections and correlation errors when tracking under occlusion.

In this paper, we focus on online multi-target tracking, which is more widely applicable than offline tracking. In the online setting, only the detection results of past frames and the current frame are available, rather than the whole video. Therefore, when occlusion occurs, we cannot use location information from future frames to infer the locations of the occluded objects.

The tracking-by-detection paradigm [1], [2], [3] has attracted much attention from researchers. Object features can be roughly divided into two types: appearance features and motion features. Appearance features mainly serve to distinguish different targets in the same frame and to reconnect targets when they return to the field of view after disappearing for a while. Notably, person re-identification (ReID) is often introduced into MOT feature extraction to distinguish the appearances of pedestrians. Generally speaking, ReID aims at finding the same person among a large number of people across multiple cameras, as shown in Fig. 1b. Compared with MOT datasets, ReID data contain challenging candidates such as similar-looking targets, occluded objects, pedestrians of various sizes and multi-view candidates. To address these challenges, some researchers use semantic information such as human pose, but redundant background information and noise remain. Many researchers also apply attention mechanisms to reduce these negative effects, but such methods are unable to exploit pixel-level information and find the vital regions of human body parts. To better extract appearance features and improve the discrimination of tracked targets, we use the pedestrian's pose information as a mask and use an attention mechanism to select more discriminative pixels. Thus, we propose a hard-soft attention based ReID model to improve the identification of different targets on both ReID and MOT datasets.
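As a rough illustration of combining a pose-derived mask (hard attention) with per-pixel weights (soft attention), the sketch below applies both to a tiny feature map. It is not the HSAN architecture: the function name `hard_soft_attention`, the use of a sigmoid for the soft weights, and all numeric values are assumptions made only for this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_soft_attention(features, pose_mask, attention_logits):
    """Weight each feature by a binary pose mask (hard) and a sigmoid score (soft).

    All three arguments are 2-D lists of the same shape (hypothetical layout).
    """
    weighted = []
    for f_row, m_row, a_row in zip(features, pose_mask, attention_logits):
        weighted.append(
            [f * m * sigmoid(a) for f, m, a in zip(f_row, m_row, a_row)]
        )
    return weighted


# Toy 2x2 feature map; the mask entry 0 marks a pixel treated as background.
features = [[1.0, 2.0], [3.0, 4.0]]
pose_mask = [[1, 0], [1, 1]]
attention_logits = [[0.0, 5.0], [2.0, -2.0]]

weighted = hard_soft_attention(features, pose_mask, attention_logits)
```

In this toy case the masked pixel contributes nothing regardless of its soft score, while the remaining pixels are scaled by their soft weights.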

Moreover, unlike in detection and ReID tasks, MOT trajectories are influenced not only by the targets' own intentions but also by neighboring objects. Considering only the movement of a single object is not sufficient for predicting the locations of targets in crowds. Therefore, we use the existing motion and appearance features, as well as spatio-temporal sequence information, to construct the energy terms for the grouped targets. The final tracking result is obtained by minimizing the overall energy function.
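The idea of minimizing an energy with per-track and pairwise terms can be sketched on a toy problem. The exact energy terms of the paper are not given in this excerpt, so everything below is an assumption: the unary term is a given cost table, the binary term penalizes changes in the relative offset between two neighboring tracks, and the minimum is found by brute force over assignments rather than by the authors' optimizer.

```python
from itertools import permutations


def total_energy(assignment, unary, prev_pos, det_pos, neighbors, weight):
    """Unary costs plus weighted binary costs over neighboring track pairs."""
    energy = sum(unary[t][d] for t, d in enumerate(assignment))
    for a, b in neighbors:
        # Hypothetical binary term: how much the offset between two
        # neighboring tracks changes under this assignment.
        old = (prev_pos[b][0] - prev_pos[a][0], prev_pos[b][1] - prev_pos[a][1])
        pa, pb = det_pos[assignment[a]], det_pos[assignment[b]]
        new = (pb[0] - pa[0], pb[1] - pa[1])
        energy += weight * (abs(new[0] - old[0]) + abs(new[1] - old[1]))
    return energy


def minimize_energy(unary, prev_pos, det_pos, neighbors, weight):
    """Brute-force search; only feasible for a handful of tracks."""
    candidates = permutations(range(len(det_pos)), len(unary))
    return min(
        candidates,
        key=lambda a: total_energy(a, unary, prev_pos, det_pos, neighbors, weight),
    )


# Two neighboring tracks whose unary costs are ambiguous (e.g. under occlusion).
unary = [[0.5, 0.4], [0.5, 0.5]]          # unary[track][detection]
prev_pos = [(0.0, 0.0), (10.0, 0.0)]      # last known track positions
det_pos = [(1.0, 0.0), (11.0, 0.0)]       # current detections
neighbors = [(0, 1)]

unary_only = minimize_energy(unary, prev_pos, det_pos, neighbors, weight=0.0)
with_binary = minimize_energy(unary, prev_pos, det_pos, neighbors, weight=0.1)
```

With the binary weight set to zero the search picks the identity-swapping assignment because its unary cost is marginally lower; with the pairwise term enabled, the assignment that preserves the spatial arrangement of the two tracks wins.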

In this work, we aim to extract more accurate and robust appearance-motion features to improve the long-term stability of multi-object tracking. The contributions made in this paper are threefold.

  • First, we propose an appearance feature extraction method that efficiently distinguishes different people in challenging videos, comprising a pose-guided hard attention (PHA) module and a regional soft attention (RSA) module. In the PHA module, pedestrian keypoints generated through pose estimation are used to enhance the foreground information and calibrate poor detections, while RSA is used in both the global and local branches to weaken the background information.

  • Second, we propose a method to improve movement prediction and correlation, consisting of a grouping step, a prediction step, and an optimization step. In these steps, we use DBSCAN (a density-based spatial clustering algorithm) to group pedestrians and propose a confidence-based method to recognize reliable tracks. The predictions of unreliable tracks are then refined. Based on a binned distance, we construct the unary and binary energy terms for the correlation problem.

  • Third, extensive experiments are conducted on three ReID datasets (Market-1501, DUKE, and CUHK03) as well as on MOT datasets (MOT15, MOT16, and MOT17) to validate our proposed method. According to the experimental results, our method reduces the number of fragments and mostly lost tracks, indicating the effectiveness of the hard-soft attention network and group-based cost minimization for reliable multi-object tracking.
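The grouping step named in the second contribution relies on DBSCAN. A minimal pure-Python sketch of that clustering algorithm, applied to hypothetical pedestrian centre points, is given below. The `eps` and `min_pts` values and the sample coordinates are assumptions for illustration; the excerpt does not state which features or parameters the authors cluster on.

```python
import math


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise.

    A point is a core point if at least min_pts points (itself included)
    lie within distance eps of it.
    """
    noise = -1
    labels = [None] * len(points)  # None means "not visited yet"

    def region(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(i)
        if len(neighbors) < min_pts:
            labels[i] = noise
            continue
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster  # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is a core point: keep expanding
        cluster += 1
    return labels


# Hypothetical bounding-box centres: two small groups and one isolated person.
centres = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (30, 30)]
groups = dbscan(centres, eps=1.5, min_pts=2)
```

Here the first three points form one group, the next two form another, and the isolated point is labelled as noise, which in a tracking setting would correspond to a pedestrian walking alone.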

The rest of this paper is organized as follows. Section 2 discusses related work on ReID and group-based tracking. The proposed hard-soft attention appearance model and the unary-binary cost tracking correlation method are introduced in Section 3. Section 4 presents the experimental results on the ReID and MOTChallenge datasets. Finally, the conclusion is drawn in Section 5.

Section snippets

ReID appearance features

To obtain a more discriminative appearance model and distinguish different targets accurately, researchers often introduce ReID models into the tracking field. For ReID, there are two mainstream approaches: representation learning and discriminative distance metric learning. Representation learning obtains candidate characteristics under different cameras. Discriminative distance metric learning aims to maximize the matching accuracy by learning a distance metric.
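To make the metric learning idea concrete, the sketch below computes a triplet loss, a widely used objective for learning distance metrics. The excerpt does not say which loss the cited methods or the authors use, so the choice of triplet loss, the margin of 0.3, and the two-dimensional embeddings are assumptions for illustration.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two embedding vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def triplet_loss(anchor, positive, negative, margin=0.3):
    """Zero when the negative is farther than the positive by at least margin."""
    return max(0.0, euclidean(anchor, positive) - euclidean(anchor, negative) + margin)


anchor = (0.0, 0.0)         # embedding of a query image
positive = (0.1, 0.0)       # same identity, another camera
easy_negative = (1.0, 0.0)  # different identity, far away
hard_negative = (0.2, 0.0)  # different identity, similar appearance

easy = triplet_loss(anchor, positive, easy_negative)
hard = triplet_loss(anchor, positive, hard_negative)
```

The easy triplet already satisfies the margin and yields no loss, while the hard triplet yields a positive loss that would push the similar-looking negative away during training.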

Proposed method

Our overall algorithm is shown in Fig. 2. Compared with standard tracking-by-detection methods, we break it down into two major steps. First, we obtain roughly accurate tracking results through preliminary associations, as is commonly done. Second, to handle occlusion, we further adjust the results based on our unary-binary energy terms to improve the plausibility and stability of tracking. We expect to consider the single trajectory as well as the influence of neighbor nodes on

Experimental results

In this section, we present the parameter settings on the ReID and MOT datasets in detail. We also compare our results with those of other competitive methods and discuss ablation experiments.

Conclusion

In this paper, the HSAN model is proposed to achieve competitive classification accuracy for different targets’ appearances by combining the hard and soft attention. The pose-guided hard attention module enhances the foreground information while the regional soft attention module reduces the background noise. Besides, we obtain the preliminary tracking result by using the Hungarian algorithm based on appearance-motion similarity, and group tracks and detections according to the matching of

CRediT authorship contribution statement

Yating Liu: Conceptualization, Methodology, Software, Validation, Investigation, Writing - original draft, Writing - review & editing. Xuesong Li: Methodology, Software, Investigation. Tianxiang Bai: Methodology, Investigation, Writing - original draft. Kunfeng Wang: Conceptualization, Investigation, Resources, Writing - review & editing, Project administration, Funding acquisition. Fei-Yue Wang: Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the Intel Collaborative Research Institute for Intelligent and Automated Connected Vehicles (ICRI-IACV), and in part by the National Natural Science Foundation of China (62076020, U1811463).

References (56)

  • H. Wu et al., Instance-aware representation learning and association for online multi-person tracking, Pattern Recognition (2019)
  • J. Peng et al., TPM: Multiple object tracking with tracklet-plane matching, Pattern Recognition (2020)
  • Z. Kalal et al., Tracking-learning-detection, IEEE Transactions on Pattern Analysis and Machine Intelligence (2011)
  • P. Voigtlaender et al., MOTS: Multi-object tracking and segmentation
  • P. Bergmann et al., Tracking without bells and whistles
  • Y. Sun et al., Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)
  • G. Wang et al., Learning discriminative features with multiple granularities for person re-identification
  • J. Xu et al., Attention-aware compositional network for person re-identification
  • L. Wei et al., GLAD: Global-local-alignment descriptor for pedestrian retrieval
  • Z. Zheng et al., Pedestrian alignment network for large-scale person re-identification, IEEE Transactions on Circuits and Systems for Video Technology (2018)
  • X. Zhang, H. Luo, X. Fan, W. Xiang, Y. Sun, Q. Xiao, W. Jiang, C. Zhang, J. Sun, AlignedReID: Surpassing human-level...
  • G. Wang et al., Exploit the connectivity: Multi-object tracking with TrackletNet
  • N. Jiang et al., Online inter-camera trajectory association exploiting person re-identification and camera topology
  • L. Feng et al., Tracking people by evolving social groups: An approach with social network perspective
  • Q. Wang et al., A probabilistic framework for tracking the formation and evolution of multi-vehicle groups in public traffic in the presence of observation uncertainties, IEEE Transactions on Intelligent Transportation Systems (2018)
  • Y.-M. Song et al., Online multi-object tracking with GMPHD filter and occlusion group management, IEEE Access (2019)
  • Y. Yuan, Y. Lu, Q. Wang, Tracking as a whole: Multi-target tracking by modeling group behavior with sequential...
  • L. Zhang et al., Structure preserving object tracking
  • X. Yan, A. Cheriyadat, S.K. Shah, Hierarchical group structures in multi-person tracking, in: 2014 22nd International...
  • V. Chari et al., On pairwise costs for network flow multi-object tracking
  • L. Lan et al., Interacting tracklets for multi-object tracking, IEEE Transactions on Image Processing (2018)
  • S. Schulter et al., Deep network flow for multi-object tracking
  • H. Zhou et al., Deep continuous conditional random fields with asymmetric inter-object constraints for online multi-object tracking, IEEE Transactions on Circuits and Systems for Video Technology (2018)
  • A. Dehghan et al., Binary quadratic programing for online tracking of hundreds of people in extremely crowded scenes, IEEE Transactions on Pattern Analysis and Machine Intelligence (2018)
  • B. Yang, R. Nevatia, Multi-target tracking by online learning a CRF model of appearance and motion patterns,...
  • M. Ester, H.-P. Kriegel, J. Sander, X. Xu, et al., A density-based algorithm for discovering clusters in large spatial...
  • X. Li, Y. Liu, K. Wang, Y. Yan, F.-Y. Wang, A hybrid of hard and soft attention for person re-identification, in: 2019...
  • K. He et al., Deep residual learning for image recognition

Yating Liu received her B.Eng. degree from the Civil Aviation University of China in 2014. She is currently a Ph.D. student at the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, and at the University of Chinese Academy of Sciences. Her research interests include visual object tracking, machine learning, and intelligent transportation systems.

Xuesong Li received his Master's degree in control theory and control engineering from the University of Chinese Academy of Sciences, Beijing, China, in 2020. He is now an engineer at the Key Laboratory of Information System Engineering, Nanjing 210007, Jiangsu, China. His research interests include visual object tracking, image processing, and machine learning.

Tianxiang Bai received the B.S. degree from Zhejiang University in 2013. He is currently pursuing the Ph.D. degree with the State Key Laboratory for Management and Control of Complex Systems, Institute of Automation, Chinese Academy of Sciences, and the University of Chinese Academy of Sciences. His research interests include robotics, reinforcement learning, and unmanned aerial vehicles.

Kunfeng Wang received his Ph.D. in control theory and control engineering from the Graduate University of Chinese Academy of Sciences, Beijing, China, in 2008. After that, he joined the Institute of Automation, Chinese Academy of Sciences, and became an Associate Professor at the State Key Laboratory for Management and Control of Complex Systems. From December 2015 to January 2017, he was a Visiting Scholar at the School of Interactive Computing, Georgia Institute of Technology, Atlanta, USA. In August 2019, he moved to Beijing University of Chemical Technology as a Professor at the College of Information Science and Technology. His research interests include computer vision, machine learning, and intelligent unmanned systems.

Fei-Yue Wang (S'87–M'89–SM'94–F'03) received his Ph.D. degree in computer and systems engineering from the Rensselaer Polytechnic Institute, Troy, NY, USA, in 1990. He joined The University of Arizona in 1990 and became a Professor and the Director of the Robotics and Automation Laboratory and the Program in Advanced Research for Complex Systems. In 1999, he founded the Intelligent Control and Systems Engineering Center at the Institute of Automation, Chinese Academy of Sciences (CAS), Beijing, China, under the support of the Outstanding Chinese Talents Program from the State Planning Council, and in 2002 he was appointed as the Director of the Key Laboratory of Complex Systems and Intelligence Science, CAS. In 2011, he became the State Specially Appointed Expert and the Director of the State Key Laboratory for Management and Control of Complex Systems.

His current research focuses on methods and applications for parallel intelligence, social computing, and knowledge automation. He is a fellow of INCOSE, IFAC, ASME, and AAAS. In 2007, he received the National Prize in Natural Sciences of China and became an Outstanding Scientist of ACM for his work in intelligent control and social computing. He received the IEEE ITS Outstanding Application and Research Awards in 2009 and 2011, respectively. In 2014, he received the IEEE SMC Society Norbert Wiener Award. Since 1997, he has been serving as the General or Program Chair of over 30 IEEE, INFORMS, IFAC, ACM, and ASME conferences. He was the President of the IEEE ITS Society from 2005 to 2007, the Chinese Association for Science and Technology, USA, in 2005, the American Zhu Kezhen Education Foundation from 2007 to 2008, the Vice President of the ACM China Council from 2010 to 2011, and the Vice President and the Secretary General of the Chinese Association of Automation from 2008 to 2018. He was the Founding Editor-in-Chief (EiC) of the International Journal of Intelligent Control and Systems from 1995 to 2000, the IEEE ITS Magazine from 2006 to 2007, the IEEE/CAA Journal of Automatica Sinica from 2014 to 2017, and China's Journal of Command and Control from 2015 to 2020. He was the EiC of IEEE Intelligent Systems from 2009 to 2012 and of the IEEE Transactions on Intelligent Transportation Systems from 2009 to 2016; he has been the EiC of the IEEE Transactions on Computational Social Systems since 2017 and the Founding EiC of China's Journal of Intelligent Science and Technology since 2019. Currently, he is the President of CAA's Supervision Council, IEEE Council on RFID, and Vice President of IEEE Systems, Man, and Cybernetics Society.