1 Introduction

Robust detection and tracking of objects in video sequences is one of the key tasks of computer vision, and multiple object tracking (MOT) is one of its most active topics. MOT links multiple independent targets of interest in each frame with their precursors and successors through feature association, forming complete trajectories.

In recent years, with the great progress in object detection [5, 9], tracking-by-detection has become one of the mainstream approaches to MOT. The detection output in each frame is given as prior knowledge in the first step. Then, data association and global optimization methods are applied to connect the discrete detections into trajectories. One of the state-of-the-art global data association methods is based on a network flow paradigm. Zhang et al. were the first to formulate the MOT problem as finding a maximum a posteriori probability and to transform it into a min-cost network flow problem [13]. Pirsiavash et al. developed this idea further and used a dynamic programming algorithm to improve efficiency [8].

However, these methods ignore the higher-order information among adjacent frames of an image sequence. Traditional tracking methods cannot achieve robust results in complex scenes since they only extract static and superficial features. To address this, Butt and Collins added higher-order tracking smoothness constraints, such as constant velocity [3]. On the other hand, the usual handcrafted features are not efficient enough. Features extracted by deep neural networks have been successfully applied to single target tracking [11, 14]. Unlike single object tracking, MOT is concerned with different targets of the same category, such as pedestrian individuals in a surveillance application. The task is not only to connect the same target across frames, but also to distinguish it from the others. Traditional features and classification methods are not efficient for this purpose since they are usually based on the characteristics of whole categories.

In this paper, we propose a novel online feature extraction scheme appropriate for the tracking-by-detection framework of MOT. The scheme is based on a conditional random field (CRF) model whose potential functions are implemented by a number of multi-scale stacked auto-encoders. The loss function of the auto-encoders is formulated with constraints on the continuity and the distinguishability of targets. The continuity constraint also implies an inheritance relationship of the stacked auto-encoders between frames, which relates to the higher-order information of a trajectory. We apply the new features to a classical network flow method [8] for MOT and reach a better result, which demonstrates the benefit of our feature extraction scheme.

The rest of the paper is organized as follows. Section 2 describes the CRF model of the online feature extraction scheme. The multi-scale stacked auto-encoder, the new loss function and the online approach are discussed in Sect. 3. Section 4 explains how we incorporate the new features into a network flow method and why this works. Section 5 presents experiments on the performance of the obtained features and on MOT tasks. Finally, conclusions are drawn in Sect. 6.

2 Conditional Random Field Model

Our online feature extraction scheme is based on a CRF model. Target detections at different frames, as conditions, are abstracted as an observation set \(X=\{x_{1}^{1},x_{1}^{2},\cdots ,x_{k}^{j-1}, x_{k}^{j},x_{k}^{j+1},\cdots ,x_{m}^{n}\}\), where \(x_{k}^{j}\) denotes the jth detection at frame k. Correspondingly, let \(Y=\{y_{1}^{1},y_{1}^{2},\cdots ,y_{k}^{j-1},y_{k}^{j}, y_{k}^{j+1},\cdots ,y_{m}^{n}\}\) be the state set, where \(y_{k}^{j}\) is the feature expression of \(x_{k}^{j}\). The feature extraction problem for MOT can then be formulated as a maximum conditional probability problem in the CRF as follows,

Fig. 1. Conditional random field model. (Color figure online)

$$\begin{aligned} \mathop {\mathrm{argmax}}\limits _{Y,\theta }\,{P(Y|X,\theta )} \end{aligned}$$
(1)

where \(\theta \) represents all of the undetermined parameters in the model. The conditional probability is defined as a normalized product of all the potential functions based on their corresponding maximal cliques \(\{y_{k-1}^{j},y_{k}^{j}\} \) for \(k\in \{2,3,\cdots ,m\}\), or \(\{y_{k}^{j},y_{k}^{l}\} \) for \(j\ne l\), and it is given by,

$$\begin{aligned} p(Y|X,\theta )=\dfrac{1}{Z}\prod \limits _{j}\prod \limits _{k} (\psi _{1}(y_{k-1}^{j},y_{k}^{j})\prod \limits _{l\ne j}\psi _{2}(y_{k}^{j},y_{k}^{l})) \end{aligned}$$
(2)

where \(\psi _{1}(y_{k-1}^{j},y_{k}^{j})\) and \(\psi _{2}(y_{k}^{j},y_{k}^{l})\) are the potential functions of maximal cliques respectively, and Z is a normalization factor.

Figure 1 sketches two different targets and their features at frames \(t,t+1,\cdots ,t+r\). The two types of maximal cliques are displayed as red and blue edges respectively. The potential functions can be expressed in exponential form,

$$\begin{aligned} \begin{aligned} \psi _{1}(y_{k-1}^{j},y_{k}^{j})=\exp (I(y_{k}^{j},x_{k}^{j}) +\lambda S(y_{k-1}^{j},y_{k}^{j},x_{k-1}^{j},x_{k}^{j})) \end{aligned} \end{aligned}$$
(3)

and,

$$\begin{aligned} \begin{aligned} \psi _{2}(y_{k}^{j},y_{k}^{l})=\exp (I(y_{k}^{j},x_{k}^{j}) + \mu D(y_{k}^{j},y_{k}^{l},x_{k}^{j},x_{k}^{l})) \end{aligned} \end{aligned}$$
(4)

where \(I(y_{k}^{j},x_{k}^{j})\) represents the identity relationship between observations and features, \(S(y_{k-1}^{j},y_{k}^{j},x_{k-1}^{j},x_{k}^{j})\) and \(D(y_{k}^{j},y_{k}^{l},x_{k}^{j},x_{k}^{l})\) describe the sequential characteristic and distinguishable property of features respectively.

Since any nonlinear function can be approximated by a deep neural network, we utilize a multi-scale stacked auto-encoder \(f_{k}^{j}\) to implement the product of the potential functions in (2) and to obtain the optimal feature extraction.

$$\begin{aligned} \begin{aligned} \psi _{1}(y_{k-1}^{j},y_{k}^{j})\prod \limits _{l\ne j}\psi _{2}(y_{k}^{j},y_{k}^{l}) = f_{k}^{j} \end{aligned} \end{aligned}$$
(5)

3 Approach

We transform the maximum conditional probability problem in (1) into an online training process for a number of stacked auto-encoders by modeling the product of potential functions as in (5). For each detection, an auto-encoder is trained according to a new loss function we propose, which considers the three relations in (3) and (4). The online training process is initialized in the first several frames and runs iteratively on the following frames. The structure of the auto-encoder, our new loss function and the online training are presented below.

3.1 Auto-encoder Structure

Compared with many handcrafted ones, features learned by an unsupervised model, especially a deep learning model, tell more about the fundamental or inner properties of an object [6, 12]. Recent studies [10] have shown that the stacked auto-encoder, one of the most widely used unsupervised deep learning models, is capable of grasping the most generic and intrinsic features of the object of interest. In light of [14], the internal feature of each detection is extracted by a two-layer stacked auto-encoder similar to those in [11, 14].

Fig. 2. Multi-scale stacked auto-encoder structure. (Color figure online)

As shown in Fig. 2, the two-layer stacked auto-encoder consists of one input layer (red) and two feature expressing layers (blue and yellow). At the input layer, different patches from the area given by each detection bounding box are reshaped into vectors with three RGB channels. Each vector passes through the first-layer weight matrix and a nonlinear function to generate a feature response of local characteristics. Then, the local responses are cascaded together and sent into the second layer to learn the global information. A minimal sketch of this forward pass is given below.
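The following Python sketch illustrates the two-layer forward pass under stated assumptions: the patch grid, layer sizes, and the class name StackedAutoEncoder are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class StackedAutoEncoder(nn.Module):
    def __init__(self, patch_dim=3 * 16 * 16, n_patches=4,
                 local_dim=128, global_dim=256):
        super().__init__()
        # First layer: shared encoder for local RGB patches (red -> blue).
        self.enc1 = nn.Linear(patch_dim, local_dim)
        # Second layer: encodes the cascaded local responses (blue -> yellow).
        self.enc2 = nn.Linear(n_patches * local_dim, global_dim)
        self.act = nn.Sigmoid()

    def forward(self, patches):
        # patches: (n_patches, patch_dim), RGB patches reshaped into vectors.
        local = self.act(self.enc1(patches))   # local feature responses
        cascaded = local.reshape(1, -1)        # cascade the local responses
        return self.act(self.enc2(cascaded))   # global feature expression
```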

In [11], PCA whitening and pooling are applied in the deep network to decrease the dimensionality of the feature vectors. Instead, we adjust the number of neurons in each feature expressing layer to gain computational efficiency while maintaining similar performance.

3.2 New Loss Function

We formulate a new loss function for the multi-scale stacked auto-encoder by adding two regularization terms as follows,

$$\begin{aligned} \begin{aligned} L=\mathop {\mathrm{argmin}}\limits _{W}&\left| \left| x_{i}-W^{T}Wx_{i} \right| \right| _{2}\\&+ \alpha \ \left| \left| h(Wx_{i}^{old})-h(Wx_{i}) \right| \right| _{2}\\&+\beta \sum _{j\ne i}^{N-1} \left| \left| h(Wx_{i})-h(Wx_{j}) \right| \right| _{2} \end{aligned} \end{aligned}$$
(6)

where \(x_{i}\) denotes the input vector of the encoder, W the weight matrix, and h a sigmoid transformation. Both the first-layer and the second-layer auto-encoders are trained by minimizing this function. The three terms are explained below.

Identity. The first term in Eq. (6) is the standard loss term of an auto-encoder, learning the intrinsic feature of a detection. It represents the identity relationship in Eqs. (3) and (4).

Continuity. In the second term of Eq. (6), \(x_{i}^{old}\) and \(x_{i}\) are the inputs in the previous and the current frame respectively, and \(\alpha \) is a weight parameter. This term guarantees high similarity between features of the same target in two adjacent frames. In this way, the continuity of targets in (3) is introduced into the output features.

The inheritance strategy of the stacked auto-encoders also contributes to the continuity characteristic. In this strategy, the stacked auto-encoder of a current detection inherits its weights from the nearest detection in the previous frame, which has a great chance of being the same target as the current detection. However, this assumption is not always correct. An appropriate weight parameter \(\alpha \) in (6) keeps the feature similarity error caused by mistaken inheritance within an acceptable limit.

Distinguishability. In the third term, \(x_{j}\) indicates the other targets in the same frame and N is the number of detections. In order to differentiate the targets, we set \(\beta \) to a negative value and treat the term as a penalty. It forces the extracted features of different targets to stay as far away from each other as possible, representing the distinguishable property in Eq. (4). A sketch of the full loss is given below.
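The following is a hedged Python sketch of loss (6); the tensor shapes, the helper name feature_loss, and the treatment of h as an element-wise sigmoid are assumptions made for illustration.

```python
import torch

def feature_loss(W, x_i, x_i_old, others, alpha=0.15, beta=-0.5):
    """The three terms of Eq. (6) for one encoder layer with weights W."""
    h = torch.sigmoid
    # Identity: standard auto-encoder reconstruction loss.
    identity = torch.norm(x_i - W.t() @ W @ x_i)
    # Continuity: the same target should look similar in adjacent frames.
    continuity = torch.norm(h(W @ x_i_old) - h(W @ x_i))
    # Distinguishability: push features of other detections away (beta < 0).
    distinct = sum(torch.norm(h(W @ x_i) - h(W @ x_j)) for x_j in others)
    return identity + alpha * continuity + beta * distinct
```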

3.3 Online Learning

In our online learning system, the multi-scale stacked auto-encoders are trained by solving (6) with the stochastic gradient descent (SGD) algorithm [2]. The first several frames are used to train the multi-scale stacked auto-encoders from randomly initialized weight matrices until they reach a relatively stable state. From then on, the original input passes through the auto-encoder to obtain the final feature expression at the current frame. Then, through the inheritance relation expounded above, the weight matrix of each auto-encoder is transmitted to the next frame and fine-tuned with the new detections. Thus, the feature expressions are computed frame by frame. A sketch of this loop is given below.
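A minimal sketch of the online loop follows, assuming PyTorch and a hypothetical detection interface (det.id, det.prev_id, det.loss); neither interface is specified in the paper.

```python
import copy
import torch

def online_update(frames, make_encoder, lr=0.01, init_frames=5):
    encoders = {}  # detection id -> its multi-scale stacked auto-encoder
    for t, detections in enumerate(frames):
        for det in detections:
            if t < init_frames or det.prev_id not in encoders:
                enc = make_encoder()  # random initialization
            else:
                # Inherit weights from the nearest detection in the previous
                # frame, which is likely (but not certainly) the same target.
                enc = copy.deepcopy(encoders[det.prev_id])
            opt = torch.optim.SGD(enc.parameters(), lr=lr)
            opt.zero_grad()
            det.loss(enc).backward()  # loss (6) on this detection's inputs
            opt.step()                # fine-tune with the new detection
            encoders[det.id] = enc    # feature: forward pass through enc
    return encoders
```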

4 New Feature in Network Flow

In this section, we describe how the new features extracted by our online system are utilized, and explain why they improve MOT performance when solving the network flow problem.

The min-cost network flow paradigm for the MOT problem in [8] was formulated as,

$$\begin{aligned} \begin{aligned} \mathcal {F}^{*} =\mathop {\mathrm{argmin}}\limits _{\mathcal {F}}&\sum _{i}C_{in}(i)f_{in}(i)+\sum _{i}C_{out}(i)f_{out}(i) \\&+ \sum _{i}C_{det}(i)f_{det}(i)+\sum _{i,j}C_{t}(i,j)f_{t}(i,j) \end{aligned} \end{aligned}$$
(7)

where the total flow set \(\mathcal {F}\) is a solution over all the indicator variables \(f_{in}(i)\), \(f_{out}(i)\), \(f_{det}(i)\), \(f_{t}(i,j)\) in the network. \(f_{in}(i)\), \(f_{out}(i)\) and \(f_{det}(i)\) indicate whether detection i is a start, an end, or part of a trajectory, respectively. For each trajectory, \(f_{t}(i,j)\) determines whether detection j directly follows detection i. Costs \(C_{in}(i)\) and \(C_{out}(i)\) define the probability of detection i being the start or the end of a trajectory. The detection cost \(C_{det}(i)\) is linked to the score the detector gave to detection i; it represents the confidence that the target is a real one. \(C_{t}(i,j)\), the connection cost, describes the probability that detections i and j belong to the same trajectory. The graph structure is sketched below.
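The following schematic shows the standard min-cost flow construction behind Eq. (7), following [8, 13]: each detection i is split into an entry node and an exit node, with the cost callables as placeholders (in [8] they are trained off-line). Transition edges are enumerated over all pairs here only for clarity; in practice they are pruned, as discussed next.

```python
def build_flow_graph(detections, C_in, C_out, C_det, C_t):
    edges = []  # (u, v, cost); every edge has unit capacity
    for i in detections:
        edges.append(("S", f"u{i}", C_in(i)))       # f_in(i): start a track
        edges.append((f"u{i}", f"v{i}", C_det(i)))  # f_det(i): use detection i
        edges.append((f"v{i}", "T", C_out(i)))      # f_out(i): end a track
        for j in detections:
            if j != i:
                edges.append((f"v{i}", f"u{j}", C_t(i, j)))  # f_t(i,j): link
    return edges
```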

In [8], all the costs \(C_{in}(i)\), \(C_{out}(i)\), \(C_{det}(i)\) and \(C_{t}(i,j)\) are fixed values trained off-line. Given the specificity of each video sequence, such constant costs lead to solutions that are not very robust: they cannot adapt to diverse kinds of MOT scenes. In addition, [8] only uses arbitrary or simply designed constraints to shrink the solution space. For example, edges from detection i to detection j in the network are deleted when detection j is not in the next frame or has no spatial overlap with detection i. These measures indeed speed up the search for the optimal flow, but may miss better results because the constraints are overly simple and rigid.

In our method, in order to make the model more accurate and self-adaptive to video sequences of different kinds, the detection similarity computed from the new features is mapped into a variable cost which replaces the originally fixed \(C_{t}(i,j)\). With highly confident detection similarities calculated from features with continuity and distinguishability, it costs less to associate the same target and to distinguish the others. Furthermore, to find the optimal solution with higher reliability, we set a similarity threshold to cut down the search space: once the similarity computed from two detection features is below the threshold, the edge between the two detections is deleted. Location change, size change and overlap ratio are also taken into account, each assigned a particular weight to adjust its impact on the final decision. Besides, we do not insist that two connected detections be in adjacent frames, but relax the constraint to an appropriate frame gap. Thus, detections that are lost and reappear due to occlusion get a chance to connect with their former trajectories. A sketch of this edge gating is given below.
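The following is a hedged sketch of the edge gating described above. The thresholds, the weights, and the affinity callables (location, size, overlap, each assumed to map to [0, 1]) are illustrative assumptions, not values from the paper.

```python
def keep_edge(det_i, det_j, sim, affinities, tau_sim=0.5, tau_score=0.6,
              max_gap=5, weights=(0.6, 0.15, 0.1, 0.15)):
    gap = det_j.frame - det_i.frame
    if gap < 1 or gap > max_gap:   # relaxed adjacency: allow a frame gap
        return False
    s = sim(det_i, det_j)          # similarity from the new features
    if s < tau_sim:                # similarity threshold prunes the edge
        return False
    loc, size, ovl = (a(det_i, det_j) for a in affinities)
    w_sim, w_loc, w_size, w_ovl = weights
    score = w_sim * s + w_loc * loc + w_size * size + w_ovl * ovl
    return score >= tau_score      # weighted final decision
```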

5 Experiment

In this section, we first evaluate the quality of the features produced by our online feature extraction scheme based on multi-scale stacked auto-encoders on the TUD-Crossing video sequence [1], which is widely used to check tracking performance in MOT tasks. The continuity and distinguishability of the extracted features are revealed by comparison with other handcrafted features used in MOT. The influence of the weight parameters in loss function (6) is also explained and analyzed. Finally, we apply our feature extraction to a typical network flow method, DP_NMS [8], as discussed in Sect. 4, and test the overall tracking performance on a series of video sequences from the MOT2015 benchmark [7].

Fig. 3. TUD-Crossing detection result.

5.1 Continuity and Distinguishability

Figure 3 shows the detections in several frames of TUD-Crossing. Our feature extraction scheme is applied to the 200 frames of the video, and the similarity performance of the features is evaluated. The similarity is measured with the Bhattacharyya distance, which is commonly used in video tracking [4]. A sketch of the similarity computation is given below.
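The sketch below computes the Bhattacharyya coefficient between two feature vectors; treating the feature vectors as normalized histograms is our assumption about how the comparison is carried out. A coefficient of 1.0 means identical features.

```python
import numpy as np

def bhattacharyya_similarity(f1, f2, eps=1e-12):
    p = np.abs(f1) / (np.abs(f1).sum() + eps)  # feature as a histogram
    q = np.abs(f2) / (np.abs(f2).sum() + eps)
    return float(np.sqrt(p * q).sum())         # 1.0 means identical
```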

Table 1 presents the similarity of features of the rightmost target between adjacent frames at frames 28, 29 and 30, as shown in Fig. 3. The last column gives the average similarity of this target over the whole video sequence. Clearly, the features from our online learning method provide the highest similarity and remain the most consistent across consecutive frames. High similarity for the same target means good continuity.

Table 1. Continuity of features

Table 2 presents the similarity of features between the rightmost target and three other targets at frames 28, 29 and 30, respectively. The last column shows the average diversity, i.e., the difference between the similarities of identical targets and of different targets. Our method achieves the highest diversity compared with the two other handcrafted features, so it provides the best distinguishability.

Table 2. Distinguishability of features

Table 3 illustrates the influence of the weight parameters in loss function (6). The first row lists a suitable pair, \(\alpha =0.15\) and \(\beta =-0.5\). When \(\beta \) increases to \(-0.1\), as in the second row of the table, the diversity between different targets decreases to a low level. Though increasing \(\alpha \) seems to raise it back, as shown in the third row, this cannot adapt to the situation in other frames, as shown in Table 4.

Table 3. Influence of weight parameter

As also shown in Table 4, at frame 32 there is a wrong inheritance of the auto-encoder for the rightmost target, which is unavoidable since the detections cannot be absolutely correct. When the weight factor \(\alpha \) dominates (\(pair_{3}\) for example), the identity similarity stays at a high level and the two targets are hard to distinguish. It is good practice to keep \(\alpha \) below a proper limit (\(pair_{1}\) for example) to separate them apart while maintaining high continuity.

Table 4. Wrong inheritance and weight parameters

5.2 Tracking Performance

Table 5. Tracking performance

We apply our feature extraction scheme to the DP_NMS model of [8], as expounded in Sect. 4. Evaluations are carried out on several widely used MOT video sequences from the MOT2015 benchmark, for both the original method [8] and the improved one. The results are presented in Table 5. In the evaluation, our own detections and ground truths are used for PETS and TUD-Crossing, while public data from the MOT2015 benchmark [7] are used for the other two videos. The results show that our scheme effectively improves the performance of the classical network flow solution on most of the evaluation metrics. The detection features extracted by our scheme can likewise benefit most MOT methods based on the tracking-by-detection paradigm.

6 Conclusion

An online feature extraction scheme built on a CRF model is proposed in this paper, which enforces both the continuity and the distinguishability of target features. The CRF model is transformed into a number of multi-scale stacked auto-encoders with a new loss function to obtain the feature expression of targets. The new features extracted by our online scheme are applied to improve the network flow tracking model. Experiments show better performance compared with other handcrafted features and a good performance improvement on MOT tasks with our feature extraction.