1 Introduction

Robust detection and tracking of objects in video sequences is one of the key tasks of computer vision, and multiple object tracking (MOT) is one of its most active topics. MOT links multiple independent targets of interest in each frame with their precursors and successors through feature association, forming complete trajectories.

In recent years, with the great progress in object detection [5, 9], tracking-by-detection has become one of the mainstream approaches to MOT. The detection output in each frame is given as prior knowledge in the first step. Then, data association and global optimization methods are applied to connect the discrete detections into trajectories. One of the state-of-the-art global data association methods is based on a network flow paradigm. Zhang et al. were the first to formulate the MOT problem as finding a maximum a posteriori probability and to transform it into a min-cost network flow problem [13]. Pirsiavash et al. developed this idea further and used a dynamic programming algorithm to improve efficiency [8].

However, these methods ignore the higher-order information among adjacent frames of an image sequence. Traditional tracking methods cannot achieve robust results in complex scenes since they only extract static and superficial features. To address this, Butt and Collins added higher-order tracking smoothness constraints, such as constant velocity [3]. On the other hand, the usual handcrafted features are not efficient enough. Features extracted by deep neural networks have been successfully applied to single target tracking [11, 14]. Unlike single object tracking, MOT is concerned with different targets of the same category, such as pedestrian individuals in a surveillance application. The task is not only to connect the same target across frames, but also to distinguish it from the others. Traditional features and classification methods are not efficient for this purpose since they are usually based on the characteristics of whole categories.

In this paper, we propose a novel online feature extraction scheme appropriate for the tracking-by-detection framework of MOT. The scheme is based on a conditional random field (CRF) model whose potential functions are implemented by a number of multi-scale stacked auto-encoders. The loss function of the auto-encoders is formulated with constraints on the continuity and the distinguishability of targets. The continuity constraint also implies an inheritance relationship of the stacked auto-encoders between frames, which relates to the higher-order information of a trajectory. We apply the new features to a classical network flow method [8] for MOT and reach a better result, which demonstrates the benefit of our feature extraction scheme.

The rest of the paper is organized as follows. Section 2 describes the CRF model of the online feature extraction scheme. The multi-scale stacked auto-encoder, the new loss function and the online approach are discussed in Sect. 3. Section 4 explains how we incorporate the new features into a network flow method and why this works. Section 5 presents experiments on the performance of the obtained features and on MOT tasks. Finally, conclusions are drawn in Sect. 6.

2 Conditional Random Field Model

Our online feature extraction scheme is based on a CRF model. Target detections at different frames, as conditions, are abstracted as an observation set \(X=\{x_{1}^{1},x_{1}^{2},\cdots ,x_{k}^{j-1}, x_{k}^{j},x_{k}^{j+1},\cdots ,x_{m}^{n}\}\), where \(x_{k}^{j}\) denotes the jth detection at frame k. Correspondingly, let \(Y=\{y_{1}^{1},y_{1}^{2},\cdots ,y_{k}^{j-1},y_{k}^{j}, y_{k}^{j+1},\cdots ,y_{m}^{n}\}\) be the state set, where \(y_{k}^{j}\) is the feature expression of \(x_{k}^{j}\). The feature extraction problem for MOT can then be formulated as a maximum conditional probability problem in the CRF as follows,

Fig. 1. Conditional random field model. (Color figure online)

$$\begin{aligned} \mathop {\mathrm{argmax}}\limits _{Y,\theta }\,{P(Y|X,\theta )} \end{aligned}$$
(1)

where \(\theta \) represents all of the undetermined parameters in the model. The conditional probability is defined as a normalized product of all the potential functions based on their corresponding maximal cliques \(\{y_{k-1}^{j},y_{k}^{j}\} \) for \(k\in \{2,3,\cdots ,m\}\), or \(\{y_{k}^{j},y_{k}^{l}\} \) for \(j\ne l\), and it is given by,

$$\begin{aligned} p(Y|X,\theta )=\dfrac{1}{Z}\prod \limits _{j}\prod \limits _{k} (\psi _{1}(y_{k-1}^{j},y_{k}^{j})\prod \limits _{l\ne j}\psi _{2}(y_{k}^{j},y_{k}^{l})) \end{aligned}$$
(2)

where \(\psi _{1}(y_{k-1}^{j},y_{k}^{j})\) and \(\psi _{2}(y_{k}^{j},y_{k}^{l})\) are the potential functions of maximal cliques respectively, and Z is a normalization factor.

Figure 1 sketches two different targets and their features at frames \(t,t+1,\cdots ,t+r\). The two types of maximal cliques are displayed as red and blue edges respectively. The potential functions can be expressed in exponential form,

$$\begin{aligned} \begin{aligned} \psi _{1}(y_{k-1}^{j},y_{k}^{j})=\exp (I(y_{k}^{j},x_{k}^{j}) +\lambda S(y_{k-1}^{j},y_{k}^{j},x_{k-1}^{j},x_{k}^{j})) \end{aligned} \end{aligned}$$
(3)

and,

$$\begin{aligned} \begin{aligned} \psi _{2}(y_{k}^{j},y_{k}^{l})=\exp (I(y_{k}^{j},x_{k}^{j}) + \mu D(y_{k}^{j},y_{k}^{l},x_{k}^{j},x_{k}^{l})) \end{aligned} \end{aligned}$$
(4)

where \(I(y_{k}^{j},x_{k}^{j})\) represents the identity relationship between observations and features, \(S(y_{k-1}^{j},y_{k}^{j},x_{k-1}^{j},x_{k}^{j})\) and \(D(y_{k}^{j},y_{k}^{l},x_{k}^{j},x_{k}^{l})\) describe the sequential characteristic and distinguishable property of features respectively.

Since any nonlinear function can be approximated by a deep neural network, we utilize a multi-scale stacked auto-encoder \(f_{k}^{j}\) to implement the product of the potential functions in (2) and to obtain the optimal feature extraction.

$$\begin{aligned} \begin{aligned} \psi _{1}(y_{k-1}^{j},y_{k}^{j})\prod \limits _{l\ne j}\psi _{2}(y_{k}^{j},y_{k}^{l}) = f_{k}^{j} \end{aligned} \end{aligned}$$
(5)

3 Approach

We transform the maximum conditional probability problem in (1) into an online training process for a number of stacked auto-encoders by modeling the product of potential functions as in (5). For each detection, an auto-encoder is trained according to a new loss function we propose, which considers the three relations in (3) and (4). The online training process is initialized in the first several frames and runs iteratively on the following frames. The structure of the auto-encoder, our new loss function and the online training are presented below.

3.1 Auto-encoder Structure

Compared with many handcrafted ones, features learned by an unsupervised model, especially a deep learning model, tell more about the fundamental or inner properties of an object [6, 12]. Recent studies [10] have shown that the stacked auto-encoder, one of the most widely used unsupervised deep learning models, is capable of grasping the most generic and intrinsic features of the object of interest. In light of [14], the internal feature of each detection is extracted by a two-layer stacked auto-encoder similar to those in [11, 14].

Fig. 2. Multi-scale stacked auto-encoder structure. (Color figure online)

As shown in Fig. 2, the two-layer stacked auto-encoder consists of one input layer (red) and two feature expressing layers (blue and yellow). At the input layer, different patches from the area given by each detection bounding box are reshaped into vectors with three RGB channels. Each vector passes through the first-layer weight matrix and a nonlinear function to generate a feature response of local characteristics. Then, the local responses are cascaded together and sent into the second layer to learn the global information. A minimal sketch of this forward pass is given below.
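The following Python sketch illustrates the two-layer forward pass under stated assumptions: the patch grid, layer sizes, and the class name StackedAutoEncoder are illustrative choices, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class StackedAutoEncoder(nn.Module):
    def __init__(self, patch_dim=3 * 16 * 16, n_patches=4,
                 local_dim=128, global_dim=256):
        super().__init__()
        # First layer: shared encoder for local RGB patches (red -> blue).
        self.enc1 = nn.Linear(patch_dim, local_dim)
        # Second layer: encodes the cascaded local responses (blue -> yellow).
        self.enc2 = nn.Linear(n_patches * local_dim, global_dim)
        self.act = nn.Sigmoid()

    def forward(self, patches):
        # patches: (n_patches, patch_dim), RGB patches reshaped into vectors.
        local = self.act(self.enc1(patches))   # local feature responses
        cascaded = local.reshape(1, -1)        # cascade the local responses
        return self.act(self.enc2(cascaded))   # global feature expression
```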

In [11], PCA whitening and pooling are applied in the deep network to decrease the dimensionality of the feature vectors. Instead, we adjust the number of neurons in each feature expressing layer to gain computational efficiency while maintaining similar performance.

3.2 New Loss Function

We formulate a new loss function for the multi-scale stacked auto-encoder by adding two regularization terms as follows,

$$\begin{aligned} \begin{aligned} L=\mathop {\mathrm{argmin}}\limits _{W}&\left| \left| x_{i}-W^{T}Wx_{i} \right| \right| _{2}\\&+ \alpha \ \left| \left| h(Wx_{i}^{old})-h(Wx_{i}) \right| \right| _{2}\\&+\beta \sum _{j\ne i}^{N-1} \left| \left| h(Wx_{i})-h(Wx_{j}) \right| \right| _{2} \end{aligned} \end{aligned}$$
(6)

where \(x_{i}\) denotes the input vector of the encoder, W the weight matrix, and h a sigmoid transformation. Both the first-layer and the second-layer auto-encoders are trained by minimizing this function. The three terms are explained below.

Identity. The first term in Eq. (6) is the standard loss term of an auto-encoder, learning the intrinsic feature of a detection. It represents the identity relationship in Eqs. (3) and (4).

Continuity. In the second term of Eq. (6), \(x_{i}^{old}\) and \(x_{i}\) are the inputs in the previous and the current frame respectively, and \(\alpha \) is a weight parameter. This term guarantees high similarity between features of the same target in two adjacent frames. In this way, the continuity of targets in (3) is introduced into the output features.

The inheritance strategy of the stacked auto-encoders also contributes to the continuity characteristic. In this strategy, the stacked auto-encoder of a current detection inherits its weights from the nearest detection in the previous frame, which has a great chance of being the same target as the current detection. However, this assumption is not always correct. An appropriate weight parameter \(\alpha \) in (6) keeps the feature similarity error caused by mistaken inheritance within an acceptable limit.

Distinguishability. In the third term, \(x_{j}\) indicates the other targets in the same frame and N is the number of detections. In order to differentiate the targets, we set \(\beta \) to a negative value and treat the term as a penalty. It forces the extracted features of different targets to stay as far away from each other as possible, representing the distinguishable property in Eq. (4). A sketch of the full loss is given below.
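The following is a hedged Python sketch of loss (6); the tensor shapes, the helper name feature_loss, and the treatment of h as an element-wise sigmoid are assumptions made for illustration.

```python
import torch

def feature_loss(W, x_i, x_i_old, others, alpha=0.15, beta=-0.5):
    """The three terms of Eq. (6) for one encoder layer with weights W."""
    h = torch.sigmoid
    # Identity: standard auto-encoder reconstruction loss.
    identity = torch.norm(x_i - W.t() @ W @ x_i)
    # Continuity: the same target should look similar in adjacent frames.
    continuity = torch.norm(h(W @ x_i_old) - h(W @ x_i))
    # Distinguishability: push features of other detections away (beta < 0).
    distinct = sum(torch.norm(h(W @ x_i) - h(W @ x_j)) for x_j in others)
    return identity + alpha * continuity + beta * distinct
```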

3.3 Online Learning

In our online learning system, the multi-scale stacked auto-encoders are trained by solving (6) with the stochastic gradient descent (SGD) algorithm [2]. The first several frames are used to train the multi-scale stacked auto-encoders from randomly initialized weight matrices until they reach a relatively stable state. From then on, the original input passes through the auto-encoder to obtain the final feature expression at the current frame. Then, through the inheritance relation expounded above, the weight matrix of each auto-encoder is transmitted to the next frame and fine-tuned with the new detections. Thus, the feature expressions are computed frame by frame. A sketch of this loop is given below.
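A minimal sketch of the online loop follows, assuming PyTorch and a hypothetical detection interface (det.id, det.prev_id, det.loss); neither interface is specified in the paper.

```python
import copy
import torch

def online_update(frames, make_encoder, lr=0.01, init_frames=5):
    encoders = {}  # detection id -> its multi-scale stacked auto-encoder
    for t, detections in enumerate(frames):
        for det in detections:
            if t < init_frames or det.prev_id not in encoders:
                enc = make_encoder()  # random initialization
            else:
                # Inherit weights from the nearest detection in the previous
                # frame, which is likely (but not certainly) the same target.
                enc = copy.deepcopy(encoders[det.prev_id])
            opt = torch.optim.SGD(enc.parameters(), lr=lr)
            opt.zero_grad()
            det.loss(enc).backward()  # loss (6) on this detection's inputs
            opt.step()                # fine-tune with the new detection
            encoders[det.id] = enc    # feature: forward pass through enc
    return encoders
```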

4 New Feature in Network Flow

In this section, we describe how the new features extracted by our online system are utilized, and explain why they improve MOT performance when solving the network flow problem.

The min-cost network flow paradigm for the MOT problem in [8] was formulated as,

$$\begin{aligned} \begin{aligned} \mathcal {F}^{*} =\mathop {\mathrm{argmin}}\limits _{\mathcal {F}}&\sum _{i}C_{in}(i)f_{in}(i)+\sum _{i}C_{out}(i)f_{out}(i) \\&+ \sum _{i}C_{det}(i)f_{det}(i)+\sum _{i,j}C_{t}(i,j)f_{t}(i,j) \end{aligned} \end{aligned}$$
(7)

where the total flow set \(\mathcal {F}\) is a solution over all the indicator variables \(f_{in}(i)\), \(f_{out}(i)\), \(f_{det}(i)\), \(f_{t}(i,j)\) in the network. \(f_{in}(i)\), \(f_{out}(i)\) and \(f_{det}(i)\) indicate whether detection i is a start, an end, or part of a trajectory, respectively. For each trajectory, \(f_{t}(i,j)\) determines whether detection j directly follows detection i. Costs \(C_{in}(i)\) and \(C_{out}(i)\) define the probability of detection i being the start or the end of a trajectory. The detection cost \(C_{det}(i)\) is linked to the score the detector gave to detection i; it represents the confidence that the target is a real one. \(C_{t}(i,j)\), the connection cost, describes the probability that detections i and j belong to the same trajectory. The graph structure is sketched below.
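The following schematic shows the standard min-cost flow construction behind Eq. (7), following [8, 13]: each detection i is split into an entry node and an exit node, with the cost callables as placeholders (in [8] they are trained off-line). Transition edges are enumerated over all pairs here only for clarity; in practice they are pruned, as discussed next.

```python
def build_flow_graph(detections, C_in, C_out, C_det, C_t):
    edges = []  # (u, v, cost); every edge has unit capacity
    for i in detections:
        edges.append(("S", f"u{i}", C_in(i)))       # f_in(i): start a track
        edges.append((f"u{i}", f"v{i}", C_det(i)))  # f_det(i): use detection i
        edges.append((f"v{i}", "T", C_out(i)))      # f_out(i): end a track
        for j in detections:
            if j != i:
                edges.append((f"v{i}", f"u{j}", C_t(i, j)))  # f_t(i,j): link
    return edges
```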

In [8], all the costs \(C_{in}(i)\), \(C_{out}(i)\), \(C_{det}(i)\) and \(C_{t}(i,j)\) are fixed values trained off-line. Given the specificity of each video sequence, such constant costs lead to solutions that are not very robust: they cannot adapt to diverse kinds of MOT scenes. In addition, [8] only uses arbitrary or simply designed constraints to shrink the solution space. For example, edges from detection i to detection j in the network are deleted when detection j is not in the next frame or has no spatial overlap with detection i. These measures indeed speed up the search for the optimal flow, but may miss better results because the constraints are overly simple and rigid.

In our method, in order to make the model more accurate and self-adaptive to video sequences of different kinds, the detection similarity computed from the new features is mapped into a variable cost which replaces the originally fixed \(C_{t}(i,j)\). With highly confident detection similarities calculated from features with continuity and distinguishability, it costs less to associate the same target and to distinguish the others. Furthermore, to find the optimal solution with higher reliability, we set a similarity threshold to cut down the search space: once the similarity computed from two detection features is below the threshold, the edge between the two detections is deleted. Location change, size change and overlap ratio are also taken into account, each assigned a particular weight to adjust its impact on the final decision. Besides, we do not insist that two connected detections be in adjacent frames, but relax the constraint to an appropriate frame gap. Thus, detections that are lost and reappear due to occlusion get a chance to connect with their former trajectories. A sketch of this edge gating is given below.
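The following is a hedged sketch of the edge gating described above. The thresholds, the weights, and the affinity callables (location, size, overlap, each assumed to map to [0, 1]) are illustrative assumptions, not values from the paper.

```python
def keep_edge(det_i, det_j, sim, affinities, tau_sim=0.5, tau_score=0.6,
              max_gap=5, weights=(0.6, 0.15, 0.1, 0.15)):
    gap = det_j.frame - det_i.frame
    if gap < 1 or gap > max_gap:   # relaxed adjacency: allow a frame gap
        return False
    s = sim(det_i, det_j)          # similarity from the new features
    if s < tau_sim:                # similarity threshold prunes the edge
        return False
    loc, size, ovl = (a(det_i, det_j) for a in affinities)
    w_sim, w_loc, w_size, w_ovl = weights
    score = w_sim * s + w_loc * loc + w_size * size + w_ovl * ovl
    return score >= tau_score      # weighted final decision
```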

5 Experiment

In this section, we first evaluate the quality of the features produced by our online feature extraction scheme based on multi-scale stacked auto-encoders on the TUD-Crossing video sequence [1], which is widely used to check tracking performance in MOT tasks. The continuity and distinguishability of the extracted features are revealed by comparison with other handcrafted features used in MOT. The influence of the weight parameters in loss function (6) is also explained and analyzed. Finally, we apply our feature extraction to a typical network flow method, DP_NMS [8], as discussed in Sect. 4, and test the overall tracking performance on a series of video sequences from the MOT2015 benchmark [7].

Fig. 3. TUD-Crossing detection result.

5.1 Continuity and Distinguishability

Figure 3 shows the detections in several frames of TUD-Crossing. Our feature extraction scheme is applied to the 200 frames of the video, and the similarity performance of the features is evaluated. The similarity is measured with the Bhattacharyya distance, which is commonly used in video tracking [4]. A sketch of the similarity computation is given below.
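The sketch below computes the Bhattacharyya coefficient between two feature vectors; treating the feature vectors as normalized histograms is our assumption about how the comparison is carried out. A coefficient of 1.0 means identical features.

```python
import numpy as np

def bhattacharyya_similarity(f1, f2, eps=1e-12):
    p = np.abs(f1) / (np.abs(f1).sum() + eps)  # feature as a histogram
    q = np.abs(f2) / (np.abs(f2).sum() + eps)
    return float(np.sqrt(p * q).sum())         # 1.0 means identical
```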

Table 1 presents the similarity of features of the rightmost target between adjacent frames at frames 28, 29 and 30, as shown in Fig. 3. The last column gives the average similarity of this target over the whole video sequence. Clearly, the features from our online learning method provide the highest similarity and remain the most consistent across consecutive frames. High similarity for the same target means good continuity.

Table 1. Continuity of features

Table 2 presents the similarity of features between the rightmost target and three other targets at frames 28, 29 and 30, respectively. The last column shows the average diversity, i.e., the difference between the similarities of identical targets and of different targets. Our method achieves the highest diversity compared with the two other handcrafted features, so it provides the best distinguishability.

Table 2. Distinguishability of features

Table 3 illustrates the influence of the weight parameters in loss function (6). The first row lists a suitable pair, \(\alpha =0.15\) and \(\beta =-0.5\). When \(\beta \) increases to \(-0.1\), as in the second row of the table, the diversity between different targets decreases to a low level. Though increasing \(\alpha \) seems to raise it back, as shown in the third row, this cannot adapt to the situation in other frames, as shown in Table 4.

Table 3. Influence of weight parameter

As also shown in Table 4, at frame 32 there is a wrong inheritance of the auto-encoder for the rightmost target, which is unavoidable since the detections cannot be absolutely correct. When the weight factor \(\alpha \) dominates (\(pair_{3}\) for example), the identity similarity stays at a high level and the two targets are hard to distinguish. It is good practice to keep \(\alpha \) below a proper limit (\(pair_{1}\) for example) to separate them apart while maintaining high continuity.

Table 4. Wrong inheritance and weight parameters

5.2 Tracking Performance

Table 5. Tracking performance

We apply our feature extraction scheme to the DP_NMS model of [8], as expounded in Sect. 4. Evaluations are carried out on several widely used MOT video sequences from the MOT2015 benchmark, for both the original method [8] and the improved one. The results are presented in Table 5. In the evaluation, our own detections and ground truths are used for PETS and TUD-Crossing, while public data from the MOT2015 benchmark [7] are used for the other two videos. The results show that our scheme effectively improves the performance of the classical network flow solution on most of the evaluation metrics. The detection features extracted by our scheme can likewise benefit most MOT methods based on the tracking-by-detection paradigm.

6 Conclusion

An online feature extraction scheme built on a CRF model is proposed in this paper, which enforces both the continuity and the distinguishability of target features. The CRF model is transformed into a number of multi-scale stacked auto-encoders with a new loss function to obtain the feature expression of targets. The new features extracted by our online scheme are applied to improve the network flow tracking model. Experiments show better performance compared with other handcrafted features and a good performance improvement on MOT tasks with our feature extraction.