1 Introduction

Detection, localization and identification of people are among the hardest problems in the video-surveillance research field, and they have attracted increasing interest in recent years. The use of physical and behavioral traits has recently been introduced as a recognition technique known as Soft Biometrics [1]. It can be used to filter large amounts of data or to identify and re-identify individuals.

This paper focuses on determining human height, a geometric feature that can be automatically estimated in video-surveillance scenarios. Unlike other biometric features, this information can be obtained from videos taken at long distances where people walk in any direction. Height estimation can be used both as a Soft Biometric and as a feature for person tracking. In the first case, it discards candidate subjects whose heights differ considerably from the target's, so that the search can focus on more distinctive identification features. In the second case, it can be used for temporal and spatial correspondence analysis during person tracking.

A significant number of works tackle the problem of object height estimation. Some of them rely on calibrated cameras and use the intrinsic and extrinsic parameters for the estimation [2, 3]. On the other hand, there are methods based on uncalibrated cameras [4,5,6,7,8]. In general, these methods perform camera auto-calibration based on scene geometry or object tracking [6, 9]. Their main drawback is that they require information about the extrinsic parameters of the camera or measurements in the 3D world as a reference for the estimation.

We propose a new method for estimating the real height of pedestrians using uncalibrated cameras. This work builds on the algorithm of Richardson et al. [10], who proposed a method to estimate the image horizon from multiple tracked objects together with an estimate of relative height. Our differences with respect to that work, and our main contributions, are: (1) we introduce an algorithm to evaluate the quality of human silhouettes within the horizon detection and height estimation processes, and (2) we propose a new method to estimate real human height that obtains the real height of pedestrians from uncalibrated cameras without using any prior knowledge about the scene geometry or the camera parameters.

2 Estimation of the Horizon and Relative Height

An important step for object height estimation is estimating the image horizon. For this task, Richardson et al. [10] established the condition that the camera X-axis must be parallel to the ground plane of the 3D world, which implies that the horizon can be uniquely defined by a single value on the image Y-axis. When this condition is not met, they proposed a method to find the angle between the camera's X-axis and the ground plane in order to rotate the video. A flat ground plane is also assumed. Under these conditions, when an object moves so that its position along the image Y-axis ascends, it becomes smaller. The image horizon is defined as the position where a moving object would become infinitesimally small.

Richardson et al. [10] showed that a linear relationship exists between an object's vertical image position (\(y\)) and its height (\(h\)) in the image (see Fig. 1b). In other words, for a moving object in the scene, the equation \(y = mh + n\) (height equation) holds. The parameters \(m\) and \(n\) of this equation can be estimated by a linear regression over a set of values \(\{(h_i, y_i)\}\) extracted from the track of a moving object (see Fig. 1a). In this equation, \(n\) is the vertical image position at which the object's height becomes zero; therefore, \(n\) represents the horizon according to the definition above (see Fig. 1b).
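
As an illustration, the parameters of the height equation can be obtained with an ordinary least-squares fit. The following minimal sketch (Python with NumPy) does this for a single track; the toy measurements are invented for illustration only.

```python
import numpy as np

def height_line_params(heights, y_positions):
    """Fit the height equation y = m*h + n for one tracked object.

    heights     -- object heights in pixels, one per tracked frame
    y_positions -- image Y coordinate of the object's lowest point (feet)
    Returns (m, n); n is the Y position where the height becomes zero,
    i.e. this object's estimate of the image horizon.
    """
    m, n = np.polyfit(np.asarray(heights, dtype=float),
                      np.asarray(y_positions, dtype=float), deg=1)
    return m, n

# Toy track: the higher the object appears in the image (smaller y),
# the smaller its height in pixels.
hs = [120, 110, 95, 80, 60]
ys = [400, 372, 330, 288, 232]
m, n = height_line_params(hs, ys)
print(f"slope m = {m:.2f}, horizon estimate n = {n:.2f}")
```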

Fig. 1. (a) Multiple instances of a tracked person in a video scene; for each instance, a line is plotted representing the size of the object and its position (the lower point of the line). (b) Plot of size vs. position showing the measurements extracted for the persons and their linear relationship.

Information from multiple tracked objects is necessary for horizon estimation, since the tracked points of a single object can be affected by occlusion, segmentation errors, identity switches and other problems. A robust voting procedure based on the Hough transform is presented in [10]: the value of \(n\) obtained from each object's height equation is considered a vote, and a histogram of these values is built along the image Y-axis. The resulting probability function for the horizon has a sharp peak at \(y_h\), the most voted value.
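
A possible implementation of this voting step is sketched below; the fixed bin width along the image Y-axis is our own simplification, not a detail taken from [10].

```python
import numpy as np

def vote_horizon(horizon_candidates, image_height, bin_size=5):
    """Hough-style voting: each tracked object contributes its n value
    (intercept of its height equation) as a vote; the horizon is taken
    as the centre of the most voted bin along the image Y axis.

    horizon_candidates -- list of n values, one per tracked object
    """
    bins = np.arange(0, image_height + bin_size, bin_size)
    hist, edges = np.histogram(horizon_candidates, bins=bins)
    peak = np.argmax(hist)
    return 0.5 * (edges[peak] + edges[peak + 1])

# Votes from several objects; isolated outliers from bad tracks barely matter.
votes = [62, 65, 64, 63, 61, 120, 66, 30, 64]
print("estimated horizon y_h =", vote_horizon(votes, image_height=500))
```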

Once the image horizon (\(y_h\)) is known, a single pair (\(y_i, h_i\)) measured from a person allows estimating that person's height in pixels at any position, using the line that joins the horizon point \((y_h, 0)\) to \((y_i, h_i)\) in the height-vs-position plane (the height line). Let A and B be two objects in the scene; a relationship between the objects' heights and the slopes of their height lines can be established: A is taller than B if and only if the slope of A's height line is greater than the slope of B's. Indeed, if A is taller than B (see Fig. 2b), then at a common image position \(y_1\) their heights satisfy \(h_A > h_B\), and since \(y_1 - y_h > 0\) for objects below the horizon, \(h_A/(y_1 - y_h) > h_B/(y_1 - y_h)\); the backward implication is analogous. The slope of an object's height line is therefore a good estimate of its relative height within the scene.
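
Assuming, as in Fig. 2b, that the slope is taken in the height-vs-position plane, the relative height and the projected pixel height at another position can be computed as in the following sketch (all numeric values are illustrative).

```python
def relative_height(y_h, y_i, h_i):
    """Slope of the height line through (y_h, 0) and (y_i, h_i):
    the object's height in pixels per unit of distance below the horizon."""
    return h_i / float(y_i - y_h)

def height_at(y_h, slope, y):
    """Pixel height the same object would have at image position y."""
    return slope * (y - y_h)

y_h = 62.5                                          # horizon from the voting step
slope_a = relative_height(y_h, y_i=400, h_i=120)    # person A
slope_b = relative_height(y_h, y_i=400, h_i=100)    # person B, shorter
print(slope_a > slope_b)                            # True: A is taller than B
print(height_at(y_h, slope_a, y=300))               # A's pixel height at y = 300
```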

Fig. 2. (a) Instances of two pedestrians of different sizes at the same position. (b) Plot of height vs. position showing the relationship between height and position for these two pedestrians.

3 Silhouette Quality Process

In the work of Richardson et al. [10], all the points extracted from every moving object in the video are used to estimate the horizon and the heights, and several types of objects are considered: cars, people, animals, among others. In our work, only people are of interest. Moreover, as mentioned before, the extracted points can be affected by segmentation and tracking problems, which cause significant variations in the line parameters. For these reasons, we propose to introduce a process that selects only silhouettes belonging to people. The process consists of a shape matching algorithm that compares each frame silhouette to a set of prototypes. Depending on the matching score, it is then possible to select person silhouettes of good quality; in our case, a good-quality silhouette is one that is well segmented (with no missing body parts). Only these good silhouettes are used for horizon and height estimation.

3.1 Shape Matching Algorithm

There are currently many methods for shape matching. The method selected here is known as Turning Angle [11]. It is efficient and gives good results, as shown in the survey conducted in [12]. In addition, since in our case the persons are always standing, the method can be made even faster by omitting the starting-point selection in the matching process. In this method, a description of the shape is created from its contour: the angle between the contour direction at a given point and the X-axis is computed for each contour point in order, and the shape representation is the resulting sequence of angles (Fig. 3). Two representations are compared using Dynamic Time Warping.
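
A minimal sketch of this matching step is given below, assuming each silhouette contour is available as an ordered list of boundary points; the absolute angle-difference cost and the quality threshold are our own illustrative choices, not values taken from the paper.

```python
import numpy as np

def turning_angles(contour):
    """Turning Angle descriptor: angle between the contour direction and
    the X axis at every contour point, taken in order along the contour.
    contour -- (N, 2) array of ordered (x, y) boundary points."""
    pts = np.asarray(contour, dtype=float)
    d = np.roll(pts, -1, axis=0) - pts          # vector to the next point
    return np.arctan2(d[:, 1], d[:, 0])         # one angle per contour point

def dtw_distance(a, b):
    """Classic O(len(a)*len(b)) Dynamic Time Warping between two angle
    sequences; lower values mean more similar shapes."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]

def is_good_silhouette(contour, prototypes, threshold=50.0):
    """Keep a silhouette if its best matching score against the prototype
    set is below a quality threshold (threshold value is illustrative)."""
    desc = turning_angles(contour)
    best = min(dtw_distance(desc, turning_angles(p)) for p in prototypes)
    return best <= threshold
```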

Fig. 3. Representation of the Turning Angle method.

Fig. 4. Prototype examples.

3.2 Prototype Selection

The silhouettes from the CASIA-B database [13] were used for prototype selection. This database contains people walking, recorded from 11 different viewing angles under controlled conditions. We selected 20 subjects, including both men and women. A clustering algorithm was then used to split the silhouette set into groups according to similarity, and the representative element of each group was taken as a prototype. We used k-medoids, since it is simple, fast and provides a representative object for each cluster. The distance between silhouettes for the k-medoids algorithm was the edit distance, computed over the Freeman chain codes extracted from the silhouette contours. We selected 50 prototypes (Fig. 4); this number was chosen as a trade-off between having enough samples and keeping processing times low.
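
The following sketch outlines this prototype-selection pipeline (Freeman chain codes, edit distance, k-medoids); the chain-code direction convention, the number of iterations and the random initialization are implementation assumptions of ours.

```python
import numpy as np

# One common 8-direction Freeman numbering for the step to the next contour point.
FREEMAN = {(1, 0): 0, (1, -1): 1, (0, -1): 2, (-1, -1): 3,
           (-1, 0): 4, (-1, 1): 5, (0, 1): 6, (1, 1): 7}

def freeman_chain(contour):
    """Freeman chain code of an 8-connected contour given as ordered (x, y) points."""
    pts = np.asarray(contour, dtype=int)
    steps = np.roll(pts, -1, axis=0) - pts
    return [FREEMAN[(int(dx), int(dy))] for dx, dy in steps]

def edit_distance(a, b):
    """Levenshtein distance between two chain codes."""
    n, m = len(a), len(b)
    d = np.zeros((n + 1, m + 1), dtype=int)
    d[:, 0] = np.arange(n + 1)
    d[0, :] = np.arange(m + 1)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d[i, j] = min(d[i - 1, j] + 1, d[i, j - 1] + 1,
                          d[i - 1, j - 1] + (a[i - 1] != b[j - 1]))
    return d[n, m]

def k_medoids(dist, k, iters=20, seed=0):
    """Plain k-medoids over a precomputed distance matrix; the medoid of
    each cluster is returned and used as that cluster's prototype."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(dist.shape[0], size=k, replace=False)
    for _ in range(iters):
        labels = np.argmin(dist[:, medoids], axis=1)     # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.where(labels == c)[0]
            within = dist[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]  # most central member
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    return medoids

# Usage: given a list of chain codes, build the pairwise edit-distance matrix
# and select, e.g., 50 medoids as prototypes.
```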

4 Estimation of the Real Height

The real height estimation proposed in this work is based on sampling relative heights and comparing them with a known height distribution. For example, it is well known that the height of the Cuban population follows a normal distribution with \(\mu = 1.68\,\mathrm{m}\) for men and \(\mu = 1.56\,\mathrm{m}\) for women [14]; the overall mean is 1.61 m.

Once the image horizon is found, we propose to store a certain number of relative heights computed from good silhouettes, and then to estimate the parameters of their normal distribution. Under the assumption that relative heights follow the same distribution as real heights, we match the two distributions by their means. The real height of a person is then the value in the real-height distribution that corresponds to that person's relative height.
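
One plausible reading of matching the two distributions by their means is to scale each relative height by the ratio of the means, as in the sketch below (the sample values are illustrative).

```python
import numpy as np

MU_REAL = 1.61   # mean of the known real-height distribution, in metres [14]

def real_height(rel_height, rel_sample, mu_real=MU_REAL):
    """Map a relative height (height-line slope) to metres by matching the
    mean of the observed relative-height distribution to the mean of the
    known real-height distribution."""
    mu_rel = np.mean(rel_sample)
    return rel_height * (mu_real / mu_rel)

# rel_sample accumulates the slopes of the good silhouettes seen so far.
rel_sample = [0.350, 0.362, 0.341, 0.355, 0.348]
print(f"{real_height(0.362, rel_sample):.2f} m")    # ~1.66 m for this toy sample
```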

5 Experimental Results

There is no unified criterion in the literature for evaluating results on this topic. Moreover, there are no public datasets or real height data available for comparison with other works. Authors usually present results on their own databases, recorded under controlled conditions [5, 15].

Fig. 5. Results of the proposed method on a video from the PETS 2009 dataset [16]: (a) results without the silhouette quality step and (b) results with the silhouette quality step. The green line is the horizon estimate and the pink marks on the left of each image are the horizon votes. Bounding boxes are shown for each tracked object: blue if the blob silhouette is selected for estimation (good human silhouette) and pink otherwise. (Best viewed in color)

Assessing the accuracy of the horizon estimation is somewhat subjective, since the exact value depends on expert judgment. Nevertheless, in most of the analyzed videos the estimated value falls within a close neighborhood of the real apparent horizon. The influence of introducing the silhouette quality process can be observed in Fig. 5: the horizon estimate in Fig. 5b is better than that in Fig. 5a, and the vote distribution is more compact. Note that in Fig. 5b the bounding boxes corresponding to two overlapping persons, or to an incomplete person, were classified as bad and not used for estimation.

To test the height estimation we captured several videos in uncontrolled scenarios. The scene is a parking lot where people walk towards the camera (see Fig. 6), and the camera resolution is \(625 \times 500\) pixels. The first 40 people in the video were used to estimate the horizon and the relative-height distribution. The height estimates of another 30 people were then compared with their measured heights; people were measured wearing shoes. The results, split into men and women, are shown in Table 1.

Fig. 6. Video sequence from our dataset with our height estimations.

Table 1. Results of the estimation algorithm for (1) men and (2) women. E.H.: estimated height; M.H.: measured height; Diff: difference between them.

The automatic estimates for the 30 test subjects differ from the real values by at most 4 cm, which corresponds to \(100\%\) accuracy if an error of \({\pm}5\,\mathrm{cm}\) is allowed. Within a 1 cm difference there are 16 out of 30 correct measurements (\(53\%\)). The estimated heights ranged from 1.60 m to 1.82 m. The overall mean error was 1.53 cm; the mean error for men was 1.35 cm and for women 1.90 cm. In our model, 90% of the variability of the estimated values is explained by the linear relation with the observed values. The F statistic is 166.81 and is statistically significant with p = 0.0001, so there is evidence to reject the null hypothesis and a linear regression model can be established. The method was tested on a PC with an Intel Core i7 processor and 8 GB of RAM. Processing one frame takes 25 ms, so we can process 40 frames per second, achieving real-time performance.

Some further considerations help to interpret these results. First, human walking is a cyclic sequence of different poses: when the legs are apart the apparent height is lower than when they are together, so for estimation we used the values corresponding to the pose in which the person appears most upright. Second, the height of the blobs is affected by footwear and hairstyle, which explains why the error for women is slightly larger than for men: some of the women in the test set wore high-heeled shoes or had their hair up, and the real heights were measured on a different date than the test videos were captured. Finally, despite the quality step some errors can still occur: if an incomplete silhouette is accepted as good, it lowers the mean of the relative-height distribution and the method tends to overestimate heights.

The main limitation of the present work is that it needs a large number of people to obtain a good approximation of the population height distribution; if the number is too small, there is no guarantee that the sample mean will be close to the population mean. Nevertheless, the distribution parameters can be updated with the relative height of each new tracked person that appears in the scene. In addition, the viewing angle must be oblique enough for the silhouette height to change noticeably as its position changes.

6 Conclusions

In this work we presented a method to estimate the height of pedestrians using uncalibrated cameras. The method does not require information about the scene or the camera parameters. The height estimation builds on existing techniques, but this work incorporates a step for evaluating human silhouette quality, which helps to overcome segmentation problems that affect the results. The accuracy of the proposed method was shown experimentally in outdoor environments under uncontrolled lighting conditions, achieving a mean error of 1.53 cm for 30 test subjects from our database. We believe that real height estimation can be a very useful feature for people tracking, and in future work it could be used as a step towards camera calibration.