Refinement of human silhouette segmentation in omni-directional indoor videos

https://doi.org/10.1016/j.cviu.2014.06.011

Highlights

  • A methodology for refining the segmentation of human silhouettes is proposed.

  • Video is acquired indoors from a ceiling-based omni-directional camera.

  • A calibrated camera model combined with geometry-based reasoning is used.

  • The number of false positives in human activity segmentation is significantly reduced.

  • The algorithm works in real time with input from any fisheye camera.

Abstract

In this paper, we present a methodology for refining the segmentation of human silhouettes in indoor videos acquired by fisheye cameras. The methodology is based on a fisheye camera model that employs a spherical optical element and central projection. The parameters of the camera model are determined only once (during calibration), using the correspondence of a number of user-defined landmarks, both in real world coordinates and on a captured video frame. Subsequently, each pixel of the video frame is inversely mapped to its direction of view in the real world, and the relevant data are stored in look-up tables for fast use in real-time video processing. The proposed fisheye camera model enables the inference of possible real world positions and, conditionally, the height and width of a segmented cluster of pixels in the video frame. In this work we utilize the calibrated camera model for simple geometric reasoning that corrects gaps and errors in the segmentation of the human figure, detects segmented human silhouettes inside and outside the room, and rejects segmentation that corresponds to non-human activity. A unique label is assigned to each refined silhouette according to its estimated real world position and appearance, and the trajectory of each silhouette in real world coordinates is estimated. Experimental results are presented for a number of video sequences, in which the number of false positive pixels (with respect to human silhouette segmentation) is substantially reduced by the proposed geometry-based segmentation refinement.

Introduction

The field of camera-based human activity monitoring has gained significant interest in recent years, in the context of developing ambient assisted living environments. A number of approaches exist in the research literature, based either on 3D human models or on local image descriptors, exploiting both spatial and temporal information. Detailed reviews of this field can be found in [1], [2], [3], [4]. The majority of approaches to vision-based recognition of human motion and action utilize descriptors derived from segmented human silhouettes. For instance, [3] lists 14 publications that use silhouettes as the abstraction level for vision-based human motion capture. The survey in [1] states that segmented silhouettes are used for human motion detection with 3D human models or image descriptors, but that segmentation artifacts limit the performance of the corresponding methods. In [2] the use of silhouettes for human action recognition with image descriptors, as well as space-time volumes, is surveyed through a number of listed works. These algorithms perform well on datasets consisting of very short videos of single humans performing a single task at a time. Examples of such datasets include the INRIA XMAS [5], the Weizmann [6], the KTH [7], the CMU MoBo database [8] and the HumanEva [9]. In these video segments, no other action is usually visible in the background, so the segmentation of the human silhouettes is rather easy, defect free and unambiguous. Only a few databases contain challenging video sequences (changing illumination in the background, multiple persons close to each other or interacting), such as the HOHA database [10].

It follows from the above discussion that improving human silhouette segmentation will improve the quality of human motion/action recognition. The contribution of this work is a methodology that isolates segmented human silhouettes inside a designated area of fisheye video sequences from other, irrelevant segmentation, and eliminates artifacts and outliers. The video data used in this research are acquired indoors from a fixed fisheye camera installed at the ceiling of the living environment. The proposed algorithm uses a novel fisheye camera model that enables reasoning based on real world geometry to correct and enhance the segmentation of the human figures. The proposed algorithm makes no use of models of the target object (the human silhouette) or of local image features.

The input of the proposed algorithm is the result of an initial video segmentation based on background modeling and subtraction. Several methodologies for background modeling exist in the literature. For instance, in [11] the background model is simply defined as the previous frame and global thresholding is employed to extract the foreground. This method is very simple to implement, but it is prone to a number of segmentation errors. The background can also be modeled by median filtering [12] over a predefined number of recent frames held in a buffer. This approach requires significant computational and memory resources and cannot be executed in real time. To alleviate the increased computational requirements, a class of recursive background modeling algorithms has been proposed in the literature. These algorithms update the background incrementally. Simple and efficient members of this class are the approximated median filtering method [13] and the running Gaussian average method [14]. A comprehensive review of background modeling algorithms for foreground detection is presented in [15]. A popular methodology is the Mixture of Gaussians, initially described for video sequences by Stauffer and Grimson [16], in which the values of each pixel are modeled as a linear combination of weighted Gaussian probability distributions. However, this method is also computationally expensive. In this work we perform video segmentation using the illumination-sensitive background modeling approach, as originally described in [17] and modified in [18], although any other video segmentation algorithm could be employed. This algorithm was selected for its simplicity and its efficient handling of illumination changes.
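
To illustrate the recursive class mentioned above, the following is a minimal sketch of the approximated median filtering update of [13] for grayscale frames; the step size and foreground threshold are illustrative choices, not values taken from the paper.

```python
import numpy as np

def update_background(background, frame, step=1):
    """Nudge each background pixel one step toward the current frame;
    the estimate converges to the temporal median of the pixel values."""
    bg = background.astype(np.int16)
    fr = frame.astype(np.int16)
    bg += step * np.sign(fr - bg)
    return np.clip(bg, 0, 255).astype(np.uint8)

def foreground_mask(background, frame, threshold=30):
    """Binary mask of pixels that deviate from the background model."""
    diff = np.abs(frame.astype(np.int16) - background.astype(np.int16))
    return diff > threshold
```

Only a single background frame needs to be kept in memory, which is what makes this family of recursive methods attractive for real-time processing.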

The video sequences used in this work are captured by a hemispheric camera, also known as an omni-directional or fisheye camera, with a 180° field of view (FoV). This type of camera is increasingly used in robotic and video surveillance applications [19], [20], since it allows constant monitoring of all directions with a single camera. In [21], [22], [23] fisheye camera calibration is reported using high-degree polynomials to emulate the strong radial and/or tangential deformation introduced by the fisheye lens. In [24] the authors present a methodology for correcting the distortions induced by the fisheye lens. In [25], [26] a well-established calibration method for omni-directional cameras is proposed, which utilizes a standard chess pattern imaged at arbitrary orientations, without requiring point input by the user. In this paper we present a very efficient camera model that extends our previously proposed fisheye model, which used only 3 parameters [27] with exhaustive-search calibration. Subsequently, we utilize the proposed inverse fisheye model to refine the segmentation of moving humans and to eliminate non-human activity, such as window reflections, sudden illumination changes, small objects and moving doors, as well as human silhouettes outside the designated room. The position of a human silhouette (or any other segmented foreground object) is estimated accurately under the assumption that its base (the area of the surface touching the floor) is small, as for a standing, walking or sitting human. Finally, the refined segmented silhouettes are uniquely labeled using spatiotemporal information concerning both real world position and RGB appearance, and the trajectory of each silhouette in real world coordinates is also estimated.
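
To make the small-base assumption concrete, the sketch below intersects the viewing ray of a blob's base pixel with the floor plane. It assumes per-pixel look-up tables theta_lut and phi_lut (azimuth and angle from the downward optical axis) and a known ceiling height; these names and conventions are illustrative, not the paper's notation.

```python
import numpy as np

def floor_position(theta, phi, camera_height):
    """Intersect a viewing ray with the floor plane z = 0 for a ceiling
    camera at (0, 0, camera_height) whose optical axis points down."""
    r = camera_height * np.tan(phi)  # horizontal distance from the camera
    return r * np.cos(theta), r * np.sin(theta)

def silhouette_position(base_pixel, theta_lut, phi_lut, camera_height):
    """Real-world (x, y) of a segmented blob, read at its base pixel;
    accurate only while the blob's floor-contact area is small."""
    i, j = base_pixel
    return floor_position(theta_lut[i, j], phi_lut[i, j], camera_height)
```

A blob whose estimated (x, y) falls outside the designated room boundary can then be rejected directly, which is one way segmentation outside the supervised area may be discarded.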

The rest of the paper is organized as follows. In Section 2, the overall architecture is presented and the forward and inverse modeling of the fisheye camera is described. The proposed methodology for refining the human silhouette segmentation, using reasoning based on the geometric relations between the binary connected components, is presented in Section 2.4. Experimental results are presented in Section 3, and the proposed algorithm and future work are discussed in the concluding Section 4.

Section snippets

Overall description and block diagram of the proposed methodology

The main characteristic of the fisheye camera is its ability to cover a field of view of 180°. The proposed methodology is based on a parametric model of image formation, so that any real-world point (x, y, z) can be associated with a frame pixel (i, j). Furthermore, a pixel (i, j) in the video frame can be associated with a direction of view, defined by two angles: the azimuth θ and the elevation φ. The parameters of the fisheye camera model are determined only once (calibration), using manually
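
A minimal sketch of how such per-pixel look-up tables might be precomputed is given below; the equidistant projection r = f·φ stands in here for the paper's spherical-element model, and the image centre (ci, cj) and scale factor f are assumed to be outputs of the calibration step.

```python
import numpy as np

def build_direction_luts(rows, cols, ci, cj, f):
    """Per-pixel azimuth/elevation tables for fast inverse mapping."""
    ii, jj = np.meshgrid(np.arange(rows), np.arange(cols), indexing="ij")
    di, dj = ii - ci, jj - cj
    theta_lut = np.arctan2(di, dj)   # azimuth of each pixel
    phi_lut = np.hypot(di, dj) / f   # angle from the optical axis
    return theta_lut, phi_lut
```

Once the tables are built, mapping any pixel to its viewing direction costs two array reads, which is what makes the geometric refinement feasible in real time.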

Results

For the experiments, video sequences were acquired using the Mobotix Q24 hemispheric camera, installed on the ceiling of the imaged university room. The resolution of each frame is 480 × 640 pixels, the frame rate was set to 25 fps and the duration of each video was 45 s to 2 min. Specific details are given below for each video sequence:

  • Video 1: a single person enters the supervised room and exits from a different location. Activity: turning the light on, door opening and closing,

Conclusions

In this paper we presented a methodology that distinguishes segmented human silhouettes in fisheye video sequences from other, irrelevant segmentation, as well as from segmented silhouettes outside a predefined area. The proposed algorithm is based on real world geometry cues derived from a calibrated model of a fisheye camera.

Regarding the complexity of the proposed methodology, the performed experiments have shown that the number of connected binary objects in the segmented

Acknowledgments

The authors would like to thank the European Union (European Social Fund – ESF) and Greek national funds, through the Operational Program “Education and Lifelong Learning” of the National Strategic Reference Framework (NSRF) – Research Funding Program: Thalis – “Interdisciplinary Research in Affective Computing for Biological Activity Recognition in Assistive Environments”, for financially supporting this work.

References (37)

  • I. Laptev et al., Learning Realistic Human Actions from Movies, in: IEEE Conf. on Computer Vision and Pattern Recognition, 2008.
  • J. Willems, G. Debard, B. Bonroy, B. Vanrumste, T. Goedemé, How to detect human fall in video? An overview, in: ..., 2009.
  • R. Cucchiara et al., Detecting moving objects, ghosts, and shadows in video streams, IEEE Trans. Pattern Anal. Mach. Intell., 2003.
  • N. McFarlane et al., Segmentation and tracking of piglets in images, Mach. Vision Appl., 1995.
  • C. Wren et al., Pfinder: real-time tracking of the human body, IEEE Trans. Pattern Anal. Mach. Intell., 1997.
  • T. Bouwmans et al., Background modeling using mixture of Gaussians for foreground detection – a survey, Recent Pat. Comput. Sci., 2008.
  • C. Stauffer, W. Grimson, Adaptive background mixture models for real-time tracking, in: Proceedings of the conference...
  • F.C. Cheng et al., Implementation of illumination-sensitive background modeling approach for accurate moving object detection, IEEE Trans. Broadcasting, 2011.

    This paper has been recommended for acceptance by Nikos Paragios.
