Direct Model-Based Tracking of 3D Object Deformations in Depth and Color Video

International Journal of Computer Vision

Abstract

The tracking of deformable objects in video data is a demanding research topic due to inherent ambiguity problems, which can only be resolved using additional assumptions about the deformation. Image feature points, commonly used to approach the deformation problem, provide only sparse information about the scene at hand. In this paper, a tracking approach for deformable objects in color and depth video is introduced that does not rely on feature points or optical flow data but employs all the available input image information to find a suitable deformation for the data at hand. A versatile NURBS-based deformation space is defined for arbitrarily complex triangle meshes, decoupling the complexity of the object surface from the complexity of the deformation. An efficient optimization scheme is introduced that is able to calculate results in real time (25 Hz). Extensive tests of the algorithm and its features on synthetic and real data show the reliability of this approach.


References

  • Alizadeh, F., & Goldfarb, D. (2001). Second-order cone programming. Mathematical Programming, 95, 3–51.

  • Auger, A., Brockhoff, D., & Hansen, N. (2010). Benchmarking the (1,4)-CMA-ES with mirrored sampling and sequential selection on the noisy BBOB-2010 testbed. In GECCO workshop on Black-Box optimization benchmarking (BBOB’2010) (pp. 1625–1632). New York: ACM.

  • Bardinet, E., Cohen, L. D., & Ayache, N. (1998). A parametric deformable model to fit unstructured 3d data. Computer Vision and Image Understanding, 71(1), 39–54.

  • Bartczak, B., & Koch, R. (2009). Dense depth maps from low resolution time-of-flight depth and high resolution color views. In Lecture notes in computer science: Vol. 5876. ISVC (2) (pp. 228–239). Berlin: Springer.

  • Bartoli, A., & Zisserman, A. (2004). Direct estimation of non-rigid registration. In British machine vision conference.

  • Bascle, B., & Blake, A. (1998). Separability of pose and expression in facial tracking and animation. In Proceedings of the sixth international conference on computer vision, ICCV ’98 (p. 323). Washington: IEEE Comput. Soc.

  • Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering non-rigid 3d shape from image streams. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2690–2696). Washington: IEEE Comput. Soc.

  • Cagniart, C., Boyer, E., & Ilic, S. (2009). Iterative mesh deformation for dense surface tracking. In 12th international conference on computer vision workshops.

  • Cai, Q., Gallup, D., Zhang, C., & Zhang, Z. (2010). 3d deformable face tracking with a commodity depth camera. In Proceedings of the 11th European conference on computer vision: Part III, ECCV'10 (pp. 229–242). Berlin: Springer.

  • Chen, S. E., & Williams, L. (1993). View interpolation for image synthesis. In Proceedings of the 20th annual conference on computer graphics and interactive techniques, SIGGRAPH’93 (pp. 279–288). New York: ACM.

  • Cohen, L. D., & Cohen, I. (1991). Finite element methods for active contour models and balloons for 2d and 3d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15, 1131–1147.

  • Costeira, J., & Kanade, T. (1994). A multi-body factorization method for motion analysis (Tech. Rep. CMU-CS-TR-94-220). Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.

  • de Aguiar, E., Theobalt, C., Stoll, C., & Seidel, H. P. (2007). Marker-less deformable mesh tracking for human shape and motion capture. In IEEE international conference on computer vision and pattern recognition (CVPR), Minneapolis, USA (pp. 1–8). New York: IEEE Press.

  • de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H. P., & Thrun, S. (2008). Performance capture from sparse multi-view video. In ACM transactions on graphics, Proc. of ACM SIGGRAPH (Vol. 27).

  • Del Bue, A., & Agapito, L. (2006). Non-rigid stereo factorization. International Journal of Computer Vision, 66, 193–207.

  • Del Bue, A., Smeraldi, F., & Agapito, L. (2007). Non-rigid structure from motion using ranklet-based tracking and non-linear optimization. Image and Vision Computing, 25(3), 297–310.

  • Delingette, H., Hebert, M., & Ikeuchi, K. (1991). Deformable surfaces: a free-form shape representation. In Geometric methods in computer vision: Vol. 1570. Proc. SPIE (pp. 21–30).

  • Fayad, J., Del Bue, A., Agapito, L., & Aguiar, P. (2009). Non-rigid structure from motion using quadratic deformation models. In British machine vision conference (BMVC), London, UK.

  • Fayad, J., Agapito, L., & Bue, A. D. (2010). Piecewise quadratic reconstruction of non-rigid surfaces from monocular sequences. In Proceedings of the 11th European conference on computer vision: Part IV, ECCV’10 (pp. 297–310). Berlin: Springer.

  • Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms (pp. 75–102). Berlin: Springer.

  • Hartley, R. I., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press. ISBN:0521623049.

  • Hilsmann, A., & Eisert, P. (2009). Realistic cloth augmentation in single view video. In Vision, modeling, and visualization workshop 2009, Braunschweig, Germany.

  • Horn, B. K. P., & Harris, J. G. (1991). Rigid body motion from range image sequences. CVGIP. Image Understanding, 53, 1–13.

  • Jaklič, A., Leonardis, A., & Solina, F. (2000). Computational imaging and vision: Vol. 20. Segmentation and recovery of superquadrics. Dordrecht: Kluwer. ISBN 0-7923-6601-8.

  • Jordt, A., & Koch, R. (2011). Fast tracking of deformable objects in depth and colour video. In Proceedings of the British machine vision conference, BMVC 2011. British Machine Vision Association.

  • Kim, Y. M., Theobalt, C., Diebel, J., Kosecka, J., Micusik, B., & Thrun, S. (2009). Multi-view image and tof sensor fusion for dense 3d reconstruction. In IEEE workshop on 3-D digital imaging and modeling (3DIM), Kyoto, Japan (pp. 1542–1549). New York: IEEE Press.

  • Koch, R. (1993). Dynamic 3-d scene analysis through synthesis feedback control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6), 556–568.

  • Mcinerney, T., & Terzopoulos, D. (1993). A finite element model for 3d shape reconstruction and nonrigid motion tracking. In 4th international conference on in computer vision, ICCV (pp. 518–523).

  • Muñoz, E., Buenaposada, J. M., & Baumela, L. (2009). A direct approach for efficiently tracking with 3d morphable models. In ICCV (pp. 1615–1622). New York: IEEE Press.

  • Netravali, A., & Salz, J. (1985). Algorithms for estimation of three-dimensional motion. AT&T Bell Laboratories Technical Journal, 64(2).

  • Osher, S., & Sethian, J. A. (1988). Fronts propagating with curvature dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1), 12–49.

  • Ostermeier, A., & Hansen, N. (1999). An evolution strategy with coordinate system invariant adaptation of arbitrary normal mutation distributions within the concept of mutative strategy parameter control. In Proceedings of the genetic and evolutionary computation conference (GECCO) (pp. 902–909). San Mateo: Morgan Kaufmann.

  • Piegl, L., & Tiller, W. (1997). The NURBS book (2nd ed.). Berlin: Springer.

  • Pilet, J., Lepetit, V., & Fua, P. (2008). Fast non-rigid surface detection, registration and realistic augmentation. International Journal of Computer Vision, 76, 109–122.

  • Rosenhahn, B., Kersting, U., Powell, K., Klette, R., Klette, G., & Seidel, H. P. (2007). A system for articulated tracking incorporating a clothing model. Machine Vision and Applications, 18, 25–40.

  • Russell, C., Fayad, J., & Agapito, L. (2011). Energy based multiple model fitting for non-rigid structure from motion. In IEEE conference on computer vision and pattern recognition.

  • Salzmann, M., Hartley, R., & Fua, P. (2007). Convex optimization for deformable surface 3-d tracking. In ICCV’07 (pp. 1–8).

  • Salzmann, M., Lepetit, V., & Fua, P. (2007). Deformable surface tracking ambiguities. In IEEE international conference on computer vision and pattern recognition (CVPR).

  • Schiller, I., Beder, C., & Koch, R. (2008). Calibration of a PMD camera using a planar calibration object together with a multi-camera setup. In The international archives of the photogrammetry, remote sensing and spatial information sciences, Beijing, China (Vol. XXXVII, pp. 297–302). XXI. Part B3a, ISPRS Congress.

  • Shen, S., Zheng, Y., & Liu, Y. (2008). Deformable surface stereo tracking-by-detection using second order cone programming. In International conference on computer vision and pattern recognition (CVPR) (pp. 1–4). New York: IEEE Press.

  • Shen, S., Ma, W., Shi, W., & Liu, Y. (2010). Convex optimization for nonrigid stereo reconstruction. IEEE Transactions on Image Processing, 19, 782–794.

  • Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR (pp. 1297–1304). New York: IEEE Press.

  • Torresani, L., Hertzmann, A., & Bregler, C. (2003). Learning non-rigid 3d shape from 2d motion. In Proceedings of the 17th annual conference on neural information processing systems (NIPS) (pp. 1555–1562). Cambridge: MIT Press.

  • Taylor, J., Jepson, A. D., & Kutulakos, K. N. (2010). Non-rigid structure from locally-rigid motion. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2761–2768).

  • Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9, 137–154.

  • Torresani, L., Yang, D. B., Alexander, E. J., & Bregler, C. (2001). Tracking and modeling non-rigid objects with rank constraints. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 493–500).

  • Vedula, S., Baker, S., Collins, R., & Kanade, T. (1999). Three-dimensional scene flow. In Proceedings of the 7th international conference on computer vision, ICCV (pp. 722–726). New York: IEEE Press.

  • Yamamoto, M., Boulanger, P., Beraldin, J. A., & Rioux, M. (1993). Direct estimation of range flow on deformable shape from a video rate range camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(1), 82–89. doi:10.1109/34.184776.

  • Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), 119–152. doi:10.1007/BF01427149.

  • Zhu, J., Hoi, S. C., Xu, Z., & Lyu, M. R. (2008). An effective approach to 3d deformable surface tracking. In Proceedings of the 10th European conference on computer vision: Part III, ECCV ’08 (pp. 766–779). Berlin: Springer.

Author information

Corresponding author

Correspondence to Andreas Jordt.

Additional information

This work was supported by the EU Interreg A4 Project “Intelligent Robots for Handling flexible Objects” (IRFO), 33-1.2-09.

Appendices

Appendix A: Occlusion Handling

The ability of the algorithm to cope with holes in the input images (see Sect. 3.2) makes it possible to implement a straightforward occlusion handling: pixels that are assumed to occlude the actual object can be removed from the input data without leaving infeasible data to the main tracking algorithm. Given the current movement of each vertex, and under the assumption that its speed will at most double in the current frame, it is possible to calculate a 3D region, and a corresponding image area, for each vertex in which it (or its projection, respectively) can be found in the current frame. Given the color error \(f_c\) for the vertices, we assume that most of the vertices have a color deviation of at most twice the color error.

These assumptions allow the formulation of a two-step occlusion classifier. In a first step, each pixel in the depth and color image is assigned one of the following states:

(1) Not classified

(2) Depth and color are in the vicinity of at least one vertex

(3) Only the depth value is in the vicinity of an object vertex but the color value does not fit

(4) The pixel seems to be part of an occluding object

Starting with every pixel set to state (1), for every vertex v in V and every pixel [x,y] in the vicinity (as defined above) of the projection of v, the following rules are processed:

  • If [x,y] is within the three-dimensional vicinity of v according to its recent movement and the color of v is in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (2).

  • If the state of [x,y] is not (2) and [x,y] is within the three-dimensional vicinity of v but the color of v is not in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (3).

  • If the state of [x,y] is not (2) and the depth value of [x,y] is outside the vicinity of v such that it lies in front of v, then set the pixel state of [x,y] to (4).

This procedure computes a classification into certain object pixels (2), uncertain pixels (3) and certain occlusion pixels (4). In a second step, every pixel classified as (3) in the vicinity of a pixel classified as (4) is also set to (4). Finally, every pixel classified as (4) is removed from the input data.
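The following Python sketch illustrates the two-step classifier. It is not the authors' implementation: the per-vertex record layout ('px', 'r_px', 'depth', 'r_3d', 'color', 'r_col') and the reduction of the 3D vicinity test to a depth-difference test are assumptions made for illustration.

```python
import numpy as np

# Pixel states of the two-step occlusion classifier.
NOT_CLASSIFIED, OBJECT, UNCERTAIN, OCCLUDER = 1, 2, 3, 4

def classify_occlusions(depth, color, vertices):
    """Label every pixel, then remove certain occluder pixels.

    depth:    (H, W) float depth image
    color:    (H, W, 3) float color image
    vertices: per-vertex records with keys 'px' (projected pixel),
              'r_px' (image-space vicinity radius), 'depth', 'r_3d'
              (depth vicinity from the doubled-speed assumption),
              'color' and 'r_col' (twice the color error) -- an
              illustrative layout, not the paper's data structure.
    """
    H, W = depth.shape
    state = np.full((H, W), NOT_CLASSIFIED, dtype=np.uint8)

    # First step: apply the three rules for every vertex/pixel pair.
    for v in vertices:
        x0, y0 = v['px']
        r = int(np.ceil(v['r_px']))
        for y in range(max(0, y0 - r), min(H, y0 + r + 1)):
            for x in range(max(0, x0 - r), min(W, x0 + r + 1)):
                depth_ok = abs(depth[y, x] - v['depth']) <= v['r_3d']
                color_ok = np.linalg.norm(color[y, x] - v['color']) <= v['r_col']
                if depth_ok and color_ok:
                    state[y, x] = OBJECT                          # state (2)
                elif state[y, x] != OBJECT and depth_ok:
                    state[y, x] = UNCERTAIN                       # state (3)
                elif state[y, x] != OBJECT and \
                        depth[y, x] < v['depth'] - v['r_3d']:
                    state[y, x] = OCCLUDER                        # state (4)

    # Second step: uncertain pixels adjacent to certain occluders
    # become occluders as well.
    occ = state == OCCLUDER
    near = np.zeros_like(occ)
    near[1:, :] |= occ[:-1, :]; near[:-1, :] |= occ[1:, :]
    near[:, 1:] |= occ[:, :-1]; near[:, :-1] |= occ[:, 1:]
    state[(state == UNCERTAIN) & near] = OCCLUDER

    # Remove certain occluders from the input, leaving holes that the
    # tracking algorithm already tolerates (Sect. 3.2).
    cleaned = depth.copy()
    cleaned[state == OCCLUDER] = np.nan
    return cleaned, state
```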

Appendix B: Training of Color/Depth Weights

A rather clumsy property of Jordt and Koch (2011) is the manually chosen weighting of the color and depth errors. Because of the novel formulation of (10) and (13), the depth error function does not have to be weighted, so the color weight \(\lambda_c\) (see Eq. (13)) is the only hyperparameter in this method. The selection of color/depth weights is a general problem that appears whenever errors from different domains are fused into one fitness value, due to the lack of a common metric. Without additional information, a depth error value cannot be compared to an error value in the color space. A fused error function that simply adds up these error values in their native domains is likely to disregard one of the domains at hand.

A common tool to handle this problem is to define a certainty measure in each domain via the variance of a Gaussian distribution (Kim et al. 2009) or a cost function derived from it (Bartczak and Koch 2009). Once every color and depth measurement is equipped with a cost or a distribution, a sound solution can be formulated by calculating the maximum likelihood of the given measurements, or the cost minimum respectively. Although this is based on statistics, it can be shown that an optimum calculated this way is equivalent to the minimum of the weighted sum of squared error functions, given the correct weight.
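To make this equivalence explicit, consider a sketch assuming independent zero-mean Gaussian noise with variances \(\sigma_d^2\) and \(\sigma_c^2\) on per-pixel depth and color residuals \(e_d\) and \(e_c\) (this residual notation is introduced here for illustration). The negative log-likelihood is

$$ -\log \prod_{i} e^{-\frac{e_d(i)^2}{2\sigma_d^2}} \prod_{j} e^{-\frac{e_c(j)^2}{2\sigma_c^2}} = \sum_i \frac{e_d(i)^2}{2\sigma_d^2} + \sum_j \frac{e_c(j)^2}{2\sigma_c^2} \;\propto\; \sum_i e_d(i)^2 + \lambda_c \sum_j e_c(j)^2, \quad \lambda_c = \frac{\sigma_d^2}{\sigma_c^2}, $$

so maximizing the likelihood is the same as minimizing the weighted sum of squared errors with the weight \(\lambda_c = \sigma_d^2 / \sigma_c^2\).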

Though it is possible to define reasonable variances for the Kinect color and depth measurements, the input data in the experiments of Sect. 6 show that it is more appropriate to weight color and depth information depending on the input data rather than to define static variances. Hence, we adapt the weight \(\lambda_c\) according to the input data.

\(f_c(\cdot)\) and \(f_d(\cdot)\) both have a linear characteristic with respect to the number of “falsely matched” pixels, and both functions yield 0 for a perfect fit on perfect input data. This means that for the perfect deformation \(P^*\), the error caused by noise in the depth and color images is \(f_c(P^*)\) and \(f_d(P^*)\), respectively. Following the principle of calculating a maximum likelihood based on variances derived from these noise ratios, the resulting weight \(\lambda_c\) should equal \(\frac{f_d(P^*)}{f_c(P^*)}\).

Although this definition can be statistically deduced, it is based on the assumption of \(P^*\) being the perfect match, i.e. for every frame it is assumed that the preceding frames matched correctly. So in a situation as depicted in Fig. 7 (left column), in which the color information is valuable whereas the depth information is rather useless, \(f_c(P^*)\) is likely to yield increased error values even for a good solution \(P^*\), while \(f_d(\cdot)\) remains constantly low. This would cause \(\lambda_c\) to decrease, leading to a fit much less influenced by the color, causing \(\lambda_c\) to decrease further. A similar example for an increasing \(\lambda_c\) can be found in Fig. 18, depicting the tracking of a white object on a white background.

To counter this behavior, a second aspect is considered in addition to the noise levels: the information content of the error values. Due to the optimization process, \(f\) has already been evaluated in the vicinity of \(P^*\). Let \(\mathcal{P}\) be the set of individuals used by CMA-ES to calculate \(P^*\), and let

$$ \mathrm{inf}_d(\mathcal{P}) := \sum_{P \in \mathcal{P}} \frac{f_d(P)}{\vert\mathcal{P}\vert\, f_d(P^*)}; \qquad \mathrm{inf}_c(\mathcal{P}) := \sum_{P \in \mathcal{P}} \frac{f_c(P)}{\vert\mathcal{P}\vert\, f_c(P^*)}. $$
(15)

The resulting \(\mathrm{inf}_c\) and \(\mathrm{inf}_d\) are a measure of how valuable the error values were in determining the current best fit with respect to their “noise levels”. Hence, the ratio \(\frac{\mathrm{inf}_c}{\mathrm{inf}_d}\) is a measure of how valuable the color error is in relation to the depth error, and the color weight is updated after each frame, such that:

$$ \lambda_c \rightarrow\frac{n-1}{n} \lambda_c + \frac{1}{n} \frac {\mathrm{inf}_c}{\mathrm{inf}_d}, $$
(16)

where \(n\) is the number of the current frame.
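A minimal Python sketch of this adaptive weighting follows, assuming the CMA-ES population and the two error functions are available as plain Python objects; the function name and call interface are illustrative, not the paper's code.

```python
def update_color_weight(lambda_c, n, population, f_d, f_c, f_d_star, f_c_star):
    """One frame of the adaptive color-weight update, Eqs. (15)-(16).

    lambda_c:   current color weight
    n:          number of the current frame (1-based)
    population: CMA-ES individuals used to determine the best fit P*
    f_d, f_c:   callables returning the depth / color error of an individual
    f_d_star, f_c_star: errors f_d(P*), f_c(P*) of the current best fit
    """
    m = len(population)
    # Eq. (15): mean error of the population relative to the noise level f(P*).
    inf_d = sum(f_d(P) for P in population) / (m * f_d_star)
    inf_c = sum(f_c(P) for P in population) / (m * f_c_star)
    # Eq. (16): running mean of the information ratio over the frames so far.
    return (n - 1) / n * lambda_c + (inf_c / inf_d) / n
```

Note that the \(1/\vert\mathcal{P}\vert\) normalization cancels in the ratio \(\mathrm{inf}_c/\mathrm{inf}_d\), so the update is insensitive to the population size; Eq. (16) then accumulates the per-frame ratios as a running mean over all frames processed so far.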

Cite this article

Jordt, A., Koch, R. Direct Model-Based Tracking of 3D Object Deformations in Depth and Color Video. Int J Comput Vis 102, 239–255 (2013). https://doi.org/10.1007/s11263-012-0572-1
