Abstract
Tracking deformable objects in video data is a demanding research topic due to its inherent ambiguities, which can only be resolved using additional assumptions about the deformation. Image feature points, commonly used to approach the deformation problem, provide only sparse information about the scene at hand. In this paper, a tracking approach for deformable objects in color and depth video is introduced that does not rely on feature points or optical flow data but employs all the available input image information to find a suitable deformation for the data at hand. A versatile NURBS-based deformation space is defined for arbitrarily complex triangle meshes, decoupling the complexity of the object surface from the complexity of the deformation. An efficient optimization scheme is introduced that calculates results in real time (25 Hz). Extensive tests of the algorithm and its features on synthetic and real data demonstrate the reliability of this approach.
References
Alizadeh, F., & Goldfarb, D. (2001). Second-order cone programming. Mathematical Programming, 95, 3–51.
Auger, A., Brockhoff, D., & Hansen, N. (2010). Benchmarking the (1,4)-CMA-ES with mirrored sampling and sequential selection on the noisy BBOB-2010 testbed. In GECCO workshop on Black-Box optimization benchmarking (BBOB’2010) (pp. 1625–1632). New York: ACM.
Bardinet, E., Cohen, L. D., & Ayache, N. (1998). A parametric deformable model to fit unstructured 3d data. Computer Vision and Image Understanding, 71(1), 39–54.
Bartczak, B., & Koch, R. (2009). Dense depth maps from low resolution time-of-flight depth and high resolution color views. In Lecture notes in computer science: Vol. 5876. ISVC (2) (pp. 228–239). Berlin: Springer.
Bartoli, A., & Zisserman, A. (2004). Direct estimation of non-rigid registration. In British machine vision conference.
Bascle, B., & Blake, A. (1998). Separability of pose and expression in facial tracking and animation. In Proceedings of the sixth international conference on computer vision, ICCV ’98 (p. 323). Washington: IEEE Comput. Soc.
Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering non-rigid 3d shape from image streams. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2690–2696). Washington: IEEE Comput. Soc.
Cagniart, C., Boyer, E., & Ilic, S. (2009). Iterative mesh deformation for dense surface tracking. In 12th international conference on computer vision workshops.
Cai, Q., Gallup, D., Zhang, C., & Zhang, Z. (2010). 3d deformable face tracking with a commodity depth camera. In Proceedings of the 11th European conference on computer vision: Part III, ECCV’10 (Vol. 6313, pp. 229–242). Berlin: Springer.
Chen, S. E., & Williams, L. (1993). View interpolation for image synthesis. In Proceedings of the 20th annual conference on computer graphics and interactive techniques, SIGGRAPH’93 (pp. 279–288). New York: ACM.
Cohen, L. D., & Cohen, I. (1993). Finite element methods for active contour models and balloons for 2d and 3d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), 1131–1147.
Costeira, J., & Kanade, T. (1994). A multi-body factorization method for motion analysis (Tech. Rep. CMU-CS-TR-94-220). Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
de Aguiar, E., Theobalt, C., Stoll, C., & Seidel, H. P. (2007). Marker-less deformable mesh tracking for human shape and motion capture. In IEEE international conference on computer vision and pattern recognition (CVPR), Minneapolis, USA (pp. 1–8). New York: IEEE Press.
de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H. P., & Thrun, S. (2008). Performance capture from sparse multi-view video. In ACM transactions on graphics, Proc. of ACM SIGGRAPH (Vol. 27).
Del Bue, A., & Agapito, L. (2006). Non-rigid stereo factorization. International Journal of Computer Vision, 66, 193–207.
Del Bue, A., Smeraldi, F., & Agapito, L. (2007). Non-rigid structure from motion using ranklet-based tracking and non-linear optimization. Image and Vision Computing, 25(3), 297–310.
Delingette, H., Hebert, M., & Ikeuchi, K. (1991). Deformable surfaces: a free-form shape representation. In Geometric methods in computer vision: Vol. 1570. Proc. SPIE (pp. 21–30).
Fayad, J., Del Bue, A., Agapito, L., & Aguiar, P. (2009). Non-rigid structure from motion using quadratic deformation models. In British machine vision conference (BMVC), London, UK.
Fayad, J., Agapito, L., & Bue, A. D. (2010). Piecewise quadratic reconstruction of non-rigid surfaces from monocular sequences. In Proceedings of the 11th European conference on computer vision: Part IV, ECCV’10 (pp. 297–310). Berlin: Springer.
Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms (pp. 75–102). Berlin: Springer.
Hartley, R. I., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press. ISBN:0521623049.
Hilsmann, A., & Eisert, P. (2009). Realistic cloth augmentation in single view video. In Vision, modeling, and visualization workshop 2009, Braunschweig, Germany.
Horn, B. K. P., & Harris, J. G. (1991). Rigid body motion from range image sequences. CVGIP. Image Understanding, 53, 1–13.
Jaklič, A., Leonardis, A., & Solina, F. (2000). Computational imaging and vision: Vol. 20. Segmentation and recovery of superquadrics. Dordrecht: Kluwer. ISBN 0-7923-6601-8.
Jordt, A., & Koch, R. (2011). Fast tracking of deformable objects in depth and colour video. In Proceedings of the British machine vision conference, BMVC 2011. British Machine Vision Association.
Kim, Y. M., Theobalt, C., Diebel, J., Kosecka, J., Micusik, B., & Thrun, S. (2009). Multi-view image and tof sensor fusion for dense 3d reconstruction. In IEEE workshop on 3-D digital imaging and modeling (3DIM), Kyoto, Japan (pp. 1542–1549). New York: IEEE Press.
Koch, R. (1993). Dynamic 3-d scene analysis through synthesis feedback control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6), 556–568.
McInerney, T., & Terzopoulos, D. (1993). A finite element model for 3d shape reconstruction and nonrigid motion tracking. In 4th international conference on computer vision, ICCV (pp. 518–523).
Muñoz, E., Buenaposada, J. M., & Baumela, L. (2009). A direct approach for efficiently tracking with 3d morphable models. In ICCV (pp. 1615–1622). New York: IEEE Press.
Netravali, A., & Salz, J. (1985). Algorithms for estimation of three-dimensional motion. AT&T Bell Laboratories Technical Journal, 64(2).
Osher, S., & Sethian, J. A. (1988). Fronts propagating with curvature dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1), 12–49.
Ostermeier, A., & Hansen, N. (1999). An evolution strategy with coordinate system invariant adaptation of arbitrary normal mutation distributions within the concept of mutative strategy parameter control. In Proceedings of the genetic and evolutionary computation conference (GECCO) (pp. 902–909). San Mateo: Morgan Kaufmann.
Piegl, L., & Tiller, W. (1997). The NURBS book (2nd ed.). Berlin: Springer.
Pilet, J., Lepetit, V., & Fua, P. (2008). Fast non-rigid surface detection, registration and realistic augmentation. International Journal of Computer Vision, 76, 109–122.
Rosenhahn, B., Kersting, U., Powell, K., Klette, R., Klette, G., & Seidel, H. P. (2007). A system for articulated tracking incorporating a clothing model. Machine Vision and Applications, 18, 25–40.
Russell, C., Fayad, J., & Agapito, L. (2011). Energy based multiple model fitting for non-rigid structure from motion. In IEEE conference on computer vision and pattern recognition.
Salzmann, M., Hartley, R., & Fua, P. (2007). Convex optimization for deformable surface 3-d tracking. In ICCV’07 (pp. 1–8).
Salzmann, M., Lepetit, V., & Fua, P. (2007). Deformable surface tracking ambiguities. In IEEE international conference on computer vision and pattern recognition (CVPR).
Schiller, I., Beder, C., & Koch, R. (2008). Calibration of a PMD camera using a planar calibration object together with a multi-camera setup. In The international archives of the photogrammetry, remote sensing and spatial information sciences, Beijing, China (Vol. XXXVII, pp. 297–302). XXI. Part B3a, ISPRS Congress.
Shen, S., Zheng, Y., & Liu, Y. (2008). Deformable surface stereo tracking-by-detection using second order cone programming. In International conference on computer vision and pattern recognition (CVPR) (pp. 1–4). New York: IEEE Press.
Shen, S., Ma, W., Shi, W., & Liu, Y. (2010). Convex optimization for nonrigid stereo reconstruction. IEEE Transactions on Image Processing, 19, 782–794.
Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR (pp. 1297–1304). New York: IEEE Press.
Torresani, L., Hertzmann, A., & Bregler, C. (2003). Learning non-rigid 3d shape from 2d motion. In Proceedings of the 17th annual conference on neural information processing systems (NIPS) (pp. 1555–1562). Cambridge: MIT Press.
Taylor, J., Jepson, A. D., & Kutulakos, K. N. (2010). Non-rigid structure from locally-rigid motion. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2761–2768).
Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9, 137–154.
Torresani, L., Yang, D. B., Alexander, E. J., & Bregler, C. (2001). Tracking and modeling non-rigid objects with rank constraints. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 493–500).
Vedula, S., Baker, S., Collins, R., & Kanade, T. (1999). Three-dimensional scene flow. In Proceedings of the 7th international conference on computer vision, ICCV (pp. 722–726). New York: IEEE Press.
Yamamoto, M., Boulanger, P., Beraldin, J. A., & Rioux, M. (1993). Direct estimation of range flow on deformable shape from a video rate range camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(1), 82–89. doi:10.1109/34.184776.
Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), 119–152. doi:10.1007/BF01427149.
Zhu, J., Hoi, S. C., Xu, Z., & Lyu, M. R. (2008). An effective approach to 3d deformable surface tracking. In Proceedings of the 10th European conference on computer vision: Part III, ECCV ’08 (pp. 766–779). Berlin: Springer.
Additional information
This work was supported by the EU Interreg A4 Project “Intelligent Robots for Handling flexible Objects” (IRFO), 33-1.2-09.
Appendices
Appendix A: Occlusion Handling
The ability of the algorithm to cope with holes in the input images (see Sect. 3.2) allows a straightforward occlusion handling to be implemented: pixels that are assumed to occlude the actual object can be removed from the input data without passing infeasible data to the main tracking algorithm. Given the current movement of each vertex, and under the assumption that its speed will at most double in the current frame, it is possible to calculate for each vertex a 3D region and a corresponding image area in which it (or its projection, respectively) can be found in the current frame. Given the color error f_c for the vertices, we assume that most of the vertices have a color fit of twice the color error or lower.
These assumptions allow the formulation of a two-step occlusion classifier. In the first step, each pixel in the depth and color image is assigned one of the following states:
(1) Not classified
(2) Depth and color are in the vicinity of at least one vertex
(3) Only the depth value is in the vicinity of an object vertex, but the color value does not fit
(4) The pixel seems to be part of an occluding object
Starting with every pixel set to state (1), for every vertex v in V and every pixel [x,y] in the vicinity (as defined above) of the projection of v, the following rules are applied:
- If [x,y] is within the three-dimensional vicinity of v according to its recent movement, and the color of v is in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (2).
- If the state of [x,y] is not (2) and [x,y] is within the three-dimensional vicinity of v, but the color of v is not in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (3).
- If the state of [x,y] is not (2) and the depth value at [x,y] is outside the vicinity of v such that it lies in front of v, then set the pixel state of [x,y] to (4).
This procedure yields a classification into certain object pixels (2), uncertain pixels (3), and certain occlusion pixels (4). In a second step, every pixel classified as (3) that lies in the vicinity of a pixel classified as (4) is also set to (4). Finally, every pixel classified as (4) is removed from the input data.
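As a concrete illustration, the following is a minimal sketch of this two-step classifier in Python. The helpers project, vicinity_pixels, in_3d_vicinity, color_fits, and in_front_of are hypothetical stand-ins for the vicinity and color thresholds described above, and the 4-neighborhood used to grow occluder regions in the second step is an assumption, not taken from the text:

```python
import numpy as np

# Pixel states as defined above.
NOT_CLASSIFIED, OBJECT, UNCERTAIN, OCCLUDER = 1, 2, 3, 4

def classify_occlusions(vertices, depth, color, project, vicinity_pixels,
                        in_3d_vicinity, color_fits, in_front_of):
    """Two-step occlusion classifier (sketch with hypothetical helpers).

    depth, color     : input images of identical resolution
    project          : maps a vertex to its image projection
    vicinity_pixels  : pixels [x,y] in the image area of a projected vertex
    in_3d_vicinity   : True if a depth value lies in the 3D region of v
    color_fits       : True if a color value is in the vicinity of v's color
    in_front_of      : True if a depth value lies in front of v's region
    """
    state = np.full(depth.shape, NOT_CLASSIFIED, dtype=np.uint8)

    # Step 1: apply the three rules for every vertex and every nearby pixel.
    for v in vertices:
        for x, y in vicinity_pixels(project(v)):
            if in_3d_vicinity(v, depth[y, x]) and color_fits(v, color[y, x]):
                state[y, x] = OBJECT                        # state (2)
            elif state[y, x] != OBJECT and in_3d_vicinity(v, depth[y, x]):
                state[y, x] = UNCERTAIN                     # state (3)
            elif state[y, x] != OBJECT and in_front_of(depth[y, x], v):
                state[y, x] = OCCLUDER                      # state (4)

    # Step 2: uncertain pixels next to certain occluder pixels become
    # occluders as well (a 4-neighborhood is assumed for "vicinity" here).
    occ = state == OCCLUDER
    near_occ = np.zeros_like(occ)
    near_occ[1:, :] |= occ[:-1, :]
    near_occ[:-1, :] |= occ[1:, :]
    near_occ[:, 1:] |= occ[:, :-1]
    near_occ[:, :-1] |= occ[:, 1:]
    state[(state == UNCERTAIN) & near_occ] = OCCLUDER

    # Finally, occluder pixels are masked out of the input data.
    valid_mask = state != OCCLUDER
    return state, valid_mask
```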
Appendix B: Training of Color/Depth Weights
A rather clumsy property of Jordt and Koch (2011) is the manually chosen weighting of the color and depth errors. Because of the novel formulation of (10) and (13), the depth error function does not have to be weighted, so the color weight λ_c (see Eq. (13)) is the only hyperparameter of this method. The selection of color/depth weights is an instance of a general problem that arises when errors from different domains are fused into one fitness value: the domains lack a common metric. Without additional information, a depth error value cannot be compared to an error value in the color space, and a fused error function that simply adds up these error values in their native domains is likely to disregard one of them.
A common tool for handling this problem is to define a certainty measure in each domain via the variance of a Gaussian distribution (Kim et al. 2009) or a cost function derived from it (Bartczak and Koch 2009). Once every color and depth measurement is equipped with a cost or a distribution, a sound solution can be formulated by calculating the maximum likelihood of the given measurements or the cost minimum, respectively. Although this approach is based on statistics, it can be shown that an optimum calculated in this way is equivalent to the minimum of the weighted squared error functions, given the correct weight.
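To make this equivalence explicit, consider a short derivation under the assumption that f_c(P) and f_d(P) are sums of squared color and depth residuals corrupted by independent zero-mean Gaussian noise with variances \(\sigma_c^2\) and \(\sigma_d^2\) (a standard noise model, used here only for illustration). Minimizing the negative log-likelihood then gives

$$-\log L(P) \;=\; \frac{f_c(P)}{2\sigma_c^2} + \frac{f_d(P)}{2\sigma_d^2} + \mathrm{const} \;\;\propto\;\; f_d(P) + \underbrace{\frac{\sigma_d^2}{\sigma_c^2}}_{=\,\lambda_c}\, f_c(P),$$

so the maximum-likelihood optimum coincides with the minimum of the weighted sum of squared errors, with the correct weight given by the ratio of the noise variances.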
Though it is possible to define reasonable variances for the Kinect color and depth measurements, the input data in the experiments of Sect. 6 show that it is more appropriate to weight color and depth information depending on the input data than to define static variances. Hence, the weight λ_c is adapted according to the input data.
f_c(⋅) and f_d(⋅) both have a linear characteristic with respect to the number of “falsely matched” pixels, and both functions yield 0 for a perfect fit on perfect input data. This means that for the perfect deformation P*, the errors caused by noise in the color and depth images are f_c(P*) and f_d(P*), respectively. Following the principle of calculating a maximum likelihood based on variances derived from these noise levels, the resulting weight λ_c should equal \(\frac{f_{d}(P^{*})}{f_{c}(P^{*})}\).
Although this definition can be deduced statistically, it is based on the assumption that P* is the perfect match, i.e. for every frame it is assumed that the preceding frames were matched correctly. In a situation such as the one depicted in Fig. 7 (left column), in which the color information is valuable whereas the depth information is rather useless, f_c(P*) is likely to yield increased error values even for a good solution P*, while f_d(⋅) remains constantly low. This would cause λ_c to decrease, leading to a fit much less influenced by the color, which in turn causes λ_c to decrease further. A similar example, with an increasing λ_c, can be found in Fig. 18, which depicts the tracking of a white object on a white background.
To counter this behavior, a second aspect is considered in addition to the noise levels: the information content of the error values. Due to the optimization process, f has already been evaluated in the vicinity of P*. Let \(\mathcal{P}\) be the set of individuals used by CMA-ES to calculate P* and let
The resulting inf_c and inf_d measure how valuable the error values were in determining the current best fit with respect to their “noise levels”. The ratio \(\frac{\mathrm{inf}_{c}}{\mathrm{inf}_{d}}\) is therefore a measure of how valuable the color error is in relation to the depth error. Hence, the color weight is updated after each iteration such that:
where n is the number of the current frame.
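Since the defining equations for inf_c, inf_d and the update rule are not reproduced above, the following Python sketch only illustrates the mechanism under explicit assumptions: inf_c and inf_d are approximated as the spread of the color and depth errors over the CMA-ES population, normalized by their values at the best fit P*, and λ_c is blended over the frames by a running average. All names and formulas here are illustrative stand-ins, not the paper’s exact definitions:

```python
def update_color_weight(lambda_c, n, population, f_c, f_d, best, eps=1e-12):
    """Illustrative adaptation of the color weight lambda_c.

    lambda_c   : weight from the previous frame
    n          : number of the current frame (n >= 1)
    population : CMA-ES individuals evaluated while finding `best`
    f_c, f_d   : color and depth error functions
    best       : best deformation P* found for the current frame
    """
    fc_vals = [f_c(P) for P in population]
    fd_vals = [f_d(P) for P in population]

    # Assumed proxy for the "information content" of each error term:
    # its spread over the population relative to its noise level at P*.
    inf_c = (max(fc_vals) - min(fc_vals)) / max(f_c(best), eps)
    inf_d = (max(fd_vals) - min(fd_vals)) / max(f_d(best), eps)

    # Per-frame target: the noise-level ratio f_d(P*)/f_c(P*), scaled by
    # how informative the color error was relative to the depth error.
    target = (f_d(best) / max(f_c(best), eps)) * (inf_c / max(inf_d, eps))

    # Running average over the frames so that a single ambiguous frame
    # cannot destabilize the weight.
    return ((n - 1) * lambda_c + target) / n
```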