Abstract
Tracking deformable objects in video data is a demanding research topic due to its inherent ambiguities, which can only be resolved using additional assumptions about the deformation. Image feature points, commonly used to approach the deformation problem, provide only sparse information about the scene at hand. In this paper, a tracking approach for deformable objects in color and depth video is introduced that does not rely on feature points or optical flow data but employs all the available input image information to find a suitable deformation for the data at hand. A versatile NURBS-based deformation space is defined for arbitrarily complex triangle meshes, decoupling the complexity of the object surface from the complexity of the deformation. An efficient optimization scheme is introduced that calculates results in real time (25 Hz). Extensive tests of the algorithm and its features on synthetic and real data demonstrate the reliability of this approach.
References
Alizadeh, F., & Goldfarb, D. (2001). Second-order cone programming. Mathematical Programming, 95, 3–51.
Auger, A., Brockhoff, D., & Hansen, N. (2010). Benchmarking the (1,4)-CMA-ES with mirrored sampling and sequential selection on the noisy BBOB-2010 testbed. In GECCO workshop on Black-Box optimization benchmarking (BBOB’2010) (pp. 1625–1632). New York: ACM.
Bardinet, E., Cohen, L. D., & Ayache, N. (1998). A parametric deformable model to fit unstructured 3d data. Computer Vision and Image Understanding, 71(1), 39–54.
Bartczak, B., & Koch, R. (2009). Dense depth maps from low resolution time-of-flight depth and high resolution color views. In Lecture notes in computer science: Vol. 5876. ISVC (2) (pp. 228–239). Berlin: Springer.
Bartoli, A., & Zisserman, A. (2004). Direct estimation of non-rigid registration. In British machine vision conference.
Bascle, B., & Blake, A. (1998). Separability of pose and expression in facial tracking and animation. In Proceedings of the sixth international conference on computer vision, ICCV ’98 (p. 323). Washington: IEEE Comput. Soc.
Bregler, C., Hertzmann, A., & Biermann, H. (2000). Recovering non-rigid 3d shape from image streams. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2690–2696). Washington: IEEE Comput. Soc.
Cagniart, C., Boyer, E., & Ilic, S. (2009). Iterative mesh deformation for dense surface tracking. In 12th international conference on computer vision workshops.
Cai, Q., Gallup, D., Zhang, C., & Zhang, Z. (2010). 3d deformable face tracking with a commodity depth camera. In Proceedings of the 11th European conference on computer vision: Part III, ECCV’10 (Vol. 6313, pp. 229–242). Berlin: Springer.
Chen, S. E., & Williams, L. (1993). View interpolation for image synthesis. In Proceedings of the 20th annual conference on computer graphics and interactive techniques, SIGGRAPH’93 (pp. 279–288). New York: ACM.
Cohen, L. D., & Cohen, I. (1993). Finite element methods for active contour models and balloons for 2d and 3d images. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(11), 1131–1147.
Costeira, J., & Kanade, T. (1994). A multi-body factorization method for motion analysis (Tech. Rep. CMU-CS-TR-94-220). Computer Science Department, Carnegie Mellon University, Pittsburgh, PA.
de Aguiar, E., Theobalt, C., Stoll, C., & Seidel, H. P. (2007). Marker-less deformable mesh tracking for human shape and motion capture. In IEEE international conference on computer vision and pattern recognition (CVPR), Minneapolis, USA (pp. 1–8). New York: IEEE Press.
de Aguiar, E., Stoll, C., Theobalt, C., Ahmed, N., Seidel, H. P., & Thrun, S. (2008). Performance capture from sparse multi-view video. In ACM transactions on graphics, Proc. of ACM SIGGRAPH (Vol. 27).
Del Bue, A., & Agapito, L. (2006). Non-rigid stereo factorization. International Journal of Computer Vision, 66, 193–207.
Del Bue, A., Smeraldi, F., & Agapito, L. (2007). Non-rigid structure from motion using ranklet-based tracking and non-linear optimization. Image and Vision Computing, 25(3), 297–310.
Delingette, H., Hebert, M., & Ikeuchi, K. (1991). Deformable surfaces: a free-form shape representation. In Geometric methods in computer vision: Vol. 1570. Proc. SPIE (pp. 21–30).
Fayad, J., Del Bue, A., Agapito, L., & Aguiar, P. (2009). Non-rigid structure from motion using quadratic deformation models. In British machine vision conference (BMVC), London, UK.
Fayad, J., Agapito, L., & Bue, A. D. (2010). Piecewise quadratic reconstruction of non-rigid surfaces from monocular sequences. In Proceedings of the 11th European conference on computer vision: Part IV, ECCV’10 (pp. 297–310). Berlin: Springer.
Hansen, N. (2006). The CMA evolution strategy: a comparing review. In Towards a new evolutionary computation. Advances on estimation of distribution algorithms (pp. 75–102). Berlin: Springer.
Hartley, R. I., & Zisserman, A. (2000). Multiple view geometry in computer vision. Cambridge: Cambridge University Press. ISBN:0521623049.
Hilsmann, A., & Eisert, P. (2009). Realistic cloth augmentation in single view video. In Vision, modeling, and visualization workshop 2009, Braunschweig, Germany.
Horn, B. K. P., & Harris, J. G. (1991). Rigid body motion from range image sequences. CVGIP. Image Understanding, 53, 1–13.
Jaklič, A., Leonardis, A., & Solina, F. (2000). Computational imaging and vision: Vol. 20. Segmentation and recovery of superquadrics. Dordrecht: Kluwer. ISBN 0-7923-6601-8.
Jordt, A., & Koch, R. (2011). Fast tracking of deformable objects in depth and colour video. In Proceedings of the British machine vision conference, BMVC 2011. British Machine Vision Association.
Kim, Y. M., Theobalt, C., Diebel, J., Kosecka, J., Micusik, B., & Thrun, S. (2009). Multi-view image and tof sensor fusion for dense 3d reconstruction. In IEEE workshop on 3-D digital imaging and modeling (3DIM), Kyoto, Japan (pp. 1542–1549). New York: IEEE Press.
Koch, R. (1993). Dynamic 3-d scene analysis through synthesis feedback control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(6), 556–568.
McInerney, T., & Terzopoulos, D. (1993). A finite element model for 3d shape reconstruction and nonrigid motion tracking. In 4th international conference on computer vision, ICCV (pp. 518–523).
Muñoz, E., Buenaposada, J. M., & Baumela, L. (2009). A direct approach for efficiently tracking with 3d morphable models. In ICCV (pp. 1615–1622). New York: IEEE Press.
Netravali, A., & Salz, J. (1985). Algorithms for estimation of three-dimensional motion. AT&T Bell Laboratories Technical Journal, 64(2).
Osher, S., & Sethian, J. A. (1988). Fronts propagating with curvature dependent speed: algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79(1), 12–49.
Ostermeier, A., & Hansen, N. (1999). An evolution strategy with coordinate system invariant adaptation of arbitrary normal mutation distributions within the concept of mutative strategy parameter control. In Proceedings of the genetic and evolutionary computation conference (GECCO) (pp. 902–909). San Mateo: Morgan Kaufmann.
Piegl, L., & Tiller, W. (1997). The NURBS book (2nd ed.). Berlin: Springer.
Pilet, J., Lepetit, V., & Fua, P. (2008). Fast non-rigid surface detection, registration and realistic augmentation. International Journal of Computer Vision, 76, 109–122.
Rosenhahn, B., Kersting, U., Powell, K., Klette, R., Klette, G., & Seidel, H. P. (2007). A system for articulated tracking incorporating a clothing model. Machine Vision and Applications, 18, 25–40.
Russell, C., Fayad, J., & Agapito, L. (2011). Energy based multiple model fitting for non-rigid structure from motion. In IEEE conference on computer vision and pattern recognition.
Salzmann, M., Hartley, R., & Fua, P. (2007). Convex optimization for deformable surface 3-d tracking. In ICCV’07 (pp. 1–8).
Salzmann, M., Lepetit, V., & Fua, P. (2007). Deformable surface tracking ambiguities. In IEEE international conference on computer vision and pattern recognition (CVPR).
Schiller, I., Beder, C., & Koch, R. (2008). Calibration of a PMD camera using a planar calibration object together with a multi-camera setup. In The international archives of the photogrammetry, remote sensing and spatial information sciences, Beijing, China (Vol. XXXVII, pp. 297–302). XXI. Part B3a, ISPRS Congress.
Shen, S., Zheng, Y., & Liu, Y. (2008). Deformable surface stereo tracking-by-detection using second order cone programming. In International conference on computer vision and pattern recognition (CVPR) (pp. 1–4). New York: IEEE Press.
Shen, S., Ma, W., Shi, W., & Liu, Y. (2010). Convex optimization for nonrigid stereo reconstruction. IEEE Transactions on Image Processing, 19, 782–794.
Shotton, J., Fitzgibbon, A. W., Cook, M., Sharp, T., Finocchio, M., Moore, R., Kipman, A., & Blake, A. (2011). Real-time human pose recognition in parts from single depth images. In CVPR (pp. 1297–1304). New York: IEEE Press.
Torresani, L., Hertzmann, A., & Bregler, C. (2003). Learning non-rigid 3d shape from 2d motion. In Proceedings of the 17th annual conference on neural information processing systems (NIPS) (pp. 1555–1562). Cambridge: MIT Press.
Taylor, J., Jepson, A. D., & Kutulakos, K. N. (2010). Non-rigid structure from locally-rigid motion. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 2761–2768).
Tomasi, C., & Kanade, T. (1992). Shape and motion from image streams under orthography: a factorization method. International Journal of Computer Vision, 9, 137–154.
Torresani, L., Yang, D. B., Alexander, E. J., & Bregler, C. (2001). Tracking and modeling non-rigid objects with rank constraints. In IEEE conference on computer vision and pattern recognition (CVPR) (pp. 493–500).
Vedula, S., Baker, S., Collins, R., & Kanade, T. (1999). Three-dimensional scene flow. In Proceedings of the 7th international conference on computer vision, ICCV (pp. 722–726). New York: IEEE Press.
Yamamoto, M., Boulanger, P., Beraldin, J. A., & Rioux, M. (1993). Direct estimation of range flow on deformable shape from a video rate range camera. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15(1), 82–89. doi:10.1109/34.184776.
Zhang, Z. (1994). Iterative point matching for registration of free-form curves and surfaces. International Journal of Computer Vision, 13(2), 119–152. doi:10.1007/BF01427149.
Zhu, J., Hoi, S. C., Xu, Z., & Lyu, M. R. (2008). An effective approach to 3d deformable surface tracking. In Proceedings of the 10th European conference on computer vision: Part III, ECCV ’08 (pp. 766–779). Berlin: Springer.
Additional information
This work was supported by the EU Interreg A4 Project “Intelligent Robots for Handling flexible Objects” (IRFO), 33-1.2-09.
Appendices
Appendix A: Occlusion Handling
The ability of the algorithm to cope with holes in the input images (see Sect. 3.2) allows a straightforward occlusion handling to be implemented: pixels that are assumed to occlude the actual object can be removed from the input data without passing infeasible data to the main tracking algorithm. Given the current movement of each vertex, and under the assumption that its speed will at most double in the current frame, it is possible to calculate for each vertex a 3D region and a corresponding image area in which it (or its projection, respectively) can be found in the current frame. Given the color error f_c for the vertices, we assume that most of the vertices have a color fit of twice the color error or lower.
These assumptions allow the formulation of a two-step occlusion classifier. In the first step, each pixel in the depth and color image is assigned one of the following states:
(1) Not classified
(2) Depth and color are in the vicinity of at least one vertex
(3) Only the depth value is in the vicinity of an object vertex, but the color value does not fit
(4) The pixel seems to be part of an occluding object
Starting with every pixel set to state (1), for every vertex v in V and every pixel [x,y] in the vicinity (as defined above) of the projection of v, the following rules are applied:
- If [x,y] is within the three-dimensional vicinity of v according to its recent movement, and the color of v is in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (2).
- If the state of [x,y] is not (2) and [x,y] is within the three-dimensional vicinity of v, but the color of v is not in the vicinity of the color at [x,y], then set the pixel state of [x,y] to (3).
- If the state of [x,y] is not (2) and the depth value at [x,y] is outside the vicinity of v such that it lies in front of v, then set the pixel state of [x,y] to (4).
This procedure yields a classification into certain object pixels (2), uncertain pixels (3), and certain occlusion pixels (4). In a second step, every pixel classified as (3) that lies in the vicinity of a pixel classified as (4) is also set to (4). Finally, every pixel classified as (4) is removed from the input data.
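As a concrete illustration, the following is a minimal sketch of this two-step classifier in Python. The helpers project, vicinity_pixels, in_3d_vicinity, color_fits, and in_front_of are hypothetical stand-ins for the vicinity and color thresholds described above, and the 4-neighborhood used to grow occluder regions in the second step is an assumption, not taken from the text:

```python
import numpy as np

# Pixel states as defined above.
NOT_CLASSIFIED, OBJECT, UNCERTAIN, OCCLUDER = 1, 2, 3, 4

def classify_occlusions(vertices, depth, color, project, vicinity_pixels,
                        in_3d_vicinity, color_fits, in_front_of):
    """Two-step occlusion classifier (sketch with hypothetical helpers).

    depth, color     : input images of identical resolution
    project          : maps a vertex to its image projection
    vicinity_pixels  : pixels [x,y] in the image area of a projected vertex
    in_3d_vicinity   : True if a depth value lies in the 3D region of v
    color_fits       : True if a color value is in the vicinity of v's color
    in_front_of      : True if a depth value lies in front of v's region
    """
    state = np.full(depth.shape, NOT_CLASSIFIED, dtype=np.uint8)

    # Step 1: apply the three rules for every vertex and every nearby pixel.
    for v in vertices:
        for x, y in vicinity_pixels(project(v)):
            if in_3d_vicinity(v, depth[y, x]) and color_fits(v, color[y, x]):
                state[y, x] = OBJECT                        # state (2)
            elif state[y, x] != OBJECT and in_3d_vicinity(v, depth[y, x]):
                state[y, x] = UNCERTAIN                     # state (3)
            elif state[y, x] != OBJECT and in_front_of(depth[y, x], v):
                state[y, x] = OCCLUDER                      # state (4)

    # Step 2: uncertain pixels next to certain occluder pixels become
    # occluders as well (a 4-neighborhood is assumed for "vicinity" here).
    occ = state == OCCLUDER
    near_occ = np.zeros_like(occ)
    near_occ[1:, :] |= occ[:-1, :]
    near_occ[:-1, :] |= occ[1:, :]
    near_occ[:, 1:] |= occ[:, :-1]
    near_occ[:, :-1] |= occ[:, 1:]
    state[(state == UNCERTAIN) & near_occ] = OCCLUDER

    # Finally, occluder pixels are masked out of the input data.
    valid_mask = state != OCCLUDER
    return state, valid_mask
```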
Appendix B: Training of Color/Depth Weights
A rather clumsy property of Jordt and Koch (2011) is the manually chosen weighting of the color and depth errors. Because of the novel formulation of (10) and (13), the depth error function does not have to be weighted, so the color weight λ_c (see Eq. (13)) is the only hyperparameter of this method. The selection of color/depth weights is an instance of a general problem that arises when errors from different domains are fused into one fitness value: the domains lack a common metric. Without additional information, a depth error value cannot be compared to an error value in the color space, and a fused error function that simply adds up these error values in their native domains is likely to disregard one of them.
A common tool for handling this problem is to define a certainty measure in each domain via the variance of a Gaussian distribution (Kim et al. 2009) or a cost function derived from it (Bartczak and Koch 2009). Once every color and depth measurement is equipped with a cost or a distribution, a sound solution can be formulated by calculating the maximum likelihood of the given measurements or the cost minimum, respectively. Although this approach is based on statistics, it can be shown that an optimum calculated in this way is equivalent to the minimum of the weighted squared error functions, given the correct weight.
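To make this equivalence explicit, consider a short derivation under the assumption that f_c(P) and f_d(P) are sums of squared color and depth residuals corrupted by independent zero-mean Gaussian noise with variances \(\sigma_c^2\) and \(\sigma_d^2\) (a standard noise model, used here only for illustration). Minimizing the negative log-likelihood then gives

$$-\log L(P) \;=\; \frac{f_c(P)}{2\sigma_c^2} + \frac{f_d(P)}{2\sigma_d^2} + \mathrm{const} \;\;\propto\;\; f_d(P) + \underbrace{\frac{\sigma_d^2}{\sigma_c^2}}_{=\,\lambda_c}\, f_c(P),$$

so the maximum-likelihood optimum coincides with the minimum of the weighted sum of squared errors, with the correct weight given by the ratio of the noise variances.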
Though it is possible to define reasonable variances for the Kinect color and depth measurements, the input data in the experiments of Sect. 6 show that it is more appropriate to weight color and depth information depending on the input data than to define static variances. Hence, the weight λ_c is adapted according to the input data.
f_c(⋅) and f_d(⋅) both have a linear characteristic with respect to the number of “falsely matched” pixels, and both functions yield 0 for a perfect fit on perfect input data. This means that for the perfect deformation P*, the errors caused by noise in the color and depth images are f_c(P*) and f_d(P*), respectively. Following the principle of calculating a maximum likelihood based on variances derived from these noise levels, the resulting weight λ_c should equal \(\frac{f_{d}(P^{*})}{f_{c}(P^{*})}\).
Although this definition can be deduced statistically, it is based on the assumption that P* is the perfect match, i.e. for every frame it is assumed that the preceding frames were matched correctly. In a situation such as the one depicted in Fig. 7 (left column), in which the color information is valuable whereas the depth information is rather useless, f_c(P*) is likely to yield increased error values even for a good solution P*, while f_d(⋅) remains constantly low. This would cause λ_c to decrease, leading to a fit much less influenced by the color, which in turn causes λ_c to decrease further. A similar example, with an increasing λ_c, can be found in Fig. 18, which depicts the tracking of a white object on a white background.
To counter this behavior, a second aspect is considered in addition to the noise levels: the information content of the error values. Due to the optimization process, f has already been evaluated in the vicinity of P*. Let \(\mathcal{P}\) be the set of individuals used by CMA-ES to calculate P* and let
The resulting inf_c and inf_d measure how valuable the error values were in determining the current best fit with respect to their “noise levels”. The ratio \(\frac{\mathrm{inf}_{c}}{\mathrm{inf}_{d}}\) is therefore a measure of how valuable the color error is in relation to the depth error. Hence, the color weight is updated after each iteration such that:
where n is the number of the current frame.
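Since the defining equations for inf_c, inf_d and the update rule are not reproduced above, the following Python sketch only illustrates the mechanism under explicit assumptions: inf_c and inf_d are approximated as the spread of the color and depth errors over the CMA-ES population, normalized by their values at the best fit P*, and λ_c is blended over the frames by a running average. All names and formulas here are illustrative stand-ins, not the paper’s exact definitions:

```python
def update_color_weight(lambda_c, n, population, f_c, f_d, best, eps=1e-12):
    """Illustrative adaptation of the color weight lambda_c.

    lambda_c   : weight from the previous frame
    n          : number of the current frame (n >= 1)
    population : CMA-ES individuals evaluated while finding `best`
    f_c, f_d   : color and depth error functions
    best       : best deformation P* found for the current frame
    """
    fc_vals = [f_c(P) for P in population]
    fd_vals = [f_d(P) for P in population]

    # Assumed proxy for the "information content" of each error term:
    # its spread over the population relative to its noise level at P*.
    inf_c = (max(fc_vals) - min(fc_vals)) / max(f_c(best), eps)
    inf_d = (max(fd_vals) - min(fd_vals)) / max(f_d(best), eps)

    # Per-frame target: the noise-level ratio f_d(P*)/f_c(P*), scaled by
    # how informative the color error was relative to the depth error.
    target = (f_d(best) / max(f_c(best), eps)) * (inf_c / max(inf_d, eps))

    # Running average over the frames so that a single ambiguous frame
    # cannot destabilize the weight.
    return ((n - 1) * lambda_c + target) / n
```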