Unsupervised skeleton extraction and motion capture from 3D deformable matching
Introduction
Unsupervised object skeleton extraction is an active research topic in computer vision, as it can potentially improve performance in applications such as 3D motion capture, 3D pose estimation, activity recognition, and 3D object tracking.
Previous approaches extract articulated object skeletons from videos [1], [2], [3], [4], [5], [6], motion capture data [7], [8], and static object models [9], [10], [11], [12], [13]. However, each of these approaches has intrinsic limitations: some have difficulty reflecting 3D object motion, some require markers on the object, and some produce skeletons and recorded motion that are inaccurate in 3D coordinates. Meanwhile, with the fast development of 3D hardware, the Kinect, a new kind of 3D sensor, has received increasing attention for solving traditional computer vision problems. Developed by Microsoft for the Xbox 360 video game platform, it collects both an RGB video stream and a depth stream at a frame rate of 30 Hz (as shown in Fig. 1(a1–2, b1–2)). The Kinect thus offers another option: why not extract object skeletons and capture motion directly from 3D point cloud sequences? The purpose of this paper is therefore to develop a novel approach that extracts articulated object skeletons and captures object motion directly from 3D point cloud sequences without any prior information, such as the object type.
Our approach consists of three main steps, as illustrated in Fig. 2. First, we use a coarse-to-fine MRF-based 3D non-rigid matching to track all the raw 3D points (as shown in Fig. 2(a–c)). Then, we apply spectral clustering to group the resulting point trajectories into different body segments (as shown in Fig. 2(d)). Finally, we use a graph model to determine the connections between the body segments (as shown in Fig. 2(e)). The extracted skeleton can then be applied directly to motion capture (as shown in Fig. 2).
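To make the final step concrete, the sketch below shows one plausible realization of the segment-connection stage: a minimum spanning tree over a complete graph of segments, with the mean closest inter-segment distance as the edge cost. The function name, the cost definition, and the MST choice are illustrative assumptions on our part, not the paper's exact graph model.

```python
import numpy as np
from scipy.sparse.csgraph import minimum_spanning_tree

def connect_segments(traj, labels):
    """Sketch of a segment-connection stage (illustrative assumptions).

    traj: (P, F, 3) point trajectories; labels: (P,) segment ids."""
    ids = np.unique(labels)
    k = len(ids)
    cost = np.zeros((k, k))
    for i in range(k):
        for j in range(i + 1, k):
            a = traj[labels == ids[i]]          # (Pa, F, 3)
            b = traj[labels == ids[j]]          # (Pb, F, 3)
            # Closest distance between the two segments in each frame,
            # averaged over frames: nearby segments get a low edge cost.
            d = np.linalg.norm(a[:, None] - b[None, :], axis=-1)
            cost[i, j] = d.min(axis=(0, 1)).mean()
    # The skeleton topology keeps the k - 1 cheapest connections.
    tree = minimum_spanning_tree(cost)
    return [(ids[i], ids[j]) for i, j in zip(*tree.nonzero())]
```

An MST is a natural default here because an articulated skeleton with k segments has exactly k - 1 connections, and segments that stay close over all frames are the most likely to share a joint.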
The proposed approach has the following key features that make it advantageous over previous ones: (1) Our system does not require markers during data collection, which not only saves human labor but also helps in learning the object's unknown structure, as shown in Fig. 1(a1 and a2). In contrast, marker-based approaches place markers on key parts of the articulated object (such as joints and body segment centers) according to the operator's subjective understanding, which can introduce prior errors into the learning of an unknown structure. (2) Compared with static-model-based approaches, our approach uses motion information to obtain accurate object segments and does not require well-constructed 3D models. (3) Unlike video-based 3D reconstruction, the Kinect directly provides the object's spatial structure without any shape deformation assumptions. Moreover, to obtain the body segments' motion, we can track all the raw 3D points directly, without feature point extraction; the 3D point tracking therefore does not suffer from monotonous colors and illumination changes, as video-based tracking does.
The main contributions of this paper are summarized as follows: (1) To the best of our knowledge, this is the first work that extracts skeletons of complex articulated objects directly from 3D point cloud sequences, without prior information, via point-level tracking. Our algorithm provides a global segmentation of the object's body segments that is robust to small intra-segment deformation. (2) We propose an efficient coarse-to-fine framework to track the 3D points based on the MRF deformation model. The coarse-to-fine strategy greatly reduces the tracking's time and memory cost. To the best of our knowledge, this is the first work to track all the raw 3D points of a deformable object without any transformation assumptions.
The rest of this paper is organized as follows: the related work is briefly reviewed in the following section. Section 3 presents the coarse-to-fine 3D point tracking and Section 4 presents the skeleton extraction. The experiments and results are presented in Section 5. Finally, the paper is concluded in Section 6.
Section snippets
Related work
Many previous approaches extract skeletons from videos [1], [2], [3], [4], [5], [6]. Refs. [1], [2], [3], [4] use the KLT tracker [15] to obtain feature trajectories. However, these methods require sufficient feature points on key parts of the object for good performance, and the image-based tracking may suffer from illumination changes.
Some vision-based methods utilize multiple cameras to obtain a dense 3D point cloud sequence of the object, and then extract the object skeleton from the …
Coarse-to-fine tracking of 3D points
The first stage of our system generates 3D point trajectories by tracking each 3D point over frames (as shown in Fig. 3). The point tracking is based on multi-frame 3D non-rigid matching. Because the matching-based tracking is not formulated as a frame-by-frame Markov process, it avoids the accumulation of tracking errors over frames. To match two specific frames, we extend the image-based MRF deformation model proposed in [22], [23]. However, matching a large number of 3D points directly would incur an intolerable time and memory cost, which motivates our coarse-to-fine strategy.
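To illustrate only the coarse-to-fine structure (not the MRF optimization itself), here is a minimal Python sketch. The voxel size is an assumed parameter, and a plain nearest-neighbor query stands in for the MRF matching, so this should be read as a schematic of the strategy rather than the paper's deformation model.

```python
import numpy as np
from scipy.spatial import cKDTree

def coarse_to_fine_match(src, dst, voxel=0.05):
    """Match every point of frame `src` (N, 3) to frame `dst` (M, 3)."""
    # Coarse level: keep one representative point per occupied voxel.
    keys = np.floor(src / voxel).astype(np.int64)
    _, first = np.unique(keys, axis=0, return_index=True)
    coarse = src[first]

    # Coarse correspondences. A nearest-neighbor query stands in for
    # the MRF optimization, which would additionally enforce smoothness
    # between the displacements of neighboring coarse nodes.
    disp = dst[cKDTree(dst).query(coarse)[1]] - coarse

    # Fine level: every raw point inherits its nearest coarse node's
    # displacement, then snaps to the closest point in the target frame.
    guess = src + disp[cKDTree(coarse).query(src)[1]]
    return dst[cKDTree(dst).query(guess)[1]]
```

The two-level structure is what saves time and memory: the expensive matching runs only on the sparse coarse nodes, while the dense raw points are handled by cheap local lookups.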
Skeleton extraction
The second stage of our system extracts the dynamic skeleton from the 3D point trajectories generated in Section 3. The marker-based skeleton extraction approach proposed in [7], [4] is modified to adapt to 3D point cloud data.
Body segmentation: The first step of skeleton extraction clusters the points into rigid body segments (Fig. 6(a)). The body segments reflect the body's skeletal structure. In an ideal rigid body, any two points should keep a constant distance over frames, and the temporal variation of their pairwise distance can therefore serve as a measure of non-rigidity for the clustering.
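This rigidity criterion maps directly onto an affinity matrix for the spectral clustering mentioned above. The sketch below is our illustration: the Gaussian bandwidth sigma and the use of scikit-learn are assumptions, not details taken from the paper.

```python
import numpy as np
from sklearn.cluster import SpectralClustering

def segment_points(traj, n_segments, sigma=0.01):
    """Cluster trajectories into rigid segments (illustrative sketch).

    traj: (P, F, 3) array, one 3D trajectory per tracked point.
    sigma: assumed affinity bandwidth in meters."""
    # Pairwise point-to-point distance in every frame: (P, P, F).
    d = np.linalg.norm(traj[:, None] - traj[None, :], axis=-1)

    # On a rigid segment the distance is constant over frames, so its
    # standard deviation over time measures pairwise non-rigidity.
    nonrigidity = d.std(axis=-1)

    # Turn the dissimilarity into a Gaussian affinity and cluster.
    affinity = np.exp(-nonrigidity**2 / (2 * sigma**2))
    return SpectralClustering(n_clusters=n_segments,
                              affinity='precomputed').fit_predict(affinity)
```

Spectral clustering suits this step because the affinity is a precomputed pairwise matrix rather than coordinates in a vector space, and points on the same limb need not be spatially adjacent in any single frame.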
Experiment
To show the performance of our system, we used the Kinect to collect four 3D point cloud sequences and extracted a skeleton from each. Sequence 1 is a man, Sequence 2 is a man holding two cones, Sequence 3 is a box chain, and Sequence 4 is a vacuum cleaner. We further collected 3D data of a human upper body using both the Kinect (Sequence 5) and a marker-based motion capture system (marker trajectories), in order to compare our method with the marker-based one.
Conclusion and discussion
This paper presents a novel skeleton extraction algorithm that works on 3D point cloud sequences, and the extracted skeletons can be easily used for motion capture. In general, the unsupervisedly learned skeleton successfully reflects the object's true articulated structure. Although a skeleton extracted without supervision cannot be as accurate as model-based pose estimation, our system can discover the dynamic topological structure of an "unknown" object.
The integral geodesic distance is …
Acknowledgments
We thank Prof. Hajime Asama, Mr. Yuki Ishikawa, and Mr. Qi An of the Asama Laboratory, University of Tokyo, for their help with data collection. This work was supported by a Grant-in-Aid for Young Scientists (23700192) and by the Strategic Project to Support the Formation of Research Bases at Private Universities: Matching Fund Subsidy from MEXT (Ministry of Education, Culture, Sports, Science and Technology), 2008–2012. This work was also partially funded by Microsoft Research.
References (27)
- et al., Delaunay conforming iso-surface, skeleton extraction and noise removal, Comput. Geomet. (2001)
- et al., Efficient MRF deformation model for non-rigid image matching, Comput. Vis. Image Understand. (2008)
- et al., Unsupervised learning of skeletons from motion, Eur. Conf. Comput. Vis. (2008)
- et al., Learning articulated structure and motion, Int. J. Comput. Vis. (2010)
- et al., A factorization-based approach for articulated nonrigid shape, motion and kinematic chain recovery from video, IEEE Trans. Pattern Anal. Mach. Intell. (2008)
- et al., Automatic kinematic chain building from feature trajectories of articulated objects, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2006)
- et al., Articulated structure from motion by factorization, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2005)
- et al., Building models of animals from video, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
- et al., Skeletal parameter estimation from optical motion capture data, IEEE Comput. Soc. Conf. Comput. Vis. Pattern Recognit. (2005)
- et al., Learning kinematic models for articulated objects, Proc. Int. Conf. Artif. Intell. (2009)
- Curve skeleton extraction from incomplete point cloud, Proc. ACM SIGGRAPH
- Skeleton extraction by mesh contraction, Proc. ACM SIGGRAPH
- Harmonic skeleton for realistic character animation, ACM SIGGRAPH Sympos. Comput. Animat.
Cited by (37)
- Novel data fusion strategy for human gait analysis using multiple kinect sensors. 2021, Biomedical Signal Processing and Control. Citation excerpt: "On the other hand, single-frame-based approaches are not that easy as they do not take any assumptions for time coherence. Skeleton tracking algorithms are grouped into models based on single view [26–28] and multiple views [29,30]. Masse et al. [31] presented augmentation of human joint positions, though based on a multi-sensor approach but involving only one Kinect sensor."
- 3D articulated skeleton extraction using a single consumer-grade depth camera. 2019, Computer Vision and Image Understanding. Citation excerpt: "We show the visual comparisons between our approach and the state of the art techniques (Method I–III) on various objects (full body: Figs. 11–13(a–f), upper body: Fig. 13(g–l), hand: 13(m–n, q–r), lower body: 13(o–p), arm: 13(s–t), and fish: 13(u–v)). Our approach produces substantially higher quality skeletons, compared with Method III (Zhang et al., 2013). Our method can even learn better skeletons (Figs. 11 and 12) than Method II (Kirk et al., 2005), despite their good results are probably mainly due to quality marker input."
- Space-time representation of people based on 3D skeletal data: A review. 2017, Computer Vision and Image Understanding.
- Full body movements recognition - unsupervised learning approach with heuristic R-GDL method. 2015, Digital Signal Processing: A Review Journal.
- Transfer4D: A Framework for Frugal Motion Capture and Deformation Transfer. 2023, Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition.
Quanshi Zhang received his B.S. degree in Machine Intelligence from Peking University, China, in 2009 and M.S. degree from the Center for Spatial Information Science, University of Tokyo, Japan, in 2011. Since 2011, he has been a Ph.D. candidate at the Center for Spatial Information Science, University of Tokyo. His research is mainly in artificial intelligence, computer vision, and robotics, especially the construction of visual object knowledge bases, 3D point cloud processing, and knowledge mining.
Xuan Song received his B.S. degree in Information Engineering from Jilin University, China, in 2005 and Ph.D. degree in Signal and Information Processing from Peking University, China, in 2010. From 2010 to 2012, he was a Post-Doctoral Researcher at the Center for Spatial Information Science, The University of Tokyo. In 2012, he was promoted to Project Assistant Professor with the Center for Spatial Information Science, The University of Tokyo. His research is mainly in artificial intelligence, computer vision, and robotics, especially intelligent system design, multi-target tracking, sensor fusion, and abnormality detection, with applications to intelligent surveillance.
Xiaowei Shao received his B.E. and Ph.D. degrees in Electronic Engineering and Information Science from the University of Science and Technology of China in 1999 and 2006, respectively. From 2006 to 2008 he worked as a Researcher in the Center for Spatial Information Science, University of Tokyo, Japan, and as a Project Assistant Professor from 2008 to 2012. Since April 2012 he has been a Project Associate Professor at the same university. His research interests include machine vision, pattern recognition, surveillance, target tracking, and spatial data processing.
Ryosuke Shibasaki was born in Fukuoka, Japan. He received his B.S., M.S., and Doctoral degrees in Civil Engineering from the University of Tokyo, Tokyo, Japan, in 1980, 1982, and 1987, respectively. From 1982 to 1988, he was with the Public Works Research Institute, Ministry of Construction. From 1988 to 1991, he was an Associate Professor in the Civil Engineering Department, University of Tokyo. In 1991, he joined the Institute of Industrial Science, University of Tokyo. In 1998, he was promoted to Professor in the Center for Spatial Information Science, University of Tokyo. His research interests cover three-dimensional data acquisition for GIS, conceptual modeling for spatial objects, and agent-based microsimulation in a GIS environment.
Huijing Zhao received her B.S. degree in Computer Science from Peking University, Beijing, China, in 1991 and the M.E. and Ph.D. degrees in Civil Engineering from the University of Tokyo, Tokyo, Japan, in 1996 and 1999, respectively. From 1991 to 1994, she was with Peking University, where she was involved with a project developing a graphic information system platform. In 2003, after several years of postdoctoral research with the University of Tokyo, she became a Visiting Associate Professor with the Center for Spatial Information Science. In 2007, she joined Peking University as a Professor with the Key Laboratory of Machine Perception (MOE), and the School of Electronics Engineering and Computer Science. Her research interests include machine perception, intelligent vehicles, and spatial data handling.