Abstract:
Accurate acquisition of 3-D human joint poses holds significant implications for tasks such as human action recognition. Monocular single-frame 2-D-to-3-D pose estimation focuses on establishing the correspondence between the 2-D human pose in a single image and its 3-D spatial pose, delegating the preliminary task of 2-D pose estimation to models better suited for processing pixel information. The main difficulty of 2-D-to-3-D pose estimation lies in modeling the spatial constraints among joints. To better learn the structure between joints, this article proposes the SPGformer algorithm, constructed with stacked serial–parallel GCN-encoder (SPGEncoder) modules. Each module forms a dual-branch framework composed of transformer encoders (Encoders) and graph-oriented encoders (GraEncoders). We recover concealed depth values from the 2-D coordinates of joints and feed them into the joint branch of the SPGEncoder. In parallel, we take the connection features of joints in the image as input to the vector branch. The proposed GraEncoder module integrates a learnable graph convolutional network (GCN) prior to the Encoder, enabling the learning of a broader spectrum of joint connections within the confines of skeletal linkage. Furthermore, this article presents a methodology for calculating the 3-D absolute pose of the root node, filling a research gap for applications requiring precise human position. This nonlearnable, plug-and-play method has been validated on the Human3.6M dataset. The SPGformer algorithm outperforms state-of-the-art methods on both the Human3.6M and MPI-INF-3DHP datasets.
Published in: IEEE Transactions on Instrumentation and Measurement (Volume: 73)