1 Introduction

Human pose estimation (HPE) aims to locate body parts in input images. It serves as a fundamental tool for practical applications such as action recognition, human-computer interaction and video surveillance [1]. The most recent HPE systems have adopted convolutional neural networks (CNNs) [2,3,4] as their backbones and yielded drastic improvements on standard benchmarks [5,6,7,8,9]. However, they are still prone to fail when there are ambiguities caused by overlapping parts, nearby persons and cluttered backgrounds, e.g., Fig. 1.

Fig. 1.

Pairs of pose predictions obtained by an eight-stack hourglass network [5] (left) and our approach (right). Some wrong part localizations are highlighted by green ellipses. By exploiting compositionality of human bodies, our approach is able to reduce low-level ambiguities in pose estimations. See Fig. 8 for more examples (Color figure online)

Fig. 2.

(a) A typical compositional model of a human body. The pose is estimated via two stages: bottom-up inference followed by top-down refinement. (b) Each tensor represents score maps of several parts. An SLIS function aggregates information from input score maps on a spatially local support to predict output score maps. (c) Overview of our deeply learned compositional model. The orange and green arrows respectively denote SLIS functions modeled by CNNs in bottom-up and top-down stages. The colored rectangles on the left side denote predicted score maps of parts at different semantic levels while the heat maps on the right side represent their corresponding ground truth in the training phase (Color figure online)

One promising way to tackle these difficulties is to exploit the compositionality [10, 11] of human bodies, i.e., to represent a whole body as a hierarchy of parts and subparts that satisfy certain articulation constraints. This kind of hierarchical structure enables us to capture high-order relationships among parts and characterize an exponential number of plausible poses [12]. Based on this principle, compositional models [13, 14] infer poses via two stages, as illustrated in Fig. 2(a). In the bottom-up stage, states of higher-level parts are recursively predicted from states of their child parts. In the top-down stage, states of lower-level parts are refined by their parents’ states updated one step earlier. Such global adjustments enable pose estimations to optimally meet the relational constraints and thus reduce low-level image ambiguities. Over the last decade, compositional models have been adopted in several HPE systems [12, 15,16,17,18,19] and have shown superior performance over their flat counterparts.

However, there are problems with existing compositional models designed for HPE [12, 15,16,17,18,19]. First, they often assume a Gaussian distribution on the subpart-part displacement, with the subpart’s anchor position as its mean. While this assumption simplifies both inference and learning [20], it generally does not hold in real scenarios, e.g., the distributions of joints visualized in [21,22,23]. Thus, we argue it is incapable of characterizing the complex compositional relationships among body parts. Second, a set of discrete type variables is often used to model the compatibility among parts. These types not only include the orientation and scale of a part but also span semantic classes (a straight versus a bent arm). As the distinct types of a part can be as numerous as the combinations of all its children’s types, state spaces for higher-level parts can be exponentially large. This makes both computation and storage demanding. Third, when the compositional structure has loops, approximate inference algorithms must be used. As a result, both learning and testing are adversely affected.

To address these issues, this paper introduces a novel framework, termed the Deeply Learned Compositional Model (DLCM), for HPE. We first show that each bottom-up/top-down inference step of general compositional models is an instantiation of a generalized process we call spatially local information summarization (SLIS). As shown in Fig. 2(b), it aggregates information from input score maps on a spatially local support to predict output score maps. In this paper, we exploit CNNs to model this process due to their capability to approximate inference functions via spatially local connections. As a result, DLCMs can learn more sophisticated and realistic compositional patterns within human bodies. To avoid potentially large state spaces, we propose to use state variables to denote only locations and to embed the type information into score maps. Specifically, we use bone segments to represent a part and supervise its score map in the training phase. This novel representation not only compactly encodes the orientation, scale and shape of a part, but also reduces both computation and space complexities. Figure 2(c) provides an overview of a DLCM. We evaluate the proposed approach on three HPE benchmarks. With significantly fewer parameters and lower computational complexity, it outperforms state-of-the-art methods.

In summary, the novel contributions of this paper are as follows:

  • To the best of our knowledge, this is the first attempt to explicitly learn the hierarchical compositionality of visual patterns via deep neural networks. As a result, DLCMs are capable of characterizing the complex and realistic compositional relationships among body parts.

  • We propose a novel part representation. It encodes the orientation, scale and shape of each part compactly and avoids their potentially large state spaces.

  • Compared with prior deep neural networks, e.g., CNNs, designed for HPE, our model has a hierarchical compositional structure and bottom-up/top-down inference stages across multiple semantic levels. We show in the experiments that the compositional nature of DLCMs helps them resolve the ambiguities that appear in bottom-up pose predictions.

2 Related Work

Compositional Models. Compositionality has been studied in several lines of vision research [13, 14, 24, 25] and exploited in tasks like HPE [12, 15,16,17,18,19, 26], semantic segmentation [27] and object detection [28]. However, prior compositional models adopt simple and unrealistic relational modeling, e.g., pairwise potentials based on Gaussian distributions. They are incapable of modeling complex compositional patterns. Our approach attempts to address this difficulty by learning the compositional relationships among body parts via the powerful CNNs. In addition, we exploit a novel part representation to compactly encode the scale, orientation and shape of each part and avoid their potentially large state spaces.

CNN-Based HPE. All state-of-the-art HPE systems take CNNs as their main building block [5,6,7, 9, 29]. Newell et al. [5] introduce a novel hourglass module to process and consolidate features across all scales to best capture the various spatial relationships associated with the body. Yang et al. [7] combine CNNs and the expressive deformable mixture of parts [30] to enforce the spatial and appearance consistency among body parts. Hu and Ramanan [29] unroll the inference process of hierarchical rectified Gaussians as bidirectional architectures that also reason with top-down feedback. Instead of predicting body joint positions directly, Sun et al. [31] regress the coordinate shifts between joint pairs to encode their interactions. It is worth noting that none of these methods decomposes entities as hierarchies of meaningful and reusable parts or infers across different semantic levels. Our approach differs from them in that: (1) It has a hierarchical compositional network architecture; (2) CNNs are used to learn the compositional relationships among body parts; (3) Its inference consists of both bottom-up and top-down stages across multiple semantic levels; (4) It exploits a novel part representation to supervise the training of CNNs.

Bone-Based Part Representations. Some prior works [32, 33] use heat maps of limbs between each pair of adjacent joints as supervisions of deep neural networks. Their motivation is that modeling pairs of joints helps capture additional body constraints and correlations. Unlike theirs, our bone-based part representation has (1) a hierarchical compositional structure and (2) multiple semantic levels. It is designed to (1) tightly encode the scale, orientation and shape of a part, (2) avoid exponentially large state spaces for higher-level parts and (3) guide CNNs to learn the compositionality of human bodies.

3 Our Approach

We first make a brief introduction to general compositional models (Sect. 3.1). Their inference steps are generalized as SLIS functions and modeled with CNNs (Sect. 3.2). We then describe our novel bone-based part representation (Sect. 3.3). Finally, the deeply learned compositional models are detailed in Sect. 3.4.

3.1 Compositional Models

A compositional model is defined on a hierarchical graph, as shown in Fig. 3. It is characterized by a 4-tuple \(({\mathcal {V}}, {\mathcal {E}}, \phi ^{and}, \phi ^{leaf})\), which specifies its graph structure \(({\mathcal {V}}, {\mathcal {E}})\) and potential functions \((\phi ^{and}, \phi ^{leaf})\). We consider two types of nodes: \({\mathcal {V}}={\mathcal {V}}^{and}\cup {\mathcal {V}}^{leaf}\). And-nodes \({\mathcal {V}}^{and}\) model the composition of subparts into higher-level parts. Leaf-nodes \({\mathcal {V}}^{leaf}\) model primitives, i.e., the lowest-level parts. We call the And-nodes at the highest level root nodes. \({\mathcal {E}}\) denotes the graph edges. In this section, we first illustrate our idea using the basic compositional model shown in Fig. 3(a), which does not share parts and considers only pairwise relationships, and then extend it to the general one shown in Fig. 3(b).

Fig. 3.

Example compositional models (a) without and (b) with part sharing and higher-order cliques

A state variable \(w_u\) is associated with each node/part \(u\in {\mathcal {V}}\). For HPE, it can be the position \(p_u\) and type \(t_u\) of this part: \(w_u = \{p_u, t_u\}\). As a motivating example, Yang and Ramanan [30] use types to represent orientations, scales and semantic classes (a straight versus a bent arm) of parts.

Let \(\varOmega \) denote the set of all state variables in the model. The probability distribution over \(\varOmega \) is of the following Gibbs form:

$$\begin{aligned} p(\varOmega |{\mathbf {I}}) = \frac{1}{Z}\exp \{-E(\varOmega , {\mathbf {I}}) \} \end{aligned}$$
(1)

where \({\mathbf {I}}\) is the input image, \(E(\varOmega , {\mathbf {I}})\) is the energy and Z is the partition function. For convenience, we use a score function \(S(\varOmega )\), defined as the negative energy, to specify the model and omit \({\mathbf {I}}\). Without part sharing and higher-order potentials, it can be written as:

$$\begin{aligned} S(\varOmega )\equiv -E(\varOmega , {\mathbf {I}}) =\sum _{u \in {\mathcal {V}}^{leaf}} \phi ^{leaf}_u(w_u, {\mathbf {I}}) + \sum _{u \in {\mathcal {V}}^{and}} \sum _{v\in ch(u)} \phi ^{and}_{u,v}(w_u, w_v) \end{aligned}$$
(2)

where ch(u) denotes the set of children of node u. The two terms are potential functions corresponding to Leaf-nodes and And-nodes, respectively. The first term acts like a detector: it determines how likely the primitive modeled by Leaf-node u is present at location \(p_u\) and of type \(t_u\). The second term models the state compatibility between a subpart v and its parent u.

Thanks to the tree structure, the optimal states \(\varOmega ^*\) for an input image \({\mathbf {I}}\) can be computed efficiently via dynamic programming. We call this process the compositional inference. It consists of two stages. In the bottom-up stage, the maximum score, i.e., \(\max _\varOmega S(\varOmega )\), is calculated recursively as:

$$\begin{aligned} (\text {Leaf})~ S_u^{\uparrow }(w_u)= & {} \phi _u^{leaf}(w_u, {\mathbf {I}}) \end{aligned}$$
(3)
$$\begin{aligned} (\text {And})~ S^{\uparrow }_u(w_u)= & {} \sum _{v\in ch(u)}\max _{w_v}[\phi _{u,v}^{and}(w_u, w_v) + S_v^{\uparrow }(w_v)] \end{aligned}$$
(4)

where \(S_u^\uparrow (w_u)\) is the maximum score of the subgraph formed by node u and all its descendants, with root node u taking state \(w_u\), and is computed recursively by Eq. (4), with boundary conditions provided by Eq. (3). The recursion begins from the Leaf-level and goes up until root nodes are reached. As a function, \(S_u^\uparrow (w_u)\) assigns each possible state of part u a score. It can also be considered as a tensor/map, each entry of which is indexed by the part’s state and valued by the corresponding score. Thus, we also call \(S_u^\uparrow (w_u)\) the score map of part u.

In the top-down stage, we recursively invert Eq. (4) to obtain the optimal states of child nodes that yield the maximum score:

$$\begin{aligned} (\text {Root}) ~~~\quad w_u^*= & {} {{\mathrm{\arg \!\max }}}_{w_u}S_u^{\downarrow }(w_u)\equiv {{\mathrm{\arg \!\max }}}_{w_u}S_u^{\uparrow }(w_u) \end{aligned}$$
(5)
$$\begin{aligned} (\text {Non-root})~ w_v^*= & {} {{\mathrm{\arg \!\max }}}_{w_v}S^{\downarrow }_v(w_v) \equiv {{\mathrm{\arg \!\max }}}_{w_v} [\phi _{u,v}^{and}(w_u^*, w_v) + S_v^{\uparrow }(w_v)] \end{aligned}$$
(6)

where node u in Eq. (6) is the unique parent of node v, i.e., \(\{u\}=pa(v)\), \(S_u^{\uparrow }(w_u)\) and \(S_v^{\uparrow }(w_v)\) are acquired from the bottom-up stage, and \(S^{\downarrow }_u(w_u)\) and \(S^{\downarrow }_v(w_v)\) are the respective refined score maps of nodes u and v. Specifically, \(w_u^*\) and \(w_v^*\) are the respective optimal states of parts u and v, computed recursively by Eq. (6), with boundary conditions provided by Eq. (5). The recursion begins from the root nodes and goes down until the Leaf-level is reached.
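To make the two stages concrete, here is a minimal NumPy sketch of the compositional inference in Eqs. (3)–(6) on a toy tree with discrete part states. All names (`Node`, `unary`, `pairwise`) and the random potentials are illustrative stand-ins, not the actual model.

```python
# Minimal sketch of the compositional inference (Eqs. (3)-(6)) on a toy tree.
# States are discrete indices 0..K-1; potentials are random for illustration.
import numpy as np

class Node:
    def __init__(self, name, children=()):
        self.name, self.children = name, list(children)

def bottom_up(node, unary, pairwise, S_up, best_child):
    """Eq. (3)/(4): S_up[u][w_u] = max score of the subtree rooted at u."""
    if not node.children:                          # Leaf-node, Eq. (3)
        S_up[node.name] = unary[node.name]
        return
    total = 0.0
    for child in node.children:
        bottom_up(child, unary, pairwise, S_up, best_child)
        # scores[w_u, w_v] = phi^and(w_u, w_v) + S_up_v(w_v), then max over w_v
        scores = pairwise[(node.name, child.name)] + S_up[child.name][None, :]
        best_child[(node.name, child.name)] = scores.argmax(axis=1)
        total = total + scores.max(axis=1)         # sum over children, Eq. (4)
    S_up[node.name] = total

def top_down(node, best_child, w_star):
    """Eq. (6): given the parent's optimal state, back-track the children's."""
    for child in node.children:
        w_star[child.name] = best_child[(node.name, child.name)][w_star[node.name]]
        top_down(child, best_child, w_star)

K = 4                                              # states per part
rng = np.random.default_rng(0)
leaves = [Node("lower_arm"), Node("upper_arm")]
root = Node("arm", leaves)
unary = {v.name: rng.standard_normal(K) for v in leaves}
pairwise = {("arm", v.name): rng.standard_normal((K, K)) for v in leaves}
S_up, best_child, w_star = {}, {}, {}
bottom_up(root, unary, pairwise, S_up, best_child)
w_star["arm"] = int(S_up["arm"].argmax())          # root, Eq. (5)
top_down(root, best_child, w_star)                 # Eq. (6)
print(w_star)                                      # optimal state of every part
```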

Fig. 4.

Illustration of input-output relationships between child and parent score maps in the compositional inference. In this example, node u has two children \(v_1\) and \(v_2\). (a) In the bottom-up stage, the score map of a higher-level part is a function of its children’s score maps. (b) In the top-down stage, the score map of a lower-level part is refined by its parent’s score map updated one step earlier

3.2 Spatially Local Information Summarization

From Eq. (6), \(S^{\downarrow }_v(w_v)\) for non-root nodes is defined as:

$$\begin{aligned} S^{\downarrow }_v(w_v) = \phi _{u,v}^{and}(w_u^*, w_v) + S_v^{\uparrow }(w_v) \end{aligned}$$
(7)

where \(\{u\}=pa(v)\), \(w_u^* = {{\mathrm{\arg \!\max }}}_{w_u}S_u^{\downarrow }(w_u)\). We can write the bottom-up (BU) and top-down (TD) recursive equations, i.e., Eqs. (4) and (7), together as

$$\begin{aligned} (\text {BU})~ S^{\uparrow }_u(w_u)= & {} \sum _{v\in ch(u)}\max _{w_v}[\phi _{u,v}^{and}(w_u, w_v) + S_v^{\uparrow }(w_v)] \end{aligned}$$
(8)
$$\begin{aligned} (\text {TD})~S^{\downarrow }_v(w_v)= & {} \sum _{w_u}\phi _{u,v}^{and}(w_u, w_v)\bar{S}_u^{\downarrow }(w_u) + S_v^{\uparrow }(w_v) \end{aligned}$$
(9)

where \(\bar{S}_u^{\downarrow }(w_u)\) is the hard-thresholded version of \({S}_u^{\downarrow }(w_u)\): \(\bar{S}_u^{\downarrow }(w_u)\) equals 1 if \(w_u=w_u^*\) and 0 otherwise. As illustrated in Fig. 4, these two equations intuitively demonstrate how score maps are propagated upwards and downwards in the inference process, which finally gives us the globally optimal states \(\varOmega ^*\) of the compositional model.

In both equations, there exist summation and/or maximization operations over state variables, e.g., \(\sum _{v\in ch(u)}\max _{w_v}\) and \(\sum _{w_u}\), as well as between score maps. They can be considered as average and max pooling. In the statistical learning literature [34], pooling combines features in a way that preserves task-related information while removing irrelevant details, leading to more compact representations and better robustness to noise and clutter. In the compositional inference, score maps of some parts are combined to extract relevant information about the states of other related parts. This analogy leads us to view Eqs. (8) and (9) as different kinds of information summarization.

Fig. 5.

(a) Illustration of the SLIS function in the compositional inference. Each cube denotes a score map corresponding to a part or subpart. Each entry in the output/right score map is obtained by aggregating information from the input/left score maps on a local spatial support. (b) Illustration of bone-based part representations. First row: the right lower arm, right upper arm, right arm and left arm of a person. Second row: right or left legs of different persons

Since child and parent parts should not be far apart in practice, it is unnecessary to search for them over the whole image [14, 35, 36]. Thus, it is reasonable to constrain their relative displacement to lie within a small range: \(p_v-p_u \in \mathbb {D}_{uv}\), e.g., \(\mathbb {D}_{uv}=[-50,50]\times [-50,50]\). For compositional models, this constraint can be enforced by setting \(\phi _{u,v}^{and}(w_u, w_v)=0\) if \(p_v-p_u\notin \mathbb {D}_{uv}\). Consequently, for each entry of the score maps on the LHS of Eqs. (8) and (9), only information within a local spatial region is summarized on the RHS, as in the mapping shown in Fig. 5(a). Note that this mapping is also location-invariant, because the spatial compatibility between parts u and v with types \(t_u\) and \(t_v\) depends only on their relative locations and is unrelated to their global coordinates in the image space.
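As a rough illustration of this locality, the sketch below adopts an extreme simplification in which \(\phi_{u,v}^{and}\) merely gates the displacement range; the inner maximization in Eq. (8) then reduces to a stride-1 max-pool over a \(101\times 101\) window, applied identically at every location. The function and tensor shapes are assumptions for illustration.

```python
# Hypothetical simplification of Eq. (8): if phi^and only gates whether
# p_v - p_u lies in D_uv = [-50, 50]^2, the max over w_v is a local max-pool.
import torch
import torch.nn.functional as F

def bottom_up_slis(child_maps, window=101):        # window spans [-50, 50]
    """child_maps: list of (B, 1, H, W) score maps S_v^(up), v in ch(u).
    Returns S_u^(up): each entry aggregates a local spatial support."""
    pad = window // 2
    pooled = [F.max_pool2d(m, kernel_size=window, stride=1, padding=pad)
              for m in child_maps]                 # max over p_v - p_u in D_uv
    return sum(pooled)                             # sum over v in ch(u)

maps = [torch.randn(1, 1, 64, 64) for _ in range(2)]
print(bottom_up_slis(maps).shape)                  # torch.Size([1, 1, 64, 64])
```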

Our analysis indicates both recursive equations can be considered as different instantiations of a more generalized process, which aggregates information on a local spatial support and is location-invariant. We call this process spatially local information summarization (SLIS) and illustrate it in Fig. 5(a). In the bottom-up stage, the score map of a higher-level part \(S^{\uparrow }_u(w_u)\) is an SLIS function of its children’s score maps \(\{S_v^{\uparrow }(w_v)\}_{v\in ch(u)}\). In the top-down stage, the score map of a lower-level part \(S^{\downarrow }_v(w_v)\) is an SLIS function of its parent’s score map \({S}_u^{\downarrow }(w_u)\) as well as its own score map estimated in the bottom-up stage, \(S_v^{\uparrow }(w_v)\).

Model SLIS Functions with CNNs. In this paper, we exploit CNNs to model our SLIS functions for two reasons. First, CNNs aggregate information on a local spatial support using location-invariant parameters. Second, CNNs are known for their capability to approximate inference functions. By learning them from data, we expect the SLIS functions to be capable of inferring the sophisticated compositional relationships within real human bodies. Specifically, we replace Eqs. (8) and (9) with:

$$\begin{aligned}&\text {(BU)}~ S^{\uparrow }_u(w_u) = {\mathbf {c}}_{u}^\uparrow \big ( \{S_v^{\uparrow }(w_v)\}_{v\in ch(u)}; \varTheta _{u}^\uparrow \big ) \end{aligned}$$
(10)
$$\begin{aligned}&\text {(TD)}~ S^{\downarrow }_v(w_v) = {\mathbf {c}}_{v}^\downarrow \big ( {S}_u^{\downarrow }(w_u), S_v^{\uparrow }(w_v); \varTheta _{v}^\downarrow \big ) \end{aligned}$$
(11)

where \({\mathbf {c}}_{u}^\uparrow \) and \({\mathbf {c}}_{v}^\downarrow \) are CNN mappings with \(\varTheta _{u}^\uparrow \) and \(\varTheta _{v}^\downarrow \) being their respective collections of convolutional kernels. Since the bottom-up and top-down SLIS functions are different, their corresponding kernels should also be different.
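As a hedged sketch, Eqs. (10) and (11) could be realized as below. The three-layer convolutional stack is purely illustrative; in the paper the CNN blocks are instantiated with hourglass modules (Sect. 4.1).

```python
# Illustrative SLIS modules for Eqs. (10) and (11). The kernel sizes and
# channel widths are assumptions; the paper uses hourglass modules instead.
import torch
import torch.nn as nn

def slis_cnn(in_ch, out_ch, width=64):
    return nn.Sequential(
        nn.Conv2d(in_ch, width, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(width, width, 7, padding=3), nn.ReLU(inplace=True),
        nn.Conv2d(width, out_ch, 1),
    )

class BottomUpSLIS(nn.Module):
    """Eq. (10): S_u^(up) = c_u^(up)({S_v^(up)}_{v in ch(u)}; Theta_u^(up))."""
    def __init__(self, n_children):
        super().__init__()
        self.cnn = slis_cnn(n_children, 1)
    def forward(self, child_maps):                 # list of (B, 1, H, W)
        return self.cnn(torch.cat(child_maps, dim=1))

class TopDownSLIS(nn.Module):
    """Eq. (11): S_v^(down) = c_v^(down)(S_u^(down), S_v^(up); Theta_v^(down))."""
    def __init__(self):
        super().__init__()
        self.cnn = slis_cnn(2, 1)
    def forward(self, parent_map, own_up_map):     # both (B, 1, H, W)
        return self.cnn(torch.cat([parent_map, own_up_map], dim=1))
```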

Part Sharing and Higher-Order Potentials. We now consider a more general compositional model, as shown in Fig. 3(b). With part sharing and higher-order potentials, the score function is

$$\begin{aligned} S(\varOmega )&=\sum _{u \in {\mathcal {V}}^{leaf}} \phi ^{leaf}_u(w_u, {\mathbf {I}}) + \sum _{u \in {\mathcal {V}}^{and}} \phi ^{and}_{u}(w_u, \{w_v\}_{v\in ch(u)}) \end{aligned}$$
(12)

where \(\phi ^{and}_{u}(w_u, \{w_v\}_{v\in ch(u)})\) denotes the higher-order potential function measuring the state compatibility among part u and its child parts \(\{v:v\in ch(u)\}\).

Due to the existence of loops and child sharing, states of all parts at one level should be estimated/refined jointly from all parts at a lower/higher level. By exploiting the update rules of dynamic programming [25], similar derivations (available in the supplementary material) indicate that we can approximate the SLIS functions as follows:

$$\begin{aligned}&\text {(BU)}~ \{S^{\uparrow }_u(w_u)\}_{u\in {\mathcal {V}}^L} = {\mathbf {c}}_{L}^\uparrow \big ( \{S_v^{\uparrow }(w_v)\}_{v\in {\mathcal {V}}^{L-1}}; \varTheta _{L}^\uparrow \big ) \end{aligned}$$
(13)
$$\begin{aligned}&\text {(TD)}~ \{S^{\downarrow }_v(w_v)\}_{v\in {\mathcal {V}}^{L-1}} = {\mathbf {c}}_{L-1}^\downarrow \big (\{S^{\downarrow }_u(w_u)\}_{u\in {\mathcal {V}}^L}, \{S^{\uparrow }_v(w_v)\}_{v\in {\mathcal {V}}^{L-1}};\varTheta _{L-1}^\downarrow \big ) \end{aligned}$$
(14)

where L indexes the semantic level, \({\mathcal {V}}^L\) denotes the set of nodes at the Lth level, \(\varTheta _{L}^\uparrow \) and \(\varTheta _{L-1}^\downarrow \) are convolutional kernels. In the bottom-up stage, score maps at a higher level are jointly estimated from all score maps at one level lower. In the top-down stage, score maps at a lower level are jointly refined by all score maps at one level higher as well as their initial estimations in the bottom-up stage.
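A level-wise variant of the sketch above, again hypothetical, could look as follows; the channel counts mirror the 16-joint and 12-part levels of Fig. 6(a).

```python
# Illustrative level-wise SLIS for Eqs. (13) and (14): one CNN per level
# jointly maps all score maps of the adjacent level(s).
import torch
import torch.nn as nn

class LevelSLIS(nn.Module):
    def __init__(self, in_ch, out_ch, width=64):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(in_ch, width, 7, padding=3), nn.ReLU(inplace=True),
            nn.Conv2d(width, out_ch, 1),
        )
    def forward(self, *stacks):                    # each (B, C_i, H, W)
        return self.cnn(torch.cat(stacks, dim=1))

bu_level2 = LevelSLIS(16, 12)       # Eq. (13): 16 level-1 joints -> 12 parts
td_level1 = LevelSLIS(12 + 16, 16)  # Eq. (14): level-2 (down) + level-1 (up)
```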

3.3 Bone-Based Part Representation

Another problem with existing compositional models is that the type spaces for higher-level parts are potentially large. For example, if we have N types for both the left lower leg and left upper leg, there can be \(O(N^2)\) types for the whole left leg and \(O(N^4)\) types for the composition of the left and right legs. As a result, the type dimensions of score maps \(S^{\uparrow }_u(w_u)\) and \(S^{\downarrow }_u(w_u)\) would be very high, which makes both storage and computation demanding. To address this issue, we propose to embed the type information into score maps and use state variables to denote only locations. As shown in Fig. 5(b), we represent each part with its bones, which are generated by placing Gaussian kernels along the part segments. They are then taken as the ground truth of score maps \(S^{\uparrow }_u(w_u)\) and \(S^{\downarrow }_u(w_u)\) when training the neural networks. Specifically, for each point on the line segments of a part, we generate a heat map with a 2D Gaussian (std = 1 pixel) centered at it. A single heat map is then formed by taking the maximum value over these heat maps at each position.
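A minimal NumPy sketch of this construction, with illustrative joint coordinates, is given below; each part's bone map is the pixel-wise maximum over Gaussians placed along its segments.

```python
# Sketch of the bone-based ground truth: sample points along each part
# segment, place a 2D Gaussian (std = 1 px) at each, take the pixel-wise max.
import numpy as np

def bone_map(segments, h, w, sigma=1.0, n_samples=50):
    """segments: list of ((x1, y1), (x2, y2)) bone segments of one part."""
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float64)
    out = np.zeros((h, w))
    for (x1, y1), (x2, y2) in segments:
        for t in np.linspace(0.0, 1.0, n_samples):  # points on the segment
            cx, cy = x1 + t * (x2 - x1), y1 + t * (y2 - y1)
            g = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
            out = np.maximum(out, g)                # location-wise maximum
    return out

# e.g. a level-2 left leg composed of upper- and lower-leg segments
# (coordinates are made up for illustration):
leg = bone_map([((20, 10), (24, 30)), ((24, 30), (22, 50))], h=64, w=64)
```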

Our novel part representation has several advantages. First, score maps are now 2-D matrices with no type dimension instead of 3-D tensors. This reduces space and computation complexities in score map predictions. Second, the bones compactly encode orientations, scales and shapes of parts, as shown in Fig. 5(b). We no longer need to discretize them via clustering [12, 15,16,17,18,19, 26]. One weakness of this representation is that the ends of parts are indistinguishable. To solve this problem, we augment score maps of higher-level parts with score maps of their ends. In this way, all important information of parts can be retained.

Fig. 6.

(a) The compositional structure of a human body used in our experiments. It has three semantic levels, which include 16, 12 and 6 parts, respectively. All children sharing a common parent are assumed to be linked to each other. (b) Network architecture of the proposed DLCM. Maps in the rectangles are short for score maps

3.4 Deeply Learned Compositional Model (DLCM)

Motivated by the reasoning above, our Deeply Learned Compositional Model (DLCM) exploits CNNs to learn the compositionality of human bodies for HPE. Figure 6(b) shows an example network based on Eqs. (13) and (14). It has a hierarchical compositional architecture and bottom-up/top-down inference stages. In the bottom-up stage, score maps of the target joints are first regressed directly from the image observations, as in existing CNN-based HPE methods. Then, score maps of higher-level parts are recursively estimated from those of their children. In the top-down stage, score maps of lower-level parts are recursively refined using their parents’ score maps as well as their own score maps estimated in the bottom-up stage. As in [37], a Mean Squared Error (MSE) loss is applied to compare predicted score maps with the ground truth. In this way, we can guide the network to learn the compositional relationships among body parts. Examples of score maps predicted by our DLCM in the bottom-up and top-down stages can be found in Fig. 8(a).
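As a small illustration of this training signal, the sketch below (with illustrative names) sums an MSE loss over the score maps predicted at each of the five inference steps in Fig. 6(b), each compared against its bone-map ground truth:

```python
# Per-step supervision: one MSE term per inference step (bottom-up levels
# 1-3, then top-down levels 2 and 1), summed into the training loss.
import torch.nn as nn

mse = nn.MSELoss()

def dlcm_loss(predicted_maps, target_maps):
    """Both arguments: lists of (B, C, H, W) score-map tensors, one entry
    per inference step in Fig. 6(b)."""
    return sum(mse(p, t) for p, t in zip(predicted_maps, target_maps))
```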

4 Experiments

4.1 Implementation Details

The proposed DLCM is a general framework and can be instantiated with any compositional body structure and CNN modules. In the experiments, we use a compositional structure similar to that in [12] but include higher-order cliques and part sharing. As shown in Fig. 6(a), it has three semantic levels, which include 16, 12 and 6 parts, respectively. All children sharing a common parent are assumed to be linked to each other. The whole human body is not included here since it has a negligible effect on overall performance while complicating the model.

For two reasons, we exploit the hourglass module [5] to instantiate the CNN blocks in Fig. 6(b). First, the hourglass module extends the fully convolutional network [38] by processing and consolidating features across multiple scales. This enables it to capture the various spatial relationships associated with the input score maps. Second, the eight-stack hourglass network [5], formed by sequentially stacking eight hourglass modules, has achieved state-of-the-art results on several HPE benchmarks. It serves as a suitable baseline to test the effectiveness of the proposed approach. To instantiate a DLCM with three semantic levels, we need five hourglass modules, i.e., the five CNN blocks in Fig. 6(b). Newell et al. [5] add the intermediate features used to predict part score maps back to these predictions via skip connections before they are fed into the next hourglass. We follow this design in our implementation and find it helps reduce overfitting.
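A rough sketch of this inter-module connection, with illustrative channel sizes, is given below: the intermediate features and the score-map predictions are both remapped by \(1\times 1\) convolutions and added back before the next hourglass.

```python
# Sketch of the skip connections adopted from Newell et al. [5]. Channel
# counts (256 features, 16 score maps) are assumptions for illustration.
import torch.nn as nn

class InterStage(nn.Module):
    def __init__(self, feat_ch=256, n_maps=16):
        super().__init__()
        self.remap_feat = nn.Conv2d(feat_ch, feat_ch, 1)
        self.remap_pred = nn.Conv2d(n_maps, feat_ch, 1)
    def forward(self, stage_in, features, preds):
        # next hourglass input = previous input + remapped intermediate
        # features + remapped score-map predictions
        return stage_in + self.remap_feat(features) + self.remap_pred(preds)
```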

Our approach is evaluated on three HPE benchmark datasets of increasing difficulty: FLIC [39], Leeds Sports Poses (LSP) [40] and MPII Human Pose [21]. The FLIC dataset is composed of 5003 images (3987 for training, 1016 for testing) taken from films. The images are annotated on the upper body, with most figures facing the camera. The extended LSP dataset consists of 11k training images and 1k testing images from sports activities. As is common practice [6, 9, 41], we train the network by including the MPII training samples. A few joint annotations in the LSP dataset are on the wrong side; we manually correct them. The MPII dataset consists of around 25k images with 40k annotated samples (28k for training, 11k for testing). The images cover a wide range of everyday human activities and a great variety of full-body poses. Following [5, 42], 3k samples are taken as a validation set to tune the hyper-parameters.

Table 1. Comparisons of PCK@0.2 scores on the FLIC testing set
Table 2. Comparisons of PCK@0.2 scores on the LSP testing set

Each input image is cropped around the target person according to the annotated body position and scale, and then resized to \(256 \times 256\) pixels. Data augmentation based on affine transformations [48, 50] is used to reduce overfitting. We implement DLCMs using Torch [51] and optimize them via RMSProp [52] with batch size 16. The learning rate is initialized to \(2.5\times 10^{-4}\) and dropped by a factor of 10 after the validation accuracy plateaus. In the testing phase, we run both the original input and a flipped version of a six-scale image pyramid through the network and average the estimated score maps together [49]. The final prediction is the maximum activating location of the score map for a given joint, as predicted by the last CNN module.
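The testing procedure could be sketched as follows; the scale values, the `model` interface (returning one set of score maps per CNN module) and the omitted left/right joint swapping after flipping are assumptions for illustration.

```python
# Sketch of test-time averaging over a flipped copy and a six-scale pyramid,
# followed by per-joint argmax on the last module's score maps.
import torch
import torch.nn.functional as F

@torch.no_grad()
def predict(model, img, scales=(0.8, 0.9, 1.0, 1.1, 1.2, 1.3)):
    """img: (1, 3, 256, 256). Returns (1, J, 2) joint coordinates (x, y)."""
    acc = 0
    for s in scales:
        x = F.interpolate(img, scale_factor=s, mode="bilinear",
                          align_corners=False)
        for flip in (False, True):
            xi = torch.flip(x, dims=[3]) if flip else x
            maps = model(xi)[-1]                   # last CNN module's output
            if flip:                               # un-flip (a real system
                maps = torch.flip(maps, dims=[3])  # also swaps L/R joints)
            acc = acc + F.interpolate(maps, size=img.shape[2:],
                                      mode="bilinear", align_corners=False)
    B, J, H, W = acc.shape
    idx = acc.view(B, J, -1).argmax(dim=2)
    return torch.stack([idx % W, idx // W], dim=2).float()
```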

4.2 Evaluation

Metrics. Following previous work, we use the Percentage of Correct Keypoints (PCK) [21] as the evaluation metric. It calculates the percentage of detections that fall within a normalized distance of the ground truth. For LSP and FLIC, the distance is normalized by the torso size, and for MPII, by a fraction of the head size (referred to as PCKh).
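For concreteness, a minimal sketch of the metric under this normalization:

```python
# PCK@alpha: a detection is correct if it falls within alpha times the
# reference size (torso for FLIC/LSP, head for PCKh) of the ground truth.
import numpy as np

def pck(pred, gt, ref_size, alpha=0.2):
    """pred, gt: (N, J, 2) joint coordinates; ref_size: (N,) normalizers.
    Returns per-joint accuracy of shape (J,)."""
    dist = np.linalg.norm(pred - gt, axis=2)       # (N, J) pixel distances
    return (dist <= alpha * ref_size[:, None]).mean(axis=0)
```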

Table 3. Comparisons of PCKh@0.5 scores on the MPII testing set
Table 4. Comparisons of parameter and operation numbers

Accuracies. Tables 1, 2 and 3 respectively compare the performances of our 3-level DLCM and the most recent state-of-the-art HPE methods on FLIC, LSP and MPII datasets. Our approach clearly outperforms the eight-stack hourglass network [5], especially on some challenging joints. On the FLIC dataset, it achieves 1.5% improvement on wrist and halves the overall error rate (from 2% to 1%). On the MPII dataset, it achieves 2.6%, 2.0%, 1.7%, 1.6% and 1.4% improvements on ankle, knee, hip, wrist and elbow, respectively. On all three datasets, our approach achieves superior performance to the state-of-the-art methods.

Complexities. Table 4 compares the complexities of our 3-level DLCM with the eight-stack hourglass network [5] as well as the current state-of-the-art method [49]. Using only five hourglass modules instead of eight [5, 49], our model has significantly fewer parameters and lower computational complexity. Notably, the prior top-performing method [49] on the benchmarks has 74% more parameters and needs 37% more GFLOPS.

Summary. From Tables 1, 2, 3 and 4, we can see that with significantly fewer parameters and lower computational complexity, the proposed approach achieves overall superior performance to the state-of-the-art methods.

4.3 Component Analysis

We analyze the effectiveness of each component of DLCMs on the MPII validation set. Mean PCKh@0.5 over the hard joints, i.e., ankle, knee, hip, wrist and elbow, is used as the evaluation metric. A DLCM with two semantic levels is taken as the basic model. Model (i), \(i\in \{1,2,3,4,5\}\), denotes one of the five variants of the basic model shown in Fig. 7(a).

Fig. 7.

(a) Component analysis on MPII validation set. See Sect. 4.3 for details. (b) Qualitative results obtained by our approach on the MPII (top row) and LSP (bottom row) testing sets

To see the importance of compositional architectures, we successively remove the top-down lateral connections and compositional part supervisions, which leads to Model (1) and Model (2). Figure 7(a) indicates that both variants, especially the second one, perform worse than the basic model.

In Model (3), we replace bone-based part representations in the basic model with conventional part representations, i.e., cubes in Fig. 5(a). Following [12], we use K-means to cluster each of the 12 higher-level parts into N types. Since a part sample is assigned to one type, only 1 of its N score map channels is nonzero (with a Gaussian centered at the part location). We have tested \(N=15\) [12] and \(N=30\) and reported the better result. As shown in Fig. 7(a), the novel bone-based part representation significantly outperforms the conventional one.

Finally, we explore whether using more semantic levels in a DLCM can boost its performance. Model (4) is what we have used in Sect. 4.2. Model (5) has 4 semantic levels. The highest-level part is the whole human body. Its ground truth bone map is the composition (location-wise maximum) of its children’s bone maps. Figure 7(a) shows that the 3-level DLCM performs much better than the 2-level model. However, with 38% more parameters and 27% more GFLOPS, the 4-level DLCM only marginally outperforms the 3-level model.

Fig. 8.

(a) Score maps obtained by our method on some unseen images in the bottom-up (BU) and top-down (TD) inference stages. The five columns correspond to the five inference steps in Fig. 6(b). Due to space limitations, only score maps corresponding to one of the six level-2 parts are displayed for the example at each row. From top to bottom, the level-2 parts are left leg, right leg, left arm, left leg and right arm, respectively. Within each sub-figure, parts of the same level are ordered by their distances to the body center. (b) Some examples showing that a 3-level DLCM (bottom row) is able to resolve the ambiguities that appear in the bottom-up pose predictions of an 8-stack hourglass network (top row). Wrong part localizations are highlighted by green ellipses (Color figure online)

4.4 Qualitative Results

Figure 7(b) displays some pose estimation results obtained by our approach. Figure 8(a) visualizes some score maps obtained by our method in the bottom-up (BU) and top-down (TD) inference stages. The evolution of these score maps demonstrates how the learned compositionality helps resolve the low-level ambiguities that appear in high-level pose estimations. The uncertain bottom-up estimations of the left ankle, right ankle and right elbow respectively in the first, second and fifth examples are resolved by the first-level compositions. In some more challenging cases, one level of composition is not enough to resolve the ambiguities, e.g., the bottom-up predictions of the left lower arm in the third example and the left lower leg in the fourth example. Thanks to the hierarchical compositionality, their uncertainties can be reduced by the higher-level relational models. Figure 8(b) shows that our DLCM can resolve the ambiguities that appear in bottom-up pose predictions of an 8-stack hourglass network.

5 Conclusion

This paper exploits deep neural networks to learn the complex compositional patterns within human bodies for pose estimation. We also propose a novel bone-based part representation to avoid potentially large state spaces for higher-level parts. Experiments demonstrate the effectiveness and efficiency of our approach.