1 Introduction

Maximum a-posteriori (MAP) inference in graphical models, and especially in random fields defined over image domains, is one of the most useful tools in computer vision and related fields. If all potentials defining the objective are of parametric (or, better, convex) shape, non-linear optimization is sometimes the method of choice for best efficiency. On the other hand, if the potentials are of non-parametric shape (or highly non-convex), then well-established methods such as loopy belief propagation (BP) or its convex variants are the methods of choice. BP and related algorithms are highly successful and relatively efficient if the underlying state space is rather small, but they suffer from two limitations in the case of large state spaces. First, the intrinsic message passing step has at least linear, and generally superlinear, run-time complexity in the state space size, so the runtime of these methods does not scale well with the number of states; moreover, even computing all terms (potentials) needed to define the problem instance can already be prohibitive. Second, the memory consumption grows linearly with the state space size, since belief propagation requires maintaining messages for each state. These shortcomings of BP are well known, and several approaches proposed in the literature (reviewed in Sect. 1.1) attempt to overcome these limitations.

Fig. 1. (a)–(d) Extracted primal solutions for a stereo correspondence problem after 2, 4, 8 and 16 passes over the image. (e) Primal solution obtained by full scale MAP inference. (f)–(i) Extracted primal solutions for optical flow problems after 2, 4, 8 and 16 passes. (j) Result after 256 iterations (22 s) as approximation to intractable full scale MAP inference. [Best viewed on screen.]

In this work we derive an algorithm similar in spirit to particle belief propagation (PBP, and its more recent flavors) by considering the following nested optimization problem: for a given inference problem, determine the surrogate problem (represented by its potentials) among the set of “easy” problems (for which MAP inference is efficient) that is most similar to the given one. This is a nested problem since both the surrogate problem and the MAP solution need to be determined jointly. One contribution of this work is that the joint search over potentials and MAP solutions can be carried out efficiently, and the resulting algorithm bears a striking resemblance to PBP by alternating state selection and message passing steps. In contrast to PBP, which is rooted in approximating intractable distributions, our method aims directly at finding an accurate MAP solution.

We also show that our proposed method is always convergent when applied to problems defined on tree graphs, and experimentally we found convergent behavior of its loopy extension on lattices. To our knowledge no such guarantee is available for any version of particle-based BP.

We illustrate the performance of the method in Fig. 1 on dense correspondence problems, showing the evolution of the label assignments returned by our method after the given numbers of passes over the image. It can be seen that after 16 passes (<2 s CPU runtime) the label assignment has essentially converged. More important than the run-time is the memory consumption: in the optical flow scenario our method requires about 32 MB (which compares to 16 GB, or 694 MB if the data terms are computed on the fly, required by a message passing method exploiting the specific product label structure [20]). This also makes our approach appealing for embedded and mobile implementations.

1.1 Related Work

There is a huge amount of literature on MAP inference algorithms and efficient implementations, and we focus on methods specifically addressing large state spaces. A number of approaches approximate the full-scale messages required in belief propagation by sets of particles to cope with large or even continuous state spaces, e.g. [3, 12, 14, 16, 17, 24]. These methods augment MAP inference with a stochastic particle selection step to approximate otherwise intractable distributions (or message functions) arising in the propagation step. They differ largely in how new candidate states (particles) are sampled and filtered, and some approaches are specifically designed for dense correspondence estimation. While these methods show good performance in practice, only a few (and often weak) quantitative results are known for message passing in a stochastic context (e.g. [9, 15, 16]).

“Active MAP inference” [18] explicitly addresses the case when the computation of unary potentials is the main run-time bottleneck: unary potentials can be queried during MAP inference, and states selected for instantiating the unary potentials are ranked based on aggregated belief. Ranking “unseen” states requires repeated MAP inference. In contrast, particle max-product methods (and our approach) interleave MAP inference and candidate state scoring and are therefore very efficient for large state spaces. A different nested approach for MAP inference with large or continuous state spaces is considered in [17, 27], but in contrast to this work the surrogate problems are constructed to be lower bounds for the target inference problem. [21] mentions, but does not explore further, a primal “majorize-minimize” framework, which shares the use of majorizing potentials (albeit in a different context) with our approach (Footnote 1).

1.2 Background

This section presents some material on MAP inference and the underlying linear programming relaxation. A labeling or MAP inference problem consists of determining the optimal label \(x_s \in \mathcal{L}\) assigned to each node \(s \in \mathcal V\), where the objective is a sum of unary and pairwise terms (Footnote 2),

$$\begin{aligned} {\mathbf x}^*&{\mathop {=}\limits ^{\text {def}}}\arg \min _{{\mathbf x}} \sum _{s \in \mathcal V} \theta _s(x_s) + \sum _{(s,t) \in \mathcal{E}} \theta _{st}(x_s, x_t) = \arg \min _{{\mathbf x}} \sum _\alpha \theta _\alpha ({\mathbf x}_\alpha ), \end{aligned}$$
(1)

where \({\mathbf x} = (x_s)_{s \in \mathcal V} \in \mathcal{L}^{|\mathcal{V}|}\), and \(\mathcal E\) is a problem-specific set of undirected edges. \(\theta \) denotes the potentials or costs for assigning particular states to nodes and edges. We will use \(\alpha \in \mathcal{V} \cup \mathcal{E}\) to index unary and pairwise cliques in a unified way. The above label assignment problem is generally intractable, and one highly successful approach to solving it approximately is to employ the corresponding linear programming (LP) relaxation (see e.g. [26]; Footnote 3),

$$\begin{aligned} E_{\text {MAP}}(\tau ; \theta )&{\mathop {=}\limits ^{\text {def}}}\sum _{\alpha , {\mathbf x}_\alpha } \theta _\alpha ({\mathbf x}_\alpha ) \tau _\alpha ({\mathbf x}_\alpha ) \qquad \text {s.t. } \tau _s(x_s) = \sum _{{\mathbf x}_\alpha \setminus x_s} \tau _\alpha ({\mathbf x}_\alpha ), \end{aligned}$$
(2)

together with \(\sum _{x_s} \tau _s(x_s) = 1\) and \(\tau _\alpha ({\mathbf x}_\alpha ) \ge 0\). The unknowns \(\{\tau _s\}_{s \in \mathcal V}\) and \(\{\tau _{st}\}_{(s,t) \in \mathcal E}\) are “one-hot” encodings of the assigned labels: if \(\tau ^*\) is the optimal solution of \(E_{\text {MAP}}\) and the relaxation is tight, then \(\tau _s(x_s)\) is 1 iff state \(x_s\) is the optimal label at node s, and 0 otherwise. The constraints linking \(\tau _s\) and \(\tau _\alpha \) are usually called marginalization constraints, and the unit sum constraint is typically referred to as the normalization constraint. The linear program in Eq. 2 is not unique, since redundant non-negativity and normalization constraints can be added to \(E_{\text {MAP}}\) without affecting the optimal solution or value. Consequently, different duals are considered in the literature. A particular dual program is the following,

$$\begin{aligned} E_{\text {MAP}}^*(\lambda ; \theta )&= \sum _\alpha \min _{{\mathbf x}_\alpha } \left\{ \theta _\alpha ({\mathbf x}_\alpha ) - \sum \nolimits _{s \in \alpha } \lambda _{\alpha \rightarrow s}(x_s) \right\} \end{aligned}$$
(3)

subject to \(\sum _{\alpha \ni s} \lambda _{\alpha \rightarrow s}(x_s) = 0\) for all s and \(x_s\). Recall that cliques \(\alpha \) range over nodes and edges. The dual unknowns \(\{\lambda _{\alpha \rightarrow s}\}_{s \in \mathcal{V}, \alpha \ni s}\) are the Lagrange multipliers for the marginalization constraints, which are often termed “messages” [7, 22, 23, 26].
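To make the quantities in Eqs. 1–3 concrete, the following minimal sketch (a toy example with invented potentials, not the implementation described in Sect. 5) computes the MAP value of Eq. 1 on a chain by min-sum belief propagation, where the upward message aggregates the best cost of the already processed subchain:

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>
#include <limits>

// Minimal min-sum BP on a chain s = 0..N-1 with L labels (a sketch only).
// mu[s][x] = cost of the best labeling of nodes 0..s-1, including the edge
// into node s, given that node s takes label x.
int main() {
  const int N = 4, L = 3;
  // Illustrative potentials: unary theta_s(x) and Potts-like pairwise costs.
  auto unary = [](int s, int x) { return double((s + x) % 3); };
  auto pair  = [](int xs, int xt) { return xs == xt ? 0.0 : 1.0; };

  std::vector<std::vector<double>> mu(N, std::vector<double>(L, 0.0));
  for (int s = 1; s < N; ++s)
    for (int x = 0; x < L; ++x) {
      double best = std::numeric_limits<double>::infinity();
      for (int xp = 0; xp < L; ++xp)   // minimize over the previous label
        best = std::min(best, mu[s-1][xp] + unary(s-1, xp) + pair(xp, x));
      mu[s][x] = best;
    }
  // MAP value: minimize the belief at the last node (the "root" of the chain).
  double vstar = std::numeric_limits<double>::infinity();
  for (int x = 0; x < L; ++x)
    vstar = std::min(vstar, mu[N-1][x] + unary(N-1, x));
  std::printf("MAP value v* = %.1f\n", vstar);
}
```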

2 A Nested Approach to MAP Inference

This section describes our framework designed for MAP inference in large state spaces. First, we present a generic, high-level view, which will be made more concrete in subsequent sections to obtain efficiently implementable algorithms.

For given potentials \(\theta \) and \(\phi \) we will write \(\theta \preceq \phi \) if \(\theta _\alpha ({\mathbf x}_\alpha ) \le \phi _\alpha ({\mathbf x}_\alpha )\) for all cliques \(\alpha \) and clique states \({\mathbf x}_\alpha \). Hence, any \(\phi \succeq \theta \) is an element-wise upper bound of \(\theta \). Let \(\varPhi \subseteq \{ \phi : \phi \succeq \theta \}\) be a set of potentials upper bounding \(\theta \). The set will be described later in more detail, but in general the potentials \(\phi \in \varPhi \) should also allow much more efficient minimization of the MAP objective \(E_{\text {MAP}}(\cdot ; \phi )\) than \(\theta \). Let

$$\begin{aligned} v^*(\theta ) := \min _{\tau } E_{\text {MAP}}(\tau ; \theta ) \end{aligned}$$
(4)

be the optimal value of interest (i.e. the optimal value for the MAP problem instance we want to solve). We introduce

$$\begin{aligned} \overline{v}(\varPhi ) := \min _{\phi \in \varPhi } \min _{\tau } E_{\text {MAP}}(\tau ; \phi ), \end{aligned}$$
(5)

i.e. the smallest MAP objective obtained by any potential \(\phi \in \varPhi \). Since by assumption \(\phi \succeq \theta \), we have \(v^*(\theta ) \le \overline{v}(\varPhi )\). Thus, the goal is to find a \(\phi ^* \in \varPhi \) such that the optimal value \(\min _{\tau } E_{\text {MAP}}(\tau ; \phi ^*)\) is closest to \(v^*(\theta )\).

Let us rephrase the meaning of \(\overline{v}(\varPhi )\): if \(\varPhi \) contains potentials that permit efficient inference, the task is to determine the \(\phi ^* \in \varPhi \) (and the corresponding primal solution \({\mathbf x}^*\)) whose optimal value is nearest to that of the original problem, \(v^*(\theta )\). In that sense \(\phi ^*\) is the most faithful surrogate for \(\theta \) among the elements of \(\varPhi \). Since \(\overline{v}(\varPhi )\) is defined as the optimal value of a nested optimization problem, the requirement is that the additional cost of optimizing over \(\phi \in \varPhi \) is more than compensated by the efficient computation of \(E_{\text {MAP}}(\cdot ; \phi )\).

Algorithm 1. Generic descent over surrogate potentials \(\phi \in \varPhi \).

A generic descent algorithm that is guaranteed not to increase the maintained estimate \(\hat{v}\) of \(\overline{v}(\varPhi )\) is given in Algorithm 1. The algorithm iteratively probes new potentials \(\phi ' \in \varPhi \) and replaces the current one, \(\phi \), if the optimal value \(\breve{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ')\) is smaller than \(\hat{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi )\). Note that computing \(\hat{v}\) and \(\breve{v}\) will usually be performed indirectly by a message passing or belief propagation algorithm, which determines, iteratively or in closed form, a set of messages certifying optimality of the returned solution. The generic Algorithm 1 leaves many design choices open, which will influence its practical performance (and even tractability):

  • How should \(\varPhi \) be designed? Elements \(\phi \in \varPhi \) need to allow very efficient MAP inference, and they need to be bounded from below by the given potentials \(\theta \). Further, it must be easy to select elements from \(\varPhi \). Below we describe a natural choice for \(\varPhi \) in detail.

  • Given the current value of \(\phi \), how should the next \(\phi ' \in \varPhi \) be selected? In general the choice of \(\phi '\) given \(\phi \) is application specific. In our setting \(\phi '\) will differ from \(\phi \) only in potentials associated with a currently active node s, i.e. \(\phi '_\alpha ({\mathbf x}_\alpha ) = \phi _\alpha ({\mathbf x}_\alpha )\) for all cliques \(\alpha \) not containing s. Thus, the minimization w.r.t. \(\phi \in \varPhi \) is made compositional by successively optimizing only over \((\phi _\alpha )_{\alpha \ni s}\) for a node s currently in focus. As we will see later in Sect. 3, the BP messages needed to compute \(v^*(\phi )\) also provide useful information on how to select \(\phi '\). Further details on the transition \(\phi \leadsto \phi '\) are given in Sects. 3 and 5.

  • How can we efficiently compute \(\breve{v}\)? In Sect. 3 we will show that \(\breve{v}\) can be computed very efficiently if \(\phi \) and \(\phi '\) differ only locally. In this setting most BP messages used to determine \(\hat{v}\) in the min-sum BP algorithm can be reused to compute \(\breve{v}\). In Sect. 3 it is shown that Algorithm 1 (together with our choice of \(\varPhi \) and the transition \(\phi \leadsto \phi '\)) condenses to alternating between locally modifying \(\phi \) and local message passing steps.

The following sections specify more implementation details to convert the abstract meta-algorithm in Algorithm 1 into a practical implementation.
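For reference, the following is a minimal sketch of the meta-algorithm itself; `propose_next` and `map_value` are hypothetical placeholders for the design choices just listed, not part of the implementation described in Sect. 5:

```cpp
#include <functional>

// Sketch of the generic descent method (Algorithm 1). Potentials and the two
// callbacks are application-specific placeholders.
template <class Potentials>
Potentials nested_descent(Potentials phi,  // initial element of Phi
                          std::function<Potentials(const Potentials&)> propose_next,
                          std::function<double(const Potentials&)> map_value,
                          int max_iters) {
  double v_hat = map_value(phi);           // v_hat = min_tau E_MAP(tau; phi)
  for (int it = 0; it < max_iters; ++it) {
    Potentials phi2 = propose_next(phi);   // probe a new element phi' of Phi
    double v_breve = map_value(phi2);      // v_breve = min_tau E_MAP(tau; phi')
    if (v_breve < v_hat) {                 // accept only if the value decreases
      phi = phi2;
      v_hat = v_breve;                     // v_hat is non-increasing by construction
    }
  }
  return phi;                              // best surrogate found in Phi
}
```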

Design of \(\varPhi \) : The requirement is that elements \(\phi \in \varPhi \) allow efficient computation of \(v^*(\phi ) = \min _\tau E_{\text {MAP}}(\tau ; \phi )\) in order for the nested algorithm (Algorithm 1) to be competitive in run-time (compared to directly inferring \(\min _\tau E_{\text {MAP}}(\tau ; \theta )\)). Intuitively, elements in \(\varPhi \) shall be “uninformative” for most (clique) states in order to allow significant runtime and memory savings for computing and representing messages. One choice of \(\varPhi \) that allows very efficient MAP inference is the following: the set \(\varPhi _K\) consists of elements indexed by the Cartesian product of “resident sets” (Footnote 4) \({\mathbf R} = (R_s)_{s \in V}\), where each \(R_s \subseteq \mathcal{L}\) has exactly K elements (with \(K \in \{1, \cdots , |\mathcal{L}|\}\) being a user-specified parameter):

$$\begin{aligned} \varPhi _K {\mathop {=}\limits ^{\text {def}}}\left\{ \theta \left[ {\mathbf R}\right] : \forall s: R_s \subseteq \mathcal{L} : |R_s| = K \right\} \end{aligned}$$
(6)

with

$$\begin{aligned} \theta _\alpha [R_\alpha ]({\mathbf x}_\alpha )&{\mathop {=}\limits ^{\text {def}}}{\left\{ \begin{array}{ll} \theta _\alpha ({\mathbf x}_\alpha ) &{} \text {if } \forall s \in \alpha : x_s \in R_s \\ \overline{\theta }_\alpha &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

and \(\theta \left[ {\mathbf R} \right] {\mathop {=}\limits ^{\text {def}}}(\theta _\alpha [R_\alpha ])_{\alpha }\). Here \(\overline{\theta }_\alpha \) is a value large enough to guarantee \(\phi \succeq \theta \) for all \(\phi \in \varPhi _K\), e.g. \(\overline{\theta }_\alpha = \max _{{\mathbf x}_\alpha } \theta _\alpha ({\mathbf x}_\alpha )\). These values are typically known from the (application-specific) design of \(\theta \).

Since for \(\phi \in \varPhi _K\) all non-resident states have the same potential \(\overline{\theta }_\alpha \), they are indistinguishable. This also implies that we can enforce equality of BP messages for non-resident states, e.g. \(\mu _{\alpha \rightarrow s}(x_s) = \mu _{\alpha \rightarrow s}(x'_s)\) for \(x_s, x_s' \notin R_s\), without affecting the optimal solution. This property can be shown to hold also for the messages \(\lambda \) of the dual objective \(E_{\text {MAP}}^*\) [27]. Overall, the \(|\mathcal L|\)-dimensional message vector (e.g. \(\mu _{\alpha \rightarrow s}(\cdot )\)) can be losslessly compressed into a \((K+1)\)-dimensional message \(\left( \mu _{\alpha \rightarrow s}(x_s) \right) _{x_s \in R_s \cup \{*\}}\), where \(*\) represents all (indistinguishable) non-resident states \(x_s \notin R_s\). Further, message computation is reduced from \(O(|\mathcal L|^{|\alpha |})\) in general to \(O(K^{|\alpha |})\). Both memory and runtime savings can therefore be significant.
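The compressed representation can be sketched as follows (the naming is ours, not taken from the implementation in Sect. 5): K values aligned with the resident set plus one shared value for all non-resident labels, which still permits lossless lookup over the full label space:

```cpp
#include <vector>
#include <cstdio>

// Sketch of a (K+1)-dimensional message for a resident set R_s: one entry
// per resident label plus a shared entry '*' for all non-resident labels.
struct Msg {
  std::vector<int>    resident;  // R_s, K labels
  std::vector<double> val;       // val[i] belongs to resident[i]
  double              star;      // common value of all non-resident labels
  double at(int label) const {   // lossless lookup over the full label space
    for (size_t i = 0; i < resident.size(); ++i)
      if (resident[i] == label) return val[i];
    return star;
  }
};

int main() {
  Msg m{{4, 17, 23}, {0.5, 1.25, 0.0}, 3.0};  // K = 3 resident labels
  std::printf("mu(17) = %.2f, mu(99) = %.2f\n", m.at(17), m.at(99));  // 1.25, 3.00
}
```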

\(\varPhi _K\) has another property that will be useful in practical implementations: given one element \(\phi \in \varPhi _K\), we can create another \(\phi ' \in \varPhi _K\) by modifying one of the resident sets \(R_s\), and every element of \(\varPhi _K\) can be generated by a sequence of such local alterations of resident sets. This property allows us to construct \(\phi '\) from the current potentials \(\phi \) in Algorithm 1 solely by local changes.

3 Nested MAP Inference on Trees

When the underlying graph is a tree, exact MAP inference can be performed efficiently by min-sum belief propagation, which is essentially an instance of dynamic programming. In this section we demonstrate how the message passing steps of min-sum BP can be interleaved with updating the currently active potentials \(\phi \) in order to reduce the target objective \(\overline{v}(\varPhi )\). The generic method (Algorithm 1) is specialized in Algorithm 2 for MAP inference problems on trees.

Algorithm 2. Nested MAP inference on trees.

The current estimates \(\hat{v}\) (line 4) and \(\breve{v}\) (line 7) for \(\overline{v}(\varPhi )\) are efficiently computed and updated via min-sum BP. Note that these values are not actually used in the algorithm (other than for tracking and reporting progress). In order to obtain \(\hat{v}\) (and \(\breve{v}\)) it is only necessary to compute upward messages towards the root. Further, if \(\phi \) and \(\phi '\) differ only in their potentials at the root node s (i.e. \(\phi _\alpha = \phi '_\alpha \) if \(s \notin \alpha \)), then all upward messages other than the ones incoming at s can be reused when computing \(\breve{v}\).

Moreover, when a new root (e.g. \(\hat{s}\)) is selected, only the messages along the (directed) path from s to \(\hat{s}\) need to be recomputed in order to make all upward messages valid again. Hence, if \(\hat{s}\) is a direct neighbor of s (i.e. a child of s), then only the messages on the single edge \(\hat{s} \rightarrow s\) are invalidated and need recomputation. It is sensible to traverse neighboring nodes, since updated potentials at node s have the strongest impact on adjacent nodes. Overall we can state the following: message passing steps updating \(\mu \) and minimization steps w.r.t. \(\phi \) can be interleaved, yielding an efficient implementation of the generic descent method in Algorithm 1. The most important invariant of the method is that upward messages are always consistent with the current potentials \(\phi \). Since \(\hat{v}\) computed in line 4 is non-increasing by construction and lower bounded by \(v^*(\theta )\), we obtain the following

Proposition 1

The sequence of values \(\hat{v}\) generated in line 4 of Algorithm 2 converges.

The algorithm is presented for the particular choice \(\varPhi = \varPhi _K\) with K user-specified. A further design choice is that \(\phi \) and \(\phi '\) differ only in their resident sets \(R_s\) and \(R_s'\), by \(K^+ < K\) elements. By exploiting the particular structure of \(\varPhi \), we can actually test \(\left( {\begin{array}{c}K\\ K-K^+\end{array}}\right) \) elements from \(\varPhi \) in each step simultaneously. For notational brevity we present the algorithm with vanishing unary potentials, i.e. \(\theta _s = \phi _s \equiv 0\). This can always be achieved w.l.o.g. by reparametrizing the original potentials via \(\theta _{st}(x_s, x_t) \leftarrow \theta _{st}(x_s, x_t) + \theta _s(x_s)/\deg (s) + \theta _t(x_t)/\deg (t)\) and \(\theta _s(x_s) \leftarrow 0\) for all \(x_s, x_t\).
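A minimal sketch of this reparametrization for a single edge (s, t) (naming ours; the edge potential is stored as a dense \(L \times L\) table for simplicity):

```cpp
#include <vector>

// Sketch: spread each unary over its deg(s) incident edges, as in
// theta_st(x_s,x_t) += theta_s(x_s)/deg(s) + theta_t(x_t)/deg(t).
void absorb_unaries(std::vector<std::vector<double>>& theta_st,  // theta_st[x_s][x_t]
                    const std::vector<double>& theta_s, int deg_s,
                    const std::vector<double>& theta_t, int deg_t) {
  for (size_t xs = 0; xs < theta_st.size(); ++xs)
    for (size_t xt = 0; xt < theta_st[xs].size(); ++xt)
      theta_st[xs][xt] += theta_s[xs] / deg_s + theta_t[xt] / deg_t;
}
// Call once per edge (s,t). After all edges incident to s are updated, node s
// has contributed deg(s) * theta_s(x_s)/deg(s) = theta_s(x_s) in total, so the
// unary can be set to zero without changing any labeling's energy.
```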

Candidate set \(C_s\) : In line 5 the selection of the \(K^+\) candidate states \(C_s\) is left unspecified. Given that the objective is to lower \(\hat{v}\) as much as possible (and therefore to make \(\breve{v}\) as small as possible), \(C_s\) should contain states with small min-marginals,

$$\begin{aligned} \sum _{t \in ch(s)} \min _{x_t} \left\{ \theta _{st}(x_s, x_t) + \sum \nolimits _{r \in ch(t)} \mu _{r \rightarrow t}(x_t) \right\} . \end{aligned}$$
(8)

This criterion assumes knowledge of the true unary and pairwise potentials associated with (a small number of) candidate states. In our implementations, given an enlarged candidate set \(\tilde{C}_s\) of states, \(C_s\) is obtained by taking the \(K^+\) states from \(\tilde{C}_s\) with the smallest min-marginals. There are many options to generate \(\tilde{C}_s\). In our experiments we determine \(\tilde{C}_s\) by generating one sample for each state in the neighboring resident sets, i.e. \(\tilde{C}_s = \big \{ H(x_t) : x_t \in \bigcup _{t \in N(s)} R_t \big \}\), where H is a stochastic function that aims to model the pairwise interactions, i.e. \({-\!\log } P(H(x_t), x_t) \approx \theta _{st}(H(x_t), x_t)\). Further details of the implementations are given in Sect. 5.
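A sketch of this selection step is given below (naming ours; `H` stands for the stochastic proposal function and `min_marginal` for the evaluation of Eq. 8 at a candidate state):

```cpp
#include <vector>
#include <algorithm>
#include <functional>

// Sketch of the candidate selection in line 5 of Algorithm 2: build an
// enlarged set C~_s from proposals H(x_t) over the neighboring resident sets,
// then keep the K+ states with the smallest min-marginals (Eq. 8).
std::vector<int> select_candidates(
    const std::vector<std::vector<int>>& neighbor_resident_sets,  // R_t, t in N(s)
    std::function<int(int)> H,                // stochastic proposal function
    std::function<double(int)> min_marginal,  // evaluates Eq. 8 at a state
    int K_plus) {
  std::vector<int> C_tilde;
  for (const auto& R_t : neighbor_resident_sets)
    for (int x_t : R_t) C_tilde.push_back(H(x_t));
  // Deduplicate proposals (a practical detail, not prescribed by the text).
  std::sort(C_tilde.begin(), C_tilde.end());
  C_tilde.erase(std::unique(C_tilde.begin(), C_tilde.end()), C_tilde.end());
  // Rank by min-marginal and keep the K+ best candidates.
  std::sort(C_tilde.begin(), C_tilde.end(),
            [&](int a, int b) { return min_marginal(a) < min_marginal(b); });
  if ((int)C_tilde.size() > K_plus) C_tilde.resize(K_plus);
  return C_tilde;
}
```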

Limit points: While Algorithm 2 is guaranteed to be non-increasing in its estimate \(\hat{v}\), this does not imply that \(\hat{v} \rightarrow \overline{v}(\varPhi )\). It will be shown below that e.g. iterated conditional modes is an instance of our framework, hence a poor limit point could be returned by Algorithm 2. In our experiments we found that setting \(K \ge 3\) is sufficient to produce results that are often indistinguishable from the true MAP solution.

Connection to move-making methods: At this point we are able to show that move-making algorithms for MAP inference such as ICM (iterated conditional modes [2]), graph-cuts [4] and fusion moves [13] (or any other primal move-making framework) are instances of the generic method in Algorithm 1. Since these algorithms iteratively update a single primal label assignment, the set of potential vectors is given by a variation of \(\varPhi _1\),

$$\begin{aligned} \varPhi _1^\infty := \bigg \{ \phi&:\phi _\alpha ({\mathbf x}_\alpha ) = {\left\{ \begin{array}{ll} \theta _\alpha ({\mathbf x}_\alpha ) &{} \text {if } \forall s \in \alpha : x_s \in R_s \\ \infty &{} \text {otherwise} \end{array}\right. },\,R_s \subseteq \mathcal{L} : |R_s| = 1 \bigg \}, \end{aligned}$$
(9)

where there is exactly one resident state per node (the best solution found so far). The upper bound \(\overline{\theta }_\alpha \) is always set to \(\infty \). The transition from \(\phi \) to \(\phi '\) in all methods is done by first augmenting \(\phi \) to \(\hat{\phi }\) by allowing a larger resident set (adding all potentials for a single node in ICM, adding a constant state at each node in graph-cuts, and expanding the resident set by a global proposal in fusion moves), followed by a “projection” of \(\hat{\phi }\) to an element \(\phi ' \in \varPhi _1^\infty \) that decreases the MAP objective. In graph-cuts and fusion moves this projection to find the optimal \(\phi ' \in \varPhi _1^\infty \) is the computationally expensive step. Since \(K=1\) and \(\overline{\theta }= \infty \), the messages are uninformative, and the determination of \(\phi '\) is therefore agnostic to any non-resident state.

In light of these observations, one can interpret Algorithm 2 as a variant of ICM with the following important differences: the resident sets contain more than one element, and by maintaining meaningful messages for non-resident states, sensible, finite estimates of the values (i.e. min-marginals) of even non-resident states are available. In contrast to other, richer move-making algorithms, the proposed method requires only local proposals.

4 Nested MAP Inference on Loopy Graphs

Most MAP instances are defined on graphs containing cycles, most prominently the 4- or 8-neighborhood lattices used in computer vision applications. Algorithm 2 is not directly applicable to loopy graphs, since it relies crucially on the properties of belief propagation on a tree. There are two natural ways to extend the algorithm to general graphs, which are described in the following.

4.1 Limited Memory SGM

One straightforward option is to use a fixed tree decomposition, run Algorithm 2 independently on the trees, and extract the desired label assignment using the obtained min-marginals. This approach can be seen as a limited-memory version of a semi-global method [8], which itself can be interpreted as one iteration of TRW message passing [5]. Let \(\mathcal{T}\) be a tree decomposition such that each \(T \in \mathcal{T}\) is a tree over \(\mathcal E\), and let \(\{\theta ^T\}\) be potentials such that \(\sum _{T \in \mathcal T} \theta ^T = \theta \). For each \(T\in \mathcal T\) let \(\hat{v}^T\) be the sequence of estimates for \(v^*(\theta ^T)\) maintained by Algorithm 2; then in light of Proposition 1 we conclude

Proposition 2

The quantity \(\sum _{T \in \mathcal T} \hat{v}^T\) is convergent.

In contrast to standard tree decompositions for MAP inference, the value of \(\sum _{T \in \mathcal T} \hat{v}^T\) is not necessarily a lower bound for \(v^*(\theta )\), since the attained surrogate potentials \(\phi ^T\) may be strictly larger than \(\theta ^T\). In the supplementary material [28] it is shown how to obtain a (rather loose) upper bound for \(v^*(\theta )\).
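For concreteness, the label extraction at a node can be sketched as follows (naming ours; the per-tree min-marginals are assumed to have been expanded to the full label space):

```cpp
#include <vector>
#include <limits>

// Sketch of the LM-SGM label extraction at one node: sum the min-marginals
// obtained by Algorithm 2 on each tree T and pick the label with the smallest
// aggregated cost, as in semi-global matching.
int extract_label(const std::vector<std::vector<double>>& min_marg_per_tree,
                  int num_labels) {
  int best = 0;
  double best_cost = std::numeric_limits<double>::infinity();
  for (int x = 0; x < num_labels; ++x) {
    double cost = 0.0;
    for (const auto& m_T : min_marg_per_tree) cost += m_T[x];  // sum over trees
    if (cost < best_cost) { best_cost = cost; best = x; }
  }
  return best;  // label assigned at this node
}
```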

Figure 2 illustrates results for dense stereo and optical flow using horizontal, vertical and diagonal chains as suggested in [8]. We refer to Sect. 5.1 for a description of the employed unary and pairwise potentials. SGM relies on informative data terms to reach agreement among the tree solutions in one step, which explains why, in our experiments, it performed poorly for image completion tasks with their weak data terms (discussed in Sect. 5.2). For dense correspondence problems with relatively informative unary potentials, the MAP solutions obtained by limited memory SGM are qualitatively similar to the results returned by the method described in the following.

Fig. 2. Limited memory semi-global method (LM-SGM) applied on \(900 \times 750\) stereo images (a, c) and on optical flow (e, f). We compare to SGM over the full label space (b, d). Best viewed on screen.

4.2 Loopy Message Passing

Often a static tree decomposition does not yield satisfactory results for MAP problems, and proper inference on graphs with cycles is required. In such scenarios computing \(\hat{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ) = v^*(\phi )\) and \(\breve{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ') = v^*(\phi ')\) exactly in general requires an iterative method that converges to the desired value, even for “efficient” potentials from \(\varPhi _K\). Many methods to (approximately) solve \(\min _\tau E_{\text {MAP}}(\tau ; \phi )\) strongly resemble min-sum belief propagation (e.g. [6, 7, 10, 25]), and therefore it is natural to consider respective adaptations of Algorithm 2: in short, the messages \(\mu \) in Algorithm 2 are replaced by the reparametrizations \(\lambda \) appearing in the dual program \(E_{\text {MAP}}^*(\lambda ; \theta )\) (or one of its variants), and the message passing steps (e.g. in line 11) are replaced by the respective message updates derived from the dual program. The modification of Algorithm 2 to loopy graphs is given in Algorithm 3 below.

Algorithm 3. Nested MAP inference on loopy graphs.

If we go back to Algorithm 1, the main difficulty in applying it to loopy graphs is that only lower and upper bounds for \(\hat{v}\) and \(\breve{v}\) are easily available via convex duality: the current dual objective is always a lower bound for the optimal value, and an upper bound is available by extracting a primal solution \(\tilde{{\mathbf x}}(\lambda )\) for given messages \(\lambda \), i.e. \(E_{\text {MAP}}(\tilde{{\mathbf x}}(\lambda ); \phi ) \ge v^*(\phi )\). Assume we have a lower bound \(\hat{v}_{\text {LB}}\) for \(\hat{v} = v^*(\phi )\) and an upper bound \(\breve{v}_{\text {UB}}\) for \(\breve{v} = v^*(\phi ')\). If \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\), then from the chain of inequalities \(v^*(\phi ') \le \breve{v}_{\text {UB}} < \hat{v}_{\text {LB}} \le v^*(\phi )\) we conclude that \(v^*(\phi ') < v^*(\phi )\). Thus, we obtain the following proposition:

Proposition 3

If \(\phi \leftarrow \phi '\) is only executed if \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\) in line 5 of Algorithm 1, then the sequence \(\hat{v}\) generated by the algorithm is non-increasing.

Unfortunately the condition \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\) is a very strong one and requires the duality gap \(E_{\text {MAP}}(\tilde{{\mathbf x}}(\lambda ); \phi ) - E_{\text {MAP}}^*(\lambda ; \phi )\) to be extremely small in order to be satisfied. Consequently, these theoretically justified extensions of Algorithms 1 and 2 to loopy graphs do not perform satisfactorily in practice.

Hence, we propose to use the dual objectives \(E_{\text {MAP}}^*(\cdot ; \phi )\) and \(E_{\text {MAP}}^*(\cdot ; \phi ')\) as surrogates for \(v^*(\phi )\) and \(v^*(\phi ')\), respectively. Since we use lower bounds for both \(\hat{v}\) and \(\breve{v}\), monotonic behavior of \(\hat{v} = v^*(\phi )\) is not guaranteed. In practice, \(\hat{v}\) decreases largely monotonically, showing only minor oscillations (as shown in the supplementary material; Footnote 5). In some applications (most notably image completion in Sect. 5.2) we observed a rather large duality gap between the dual objective \(\hat{v}\) and the value of the extracted primal solution. In this case we run a standard TRW-S-like message passing algorithm to lower the primal-dual gap after Algorithm 2 has completed a full traversal of the nodes in the graph. In future work we aim to obtain stronger estimates for \(v^*( \phi ')\) via tree-block updates [23].
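The two acceptance tests can be summarized as follows (a sketch with our naming; the bounds are assumed to be computed from the current dual variables and extracted primal solutions):

```cpp
// Sketch of the acceptance tests on loopy graphs. The safe test from
// Proposition 3 compares an upper bound for v*(phi') against a lower bound
// for v*(phi); the relaxed test used in practice compares the dual
// objectives E*_MAP directly.
bool accept_safe(double v_breve_UB, double v_hat_LB) {
  return v_breve_UB < v_hat_LB;  // implies v*(phi') < v*(phi), but rarely fires
}
bool accept_relaxed(double dual_phi_new, double dual_phi_cur) {
  return dual_phi_new < dual_phi_cur;  // surrogate test; monotonicity not guaranteed
}
```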

5 Results

In this section we show approximate MAP solutions obtained by our method in two very different application scenarios: the first is dense correspondence estimation (including stereo disparity estimation and optical flow), and the second is image completion via a random field model. In both settings the underlying graph is a 4-connected grid. We use the loopy extension of Algorithm 2 in our experiments. We fix the resident set size K to 5 in all applications, hence the memory consumption required for the messages is \((5+1) \times 4 = 24\) floating point values per node (one \((K{+}1)\)-dimensional message per incident edge). The algorithm is implemented in straightforward C++ with OpenMP enabled, and the runtimes are reported for a dual Xeon E5-2690 system with 64 GB of main memory. GPU acceleration is not employed.

5.1 Dense Correspondence

We demonstrate results on dense correspondence problems from the Middlebury benchmark datasets [1, 19]. Note that the goal is not to propose a competing new model for dense correspondence estimation, but to demonstrate the effectiveness of our method for inferring MAP estimates in these problems.

Dense disparity estimation: For the chosen dense disparity instances the state space has 120 or 240 elements (corresponding to integral disparities). The data term (unary potential) attains values in [0, 1] and is given by \( \max \left\{ 0, \frac{1-\text {ZNCC}}{2} \right\} \), where ZNCC is the zero-mean NCC of \(5\times 5\) gray-scale image patches. We use the 3-way \(P_1\)-\(P_2\) pairwise potential from [8] (with \(P_1=1/2\) and \(P_2=1\)). The sampling function H to determine \(\tilde{C}_s\) is given as follows: from ground truth disparity maps we estimate the relative frequencies of the events \(x_s = x_t\) (\({\approx }94\%\)), \(x_s = x_t \pm 1\) (\({\approx }5.8\%\)), and \(|x_s-x_t|\ge 2\) (\({\approx }0.2\%\)) for neighboring pixels s, t; hence for a random variable \(u \sim U[0,1]\) we have

$$\begin{aligned} H(x;u)&= {\left\{ \begin{array}{ll} x &{} \text {if } u< 0.94 \\ x\pm 1 &{} \text {if } u < 0.998 \\ U[x_{min}, x_{max}] &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(10)
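A sketch of this sampler follows; the 50/50 split between the +1 and −1 jumps is an assumption on our part, as Eq. 10 does not specify it:

```cpp
#include <random>
#include <algorithm>

// Sketch of the proposal function H from Eq. 10. The +1/-1 split is assumed
// to be 50/50, which the text does not specify.
int sample_H(int x, int x_min, int x_max, std::mt19937& rng) {
  std::uniform_real_distribution<double> U01(0.0, 1.0);
  double u = U01(rng);
  int y;
  if (u < 0.94)       y = x;                                   // keep the disparity
  else if (u < 0.998) y = x + (U01(rng) < 0.5 ? 1 : -1);       // jump by one
  else {                                                       // uniform re-draw
    std::uniform_int_distribution<int> U(x_min, x_max);
    y = U(rng);
  }
  return std::min(std::max(y, x_min), x_max);                  // clamp to label range
}
```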

Figure 3 shows the evolution of the attained primal objective with respect to wall clock time for a full scale convex BP implementation and two variants of the proposed method: the first uses a non-hierarchical (“flat”) approach and initializes \(R_s\) using uniform sampling, and the second uses a multi-scale framework (with 5 octaves), where one element of the initial \(R_s\) is fixed to the label propagated from the coarser level. Clearly, our method reaches a lower energy much faster than full-scale BP (which also requires about 20–40 times more memory). Figure 1(a)–(d) shows the corresponding labeling results returned by the hierarchical approach after 2, 4, 8, and 16 passes. Further results are available in the supplementary material [28].

Fig. 3. Evolution of primal E and dual \(E_{\text {MAP}}^*\) energies for dense disparity estimation with respect to wall time. \(E_{\text {MAP}}\) is the converged primal energy of full scale BP.

Fig. 4. Evolution of primal energies for dense optical flow problems.

Optical flow: We ran similar numerical experiments for optical flow instances [1]. The state space contains \(129^2\) flow vectors corresponding to a 16 pixel search range at quarter-pixel accuracy (\(2 \cdot 16 \cdot 4 + 1 = 129\) values per component). We upscale the original grayscale images to 400% and use the same ZNCC-based score as for dense stereo (but computed on \(7\times 7\) patches from the upscaled images). The pairwise smoothness term is the \(P_1\)-\(P_2\) model applied separately to the horizontal and vertical components of the motion vector. The decrease in primal energy for the solution returned after the respective number of passes with respect to wall clock time is shown in Fig. 4. In this case the memory consumption is \(6/129^2\), or less than \(0.04\%\), of that of running full inference (Footnote 6), and usable motion fields are obtained after about one second of CPU time. Visualizations of the corresponding flow fields are depicted in Fig. 1(f)–(i) and in the supplementary material [28].

Fig. 5. Masked source images and image completion results. [Best viewed on screen.]

5.2 Image Completion

Dense correspondence problems can usually rely on relatively strong unary potentials (image matching scores), and the role of the pairwise potentials is to provide a regularizer where the unary potential leads to ambiguous estimates. Hence, we show in this section results on the very complementary problem of image completion, where unary potentials are only available near the border of the missing region, and the labels assigned in the interior rely entirely on the pairwise potentials to avoid visual artefacts. We use a similar framework and settings as proposed in [11], where the image completion problem was formulated as a MAP inference problem with the state space being patches extracted from the known parts of the given image. Both the unary and pairwise potentials for this problem are non-parametric and instance dependent. For all results in Fig. 5 we used \(13 \times 13\) patches to compute the unary potentials (compatibility with known pixels) and pairwise potentials (compatibility where patches assigned to adjacent nodes overlap). As in [11] the region to be filled is represented by a subsampled grid in the inference problem such that adjacent patches have approximately 50% overlap. Depending on the provided image, the size of the state space varies between 10,000 and 50,000 patches available for image completion. The sampling function H proposes the appropriately shifted patch in the source image (60%), one of the 5 nearest neighbors in patch appearance (20%), and a random source image patch (20%). The results shown in Fig. 5 are obtained after running 100 passes of the loopy min-max inference method, which takes about 10 s using a straightforward OpenMP accelerated C++ implementation. We did not investigate further speedups such as the FFT-based acceleration suggested in [11]. The resulting images are generated by simply averaging pixels covered by multiple patches, and are qualitatively comparable to the ones reported in [11].
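For illustration, the patch proposal distribution can be sketched as follows (naming ours; `shifted` and `knn` are assumed precomputed lookup tables, which the text does not detail):

```cpp
#include <random>
#include <vector>

// Sketch of the patch proposal H for image completion: with probability 0.6
// propose the appropriately shifted source patch, with 0.2 one of the 5
// nearest neighbors in patch appearance, with 0.2 a random source patch.
int propose_patch(int current, const std::vector<int>& shifted,
                  const std::vector<std::vector<int>>& knn,
                  int num_patches, std::mt19937& rng) {
  std::uniform_real_distribution<double> U01(0.0, 1.0);
  double u = U01(rng);
  if (u < 0.6) return shifted[current];               // shifted source patch
  if (u < 0.8) {
    std::uniform_int_distribution<int> pick(0, 4);    // one of 5 nearest neighbors
    return knn[current][pick(rng)];
  }
  std::uniform_int_distribution<int> any(0, num_patches - 1);
  return any(rng);                                    // random source patch
}
```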

6 Conclusion and Future Work

In this work we formulate resource-efficient MAP inference as a joint optimization problem over messages and over upper bounding potentials that allow efficient inference. We show that this nested min-max problem can be efficiently minimized on trees, yielding a convergent algorithm, and that its loopy extension also returns convincing MAP solutions in practice. Future work will address (i) theoretical aspects such as a better understanding of the method on loopy graphs, and (ii) practical aspects such as the impact of diverse [16] and non-local [14] label proposals.