1 Introduction

Maximum a-posteriori (MAP) inference in graphical models, and especially in random fields defined over image domains, is one of the most useful tools in computer vision and related fields. If all potentials defining the objective are of parametric (or, better, convex) shape, non-linear optimization is sometimes the method of choice for best efficiency. On the other hand, if the potentials are of non-parametric shape (or highly non-convex), then well-established methods such as loopy belief propagation (BP) or its convex variants are the methods of choice. BP and related algorithms are highly successful and relatively efficient if the underlying state space is rather small, but they suffer from two limitations in the case of large state spaces. First, the intrinsic message passing step has at least linear, and generally superlinear, run-time complexity in the state space size, so the runtime of these methods does not scale well with the number of states; moreover, even computing all terms (potentials) needed to define the problem instance can already be prohibitive. Second, the memory consumption grows linearly with the state space size, since belief propagation requires maintaining messages for each state. These shortcomings of BP are well known, and several approaches proposed in the literature (reviewed in Sect. 1.1) attempt to overcome these limitations.

Fig. 1. (a)–(d) Extracted primal solutions for a stereo correspondence problem after 2, 4, 8 and 16 passes over the image. (e) Primal solution obtained by full scale MAP inference. (f)–(i) Extracted primal solutions for optical flow problems after 2, 4, 8 and 16 passes. (j) Result after 256 iterations (22 s) as approximation to intractable full scale MAP inference. [Best viewed on screen.]

In this work we derive an algorithm similar in spirit to particle belief propagation (PBP, and its more recent flavors) by considering the following nested optimization problem: for a given inference problem, determine the surrogate problem (represented by its potentials) among the set of “easy” problems (for which MAP inference is efficient) that is most similar to the given one. This is a nested problem since both the surrogate problem and the MAP solution need to be determined jointly. One contribution of this work is that the joint search over potentials and MAP solutions can be carried out efficiently, and the resulting algorithm bears a striking resemblance to PBP by alternating state selection and message passing steps. In contrast to PBP, which is rooted in approximating intractable distributions, our method aims directly at finding an accurate MAP solution.

We also show that our proposed method is always convergent when applied to problems defined on tree graphs, and experimentally we found convergent behavior of its loopy extension on lattices. To our knowledge no such guarantee is available for any version of particle-based BP.

We illustrate the performance of the method in Fig. 1 on dense correspondence problems, showing the evolution of the label assignments returned by our method after the given numbers of passes over the image. It can be seen that after 16 passes (<2 s CPU runtime) the label assignment has essentially converged. More important than the run-time is the memory consumption: in the optical flow scenario our method requires about 32 MB (which compares to 16 GB, or 694 MB if the data terms are computed on the fly, required by a message passing method exploiting the specific product label structure [20]). This also makes our approach appealing for embedded and mobile implementations.

1.1 Related Work

There is a huge amount of literature on MAP inference algorithms and efficient implementations, and we focus on methods specifically addressing large state spaces. A number of approaches approximate the full-scale messages required in belief propagation by sets of particles to cope with large or even continuous state spaces, e.g. [3, 12, 14, 16, 17, 24]. These methods augment MAP inference with a stochastic particle selection step to approximate otherwise intractable distributions (or message functions) arising in the propagation step. They differ largely in how new candidate states (particles) are sampled and filtered, and some approaches are specifically designed for dense correspondence estimation. While these methods show good performance in practice, only a few (and often weak) quantitative results are known for message passing in a stochastic context (e.g. [9, 15, 16]).

“Active MAP inference” [18] explicitly addresses the case when the computation of unary potentials is the main run-time bottleneck: unary potentials can be queried during MAP inference, and states selected for instantiating the unary potentials are ranked based on aggregated belief. Ranking “unseen” states requires repeated MAP inference. In contrast, particle max-product methods (and our approach) interleave MAP inference and candidate state scoring and are therefore very efficient for large state spaces. A different nested approach for MAP inference with large or continuous state spaces is considered in [17, 27], but in contrast to this work the surrogate problems are constructed to be lower bounds for the target inference problem. [21] mentions, but does not explore further, a primal “majorize-minimize” framework, which shares the use of majorizing potentials (albeit in a different context) with our approach (Footnote 1).

1.2 Background

This section presents some material on MAP inference and the underlying linear programming relaxation. A labeling or MAP inference problem consists of determining the optimal label \(x_s \in \mathcal{L}\) assigned to each node \(s \in \mathcal V\), where the objective is a sum of unary and pairwise terms (Footnote 2),

$$\begin{aligned} {\mathbf x}^*&{\mathop {=}\limits ^{\text {def}}}\arg \min _{{\mathbf x}} \sum _{s \in \mathcal V} \theta _s(x_s) + \sum _{(s,t) \in \mathcal{E}} \theta _{st}(x_s, x_t) = \arg \min _{{\mathbf x}} \sum _\alpha \theta _\alpha ({\mathbf x}_\alpha ), \end{aligned}$$
(1)

where \({\mathbf x} = (x_s)_{s \in \mathcal V} \in \mathcal{L}^{|\mathcal{V}|}\), and \(\mathcal E\) is a problem-specific set of undirected edges. \(\theta \) denotes the potentials or costs for assigning particular states to nodes and edges. We will use \(\alpha \in \mathcal{V} \cup \mathcal{E}\) to index unary and pairwise cliques in a unified way. The above label assignment problem is generally intractable, and one highly successful approach to solving it approximately is to employ the corresponding linear programming (LP) relaxation (see e.g. [26]; Footnote 3),

$$\begin{aligned} E_{\text {MAP}}(\tau ; \theta )&{\mathop {=}\limits ^{\text {def}}}\sum _{\alpha , {\mathbf x}_\alpha } \theta _\alpha ({\mathbf x}_\alpha ) \tau _\alpha ({\mathbf x}_\alpha ) \qquad \text {s.t. } \tau _s(x_s) = \sum _{{\mathbf x}_\alpha \setminus x_s} \tau _\alpha ({\mathbf x}_\alpha ), \end{aligned}$$
(2)

together with \(\sum _{x_s} \tau _s(x_s) = 1\) and \(\tau _\alpha ({\mathbf x}_\alpha ) \ge 0\). The unknowns \(\{\tau _s\}_{s \in \mathcal V}\) and \(\{\tau _{st}\}_{(s,t) \in \mathcal E}\) are “one-hot” encodings of the assigned labels: if \(\tau ^*\) is the optimal solution of \(E_{\text {MAP}}\) and the relaxation is tight, then \(\tau _s(x_s)\) is 1 iff state \(x_s\) is the optimal label at node s, and 0 otherwise. The constraints linking \(\tau _s\) and \(\tau _\alpha \) are usually called marginalization constraints, and the unit sum constraint is typically referred to as the normalization constraint. The linear program in Eq. 2 is not unique, since redundant non-negativity and normalization constraints can be added to \(E_{\text {MAP}}\) without affecting the optimal solution or value. Consequently, different duals are considered in the literature. A particular dual program is the following,

$$\begin{aligned} E_{\text {MAP}}^*(\lambda ; \theta )&= \sum _\alpha \min _{{\mathbf x}_\alpha } \left\{ \theta _\alpha ({\mathbf x}_\alpha ) - \sum \nolimits _{s \in \alpha } \lambda _{\alpha \rightarrow s}(x_s) \right\} \end{aligned}$$
(3)

subject to \(\sum _{\alpha \ni s} \lambda _{\alpha \rightarrow s}(x_s) = 0\) for all s and \(x_s\). Recall that cliques \(\alpha \) range over nodes and edges. The dual unknowns \(\{\lambda _{\alpha \rightarrow s}\}_{s \in \mathcal{V}, \alpha \ni s}\) are the Lagrange multipliers for the marginalization constraints, which are often termed “messages” [7, 22, 23, 26].
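To make the quantities in Eqs. 1–3 concrete, the following minimal sketch (a toy example with invented potentials, not the implementation described in Sect. 5) computes the MAP value of Eq. 1 on a chain by min-sum belief propagation, where the upward message aggregates the best cost of the already processed subchain:

```cpp
#include <cstdio>
#include <vector>
#include <algorithm>
#include <limits>

// Minimal min-sum BP on a chain s = 0..N-1 with L labels (a sketch only).
// mu[s][x] = cost of the best labeling of nodes 0..s-1, including the edge
// into node s, given that node s takes label x.
int main() {
  const int N = 4, L = 3;
  // Illustrative potentials: unary theta_s(x) and Potts-like pairwise costs.
  auto unary = [](int s, int x) { return double((s + x) % 3); };
  auto pair  = [](int xs, int xt) { return xs == xt ? 0.0 : 1.0; };

  std::vector<std::vector<double>> mu(N, std::vector<double>(L, 0.0));
  for (int s = 1; s < N; ++s)
    for (int x = 0; x < L; ++x) {
      double best = std::numeric_limits<double>::infinity();
      for (int xp = 0; xp < L; ++xp)   // minimize over the previous label
        best = std::min(best, mu[s-1][xp] + unary(s-1, xp) + pair(xp, x));
      mu[s][x] = best;
    }
  // MAP value: minimize the belief at the last node (the "root" of the chain).
  double vstar = std::numeric_limits<double>::infinity();
  for (int x = 0; x < L; ++x)
    vstar = std::min(vstar, mu[N-1][x] + unary(N-1, x));
  std::printf("MAP value v* = %.1f\n", vstar);
}
```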

2 A Nested Approach to MAP Inference

This section describes our framework designed for MAP inference in large state spaces. First, we present a generic, high-level view, which will be made more concrete in subsequent sections to obtain efficiently implementable algorithms.

For given potentials \(\theta \) and \(\phi \) we will write \(\theta \preceq \phi \) if \(\theta _\alpha ({\mathbf x}_\alpha ) \le \phi _\alpha ({\mathbf x}_\alpha )\) for all cliques \(\alpha \) and clique states \({\mathbf x}_\alpha \). Hence, any \(\phi \succeq \theta \) is an element-wise upper bound of \(\theta \). Let \(\varPhi \subseteq \{ \phi : \phi \succeq \theta \}\) be a set of potentials upper bounding \(\theta \). The set will be described later in more detail, but in general the potentials \(\phi \in \varPhi \) should also allow much more efficient minimization of the MAP objective \(E_{\text {MAP}}(\cdot ; \phi )\) than \(\theta \). Let

$$\begin{aligned} v^*(\theta ) := \min _{\tau } E_{\text {MAP}}(\tau ; \theta ) \end{aligned}$$
(4)

be the optimal value of interest (i.e. the optimal value for the MAP problem instance we want to solve). We introduce

$$\begin{aligned} \overline{v}(\varPhi ) := \min _{\phi \in \varPhi } \min _{\tau } E_{\text {MAP}}(\tau ; \phi ), \end{aligned}$$
(5)

i.e. the smallest MAP objective obtained by any potential \(\phi \in \varPhi \). Since by assumption \(\phi \succeq \theta \), we have \(v^*(\theta ) \le \overline{v}(\varPhi )\). Thus, the goal is to find a \(\phi ^* \in \varPhi \) such that the optimal value \(\min _{\tau } E_{\text {MAP}}(\tau ; \phi ^*)\) is closest to \(v^*(\theta )\).

Let us rephrase the meaning of \(\overline{v}(\varPhi )\): if \(\varPhi \) contains potentials that permit efficient inference, the task is to determine the \(\phi ^* \in \varPhi \) (and the corresponding primal solution \({\mathbf x}^*\)) whose optimal value is nearest to that of the original problem, \(v^*(\theta )\). In that sense \(\phi ^*\) is the most faithful surrogate for \(\theta \) among the elements of \(\varPhi \). Since \(\overline{v}(\varPhi )\) is defined as the optimal value of a nested optimization problem, the requirement is that the additional cost of optimizing over \(\phi \in \varPhi \) is more than compensated by the efficient computation of \(E_{\text {MAP}}(\cdot ; \phi )\).

Algorithm 1. Generic descent over surrogate potentials \(\phi \in \varPhi \).

A generic descent algorithm that is guaranteed not to increase the maintained estimate \(\hat{v}\) of \(\overline{v}(\varPhi )\) is given in Algorithm 1. The algorithm iteratively probes new potentials \(\phi ' \in \varPhi \) and replaces the current one, \(\phi \), if the optimal value \(\breve{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ')\) is smaller than \(\hat{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi )\). Note that computing \(\hat{v}\) and \(\breve{v}\) will usually be performed indirectly by a message passing or belief propagation algorithm, which determines, iteratively or in closed form, a set of messages certifying optimality of the returned solution. The generic Algorithm 1 leaves many design choices open, which will influence its practical performance (and even tractability):

  • How should \(\varPhi \) be designed? Elements \(\phi \in \varPhi \) need to allow very efficient MAP inference, and they need to be bounded from below by the given potentials \(\theta \). Further, it must be easy to select elements from \(\varPhi \). Below we describe a natural choice for \(\varPhi \) in detail.

  • Given the current value of \(\phi \), how should the next \(\phi ' \in \varPhi \) be selected? In general the choice of \(\phi '\) given \(\phi \) is application specific. In our setting \(\phi '\) will differ from \(\phi \) only in potentials associated with a currently active node s, i.e. \(\phi '_\alpha ({\mathbf x}_\alpha ) = \phi _\alpha ({\mathbf x}_\alpha )\) for all cliques \(\alpha \) not containing s. Thus, the minimization w.r.t. \(\phi \in \varPhi \) is made compositional by successively optimizing only over \((\phi _\alpha )_{\alpha \ni s}\) for a node s currently in focus. As we will see later in Sect. 3, the BP messages needed to compute \(v^*(\phi )\) also provide useful information on how to select \(\phi '\). Further details on the transition \(\phi \leadsto \phi '\) are given in Sects. 3 and 5.

  • How can we efficiently compute \(\breve{v}\)? In Sect. 3 we will show that \(\breve{v}\) can be computed very efficiently if \(\phi \) and \(\phi '\) differ only locally. In this setting most BP messages used to determine \(\hat{v}\) in the min-sum BP algorithm can be reused to compute \(\breve{v}\). In Sect. 3 it is shown that Algorithm 1 (together with our choice of \(\varPhi \) and the transition \(\phi \leadsto \phi '\)) condenses to alternating between locally modifying \(\phi \) and local message passing steps.

The following sections specify more implementation details to convert the abstract meta-algorithm in Algorithm 1 into a practical implementation.
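For reference, the following is a minimal sketch of the meta-algorithm itself; `propose_next` and `map_value` are hypothetical placeholders for the design choices just listed, not part of the implementation described in Sect. 5:

```cpp
#include <functional>

// Sketch of the generic descent method (Algorithm 1). Potentials and the two
// callbacks are application-specific placeholders.
template <class Potentials>
Potentials nested_descent(Potentials phi,  // initial element of Phi
                          std::function<Potentials(const Potentials&)> propose_next,
                          std::function<double(const Potentials&)> map_value,
                          int max_iters) {
  double v_hat = map_value(phi);           // v_hat = min_tau E_MAP(tau; phi)
  for (int it = 0; it < max_iters; ++it) {
    Potentials phi2 = propose_next(phi);   // probe a new element phi' of Phi
    double v_breve = map_value(phi2);      // v_breve = min_tau E_MAP(tau; phi')
    if (v_breve < v_hat) {                 // accept only if the value decreases
      phi = phi2;
      v_hat = v_breve;                     // v_hat is non-increasing by construction
    }
  }
  return phi;                              // best surrogate found in Phi
}
```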

Design of \(\varPhi \) : The requirement is that elements \(\phi \in \varPhi \) allow efficient computation of \(v^*(\phi ) = \min _\tau E_{\text {MAP}}(\tau ; \phi )\) in order for the nested algorithm (Algorithm 1) to be competitive in run-time (compared to directly inferring \(\min _\tau E_{\text {MAP}}(\tau ; \theta )\)). Intuitively, elements in \(\varPhi \) shall be “uninformative” for most (clique) states in order to allow significant runtime and memory savings for computing and representing messages. One choice of \(\varPhi \) that allows very efficient MAP inference is the following: the set \(\varPhi _K\) consists of elements indexed by the Cartesian product of “resident sets” (Footnote 4) \({\mathbf R} = (R_s)_{s \in V}\), where each \(R_s \subseteq \mathcal{L}\) has exactly K elements (with \(K \in \{1, \cdots , |\mathcal{L}|\}\) being a user-specified parameter):

$$\begin{aligned} \varPhi _K {\mathop {=}\limits ^{\text {def}}}\left\{ \theta \left[ {\mathbf R}\right] : \forall s: R_s \subseteq \mathcal{L} : |R_s| = K \right\} \end{aligned}$$
(6)

with

$$\begin{aligned} \theta _\alpha [R_\alpha ]({\mathbf x}_\alpha )&{\mathop {=}\limits ^{\text {def}}}{\left\{ \begin{array}{ll} \theta _\alpha ({\mathbf x}_\alpha ) &{} \text {if } \forall s \in \alpha : x_s \in R_s \\ \overline{\theta }_\alpha &{} \text {otherwise} \end{array}\right. } \end{aligned}$$
(7)

and \(\theta \left[ {\mathbf R} \right] {\mathop {=}\limits ^{\text {def}}}(\theta _\alpha [R_\alpha ])_{\alpha }\). Here \(\overline{\theta }_\alpha \) is a value large enough to guarantee \(\phi \succeq \theta \) for all \(\phi \in \varPhi _K\), e.g. \(\overline{\theta }_\alpha = \max _{{\mathbf x}_\alpha } \theta _\alpha ({\mathbf x}_\alpha )\). These values are typically known from the (application-specific) design of \(\theta \).

Since for \(\phi \in \varPhi _K\) all non-resident states have the same potential \(\overline{\theta }_\alpha \), they are indistinguishable. This also implies that we can enforce equality of BP messages for non-resident states, e.g. \(\mu _{\alpha \rightarrow s}(x_s) = \mu _{\alpha \rightarrow s}(x'_s)\) for \(x_s, x_s' \notin R_s\), without affecting the optimal solution. This property can be shown to hold also for the messages \(\lambda \) of the dual objective \(E_{\text {MAP}}^*\) [27]. Overall, the \(|\mathcal L|\)-dimensional message vector (e.g. \(\mu _{\alpha \rightarrow s}(\cdot )\)) can be losslessly compressed into a \((K+1)\)-dimensional message \(\left( \mu _{\alpha \rightarrow s}(x_s) \right) _{x_s \in R_s \cup \{*\}}\), where \(*\) represents all (indistinguishable) non-resident states \(x_s \notin R_s\). Further, message computation is reduced from \(O(|\mathcal L|^{|\alpha |})\) in general to \(O(K^{|\alpha |})\). Both memory and runtime savings can therefore be significant.
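The compressed representation can be sketched as follows (the naming is ours, not taken from the implementation in Sect. 5): K values aligned with the resident set plus one shared value for all non-resident labels, which still permits lossless lookup over the full label space:

```cpp
#include <vector>
#include <cstdio>

// Sketch of a (K+1)-dimensional message for a resident set R_s: one entry
// per resident label plus a shared entry '*' for all non-resident labels.
struct Msg {
  std::vector<int>    resident;  // R_s, K labels
  std::vector<double> val;       // val[i] belongs to resident[i]
  double              star;      // common value of all non-resident labels
  double at(int label) const {   // lossless lookup over the full label space
    for (size_t i = 0; i < resident.size(); ++i)
      if (resident[i] == label) return val[i];
    return star;
  }
};

int main() {
  Msg m{{4, 17, 23}, {0.5, 1.25, 0.0}, 3.0};  // K = 3 resident labels
  std::printf("mu(17) = %.2f, mu(99) = %.2f\n", m.at(17), m.at(99));  // 1.25, 3.00
}
```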

\(\varPhi _K\) has another property that will be useful in practical implementations: given one element \(\phi \in \varPhi _K\), we can create another \(\phi ' \in \varPhi _K\) by modifying one of the resident sets \(R_s\), and every element of \(\varPhi _K\) can be generated by a sequence of such local alterations of resident sets. This property allows us to construct \(\phi '\) from the current potentials \(\phi \) in Algorithm 1 solely by local changes.

3 Nested MAP Inference on Trees

When the underlying graph is a tree, exact MAP inference can be performed efficiently by min-sum belief propagation, which is essentially an instance of dynamic programming. In this section we demonstrate how the message passing steps of min-sum BP can be interleaved with updating the currently active potentials \(\phi \) in order to reduce the target objective \(\overline{v}(\varPhi )\). The generic method (Algorithm 1) is specialized in Algorithm 2 for MAP inference problems on trees.

Algorithm 2. Nested MAP inference on trees.

The current estimates \(\hat{v}\) (line 4) and \(\breve{v}\) (line 7) for \(\overline{v}(\varPhi )\) are efficiently computed and updated via min-sum BP. Note that these values are not actually used in the algorithm (other than for tracking and reporting progress). In order to obtain \(\hat{v}\) (and \(\breve{v}\)) it is only necessary to compute upward messages towards the root. Further, if \(\phi \) and \(\phi '\) differ only in their potentials at the root node s (i.e. \(\phi _\alpha = \phi '_\alpha \) if \(s \notin \alpha \)), then all upward messages other than the ones incoming at s can be reused when computing \(\breve{v}\).

Moreover, when a new root (e.g. \(\hat{s}\)) is selected, only the messages along the (directed) path from s to \(\hat{s}\) need to be recomputed in order to make all upward messages valid again. Hence, if \(\hat{s}\) is a direct neighbor of s (i.e. a child of s), then only the messages on the single edge \(\hat{s} \rightarrow s\) are invalidated and need recomputation. It is sensible to traverse neighboring nodes, since updated potentials at node s have the strongest impact on adjacent nodes. Overall we can state the following: message passing steps updating \(\mu \) and minimization steps w.r.t. \(\phi \) can be interleaved, yielding an efficient implementation of the generic descent method in Algorithm 1. The most important invariant of the method is that upward messages are always consistent with the current potentials \(\phi \). Since \(\hat{v}\) computed in line 4 is non-increasing by construction and lower bounded by \(v^*(\theta )\), we obtain the following

Proposition 1

The sequence of values \(\hat{v}\) generated in line 4 of Algorithm 2 converges.

The algorithm is presented for the particular choice \(\varPhi = \varPhi _K\) with K user-specified. A further design choice is that \(\phi \) and \(\phi '\) differ only in their resident sets \(R_s\) and \(R_s'\), by \(K^+ < K\) elements. By exploiting the particular structure of \(\varPhi \), we can actually test \(\left( {\begin{array}{c}K\\ K-K^+\end{array}}\right) \) elements from \(\varPhi \) in each step simultaneously. For notational brevity we present the algorithm with vanishing unary potentials, i.e. \(\theta _s = \phi _s \equiv 0\). This can always be achieved w.l.o.g. by reparametrizing the original potentials via \(\theta _{st}(x_s, x_t) \leftarrow \theta _{st}(x_s, x_t) + \theta _s(x_s)/\deg (s) + \theta _t(x_t)/\deg (t)\) and \(\theta _s(x_s) \leftarrow 0\) for all \(x_s, x_t\).
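A minimal sketch of this reparametrization for a single edge (s, t) (naming ours; the edge potential is stored as a dense \(L \times L\) table for simplicity):

```cpp
#include <vector>

// Sketch: spread each unary over its deg(s) incident edges, as in
// theta_st(x_s,x_t) += theta_s(x_s)/deg(s) + theta_t(x_t)/deg(t).
void absorb_unaries(std::vector<std::vector<double>>& theta_st,  // theta_st[x_s][x_t]
                    const std::vector<double>& theta_s, int deg_s,
                    const std::vector<double>& theta_t, int deg_t) {
  for (size_t xs = 0; xs < theta_st.size(); ++xs)
    for (size_t xt = 0; xt < theta_st[xs].size(); ++xt)
      theta_st[xs][xt] += theta_s[xs] / deg_s + theta_t[xt] / deg_t;
}
// Call once per edge (s,t). After all edges incident to s are updated, node s
// has contributed deg(s) * theta_s(x_s)/deg(s) = theta_s(x_s) in total, so the
// unary can be set to zero without changing any labeling's energy.
```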

Candidate set \(C_s\) : In line 5 the selection of the \(K^+\) candidate states \(C_s\) is left unspecified. Given that the objective is to lower \(\hat{v}\) as much as possible (and therefore to make \(\breve{v}\) as small as possible), \(C_s\) should contain states with small min-marginals,

$$\begin{aligned} \sum _{t \in ch(s)} \min _{x_t} \left\{ \theta _{st}(x_s, x_t) + \sum \nolimits _{r \in ch(t)} \mu _{r \rightarrow t}(x_t) \right\} . \end{aligned}$$
(8)

This criterion assumes knowledge of the true unary and pairwise potentials associated with (a small number of) candidate states. In our implementations, given an enlarged candidate set \(\tilde{C}_s\) of states, \(C_s\) is obtained by taking the \(K^+\) states from \(\tilde{C}_s\) with the smallest min-marginals. There are many options to generate \(\tilde{C}_s\). In our experiments we determine \(\tilde{C}_s\) by generating one sample for each state in the neighboring resident sets, i.e. \(\tilde{C}_s = \big \{ H(x_t) : x_t \in \bigcup _{t \in N(s)} R_t \big \}\), where H is a stochastic function that aims to model the pairwise interactions, i.e. \({-\!\log } P(H(x_t), x_t) \approx \theta _{st}(H(x_t), x_t)\). Further details of the implementations are given in Sect. 5.
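A sketch of this selection step is given below (naming ours; `H` stands for the stochastic proposal function and `min_marginal` for the evaluation of Eq. 8 at a candidate state):

```cpp
#include <vector>
#include <algorithm>
#include <functional>

// Sketch of the candidate selection in line 5 of Algorithm 2: build an
// enlarged set C~_s from proposals H(x_t) over the neighboring resident sets,
// then keep the K+ states with the smallest min-marginals (Eq. 8).
std::vector<int> select_candidates(
    const std::vector<std::vector<int>>& neighbor_resident_sets,  // R_t, t in N(s)
    std::function<int(int)> H,                // stochastic proposal function
    std::function<double(int)> min_marginal,  // evaluates Eq. 8 at a state
    int K_plus) {
  std::vector<int> C_tilde;
  for (const auto& R_t : neighbor_resident_sets)
    for (int x_t : R_t) C_tilde.push_back(H(x_t));
  // Deduplicate proposals (a practical detail, not prescribed by the text).
  std::sort(C_tilde.begin(), C_tilde.end());
  C_tilde.erase(std::unique(C_tilde.begin(), C_tilde.end()), C_tilde.end());
  // Rank by min-marginal and keep the K+ best candidates.
  std::sort(C_tilde.begin(), C_tilde.end(),
            [&](int a, int b) { return min_marginal(a) < min_marginal(b); });
  if ((int)C_tilde.size() > K_plus) C_tilde.resize(K_plus);
  return C_tilde;
}
```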

Limit points: While Algorithm 2 is guaranteed to be non-increasing in its estimate \(\hat{v}\), this does not imply that \(\hat{v} \rightarrow \overline{v}(\varPhi )\). It will be shown below that e.g. iterated conditional modes is an instance of our framework, hence a poor limit point could be returned by Algorithm 2. In our experiments we found that setting \(K \ge 3\) is sufficient to produce results that are often indistinguishable from the true MAP solution.

Connection to move-making methods: At this point we are able to show that move-making algorithms for MAP inference such as ICM (iterated conditional modes [2]), graph-cuts [4] and fusion moves [13] (or any other primal move-making framework) are instances of the generic method in Algorithm 1. Since these algorithms iteratively update a single primal label assignment, the set of potential vectors is given by a variation of \(\varPhi _1\),

$$\begin{aligned} \varPhi _1^\infty := \bigg \{ \phi&:\phi _\alpha ({\mathbf x}_\alpha ) = {\left\{ \begin{array}{ll} \theta _\alpha ({\mathbf x}_\alpha ) &{} \text {if } \forall s \in \alpha : x_s \in R_s \\ \infty &{} \text {otherwise} \end{array}\right. },\,R_s \subseteq \mathcal{L} : |R_s| = 1 \bigg \}, \end{aligned}$$
(9)

where there is exactly one resident state per node (the best solution found so far). The upper bound \(\overline{\theta }_\alpha \) is always set to \(\infty \). The transition from \(\phi \) to \(\phi '\) in all methods is done by first augmenting \(\phi \) to \(\hat{\phi }\) by allowing a larger resident set (adding all potentials for a single node in ICM, adding a constant state at each node in graph-cuts, and expanding the resident set by a global proposal in fusion moves), followed by a “projection” of \(\hat{\phi }\) to an element \(\phi ' \in \varPhi _1^\infty \) that decreases the MAP objective. In graph-cuts and fusion moves this projection to find the optimal \(\phi ' \in \varPhi _1^\infty \) is the computationally expensive step. Since \(K=1\) and \(\overline{\theta }= \infty \), the messages are uninformative, and the determination of \(\phi '\) is therefore agnostic to any non-resident state.

In light of these observations, one can interpret Algorithm 2 as a variant of ICM with the following important differences: the resident sets contain more than one element, and by maintaining meaningful messages for non-resident states, sensible, finite estimates of the values (i.e. min-marginals) of even non-resident states are available. In contrast to other, richer move-making algorithms, the proposed method requires only local proposals.

4 Nested MAP Inference on Loopy Graphs

Most MAP instances are defined on graphs containing cycles, most prominently the 4- or 8-neighborhood lattices used in computer vision applications. Algorithm 2 is not directly applicable to loopy graphs, since it relies crucially on the properties of belief propagation on a tree. There are two natural ways to extend the algorithm to general graphs, which are described in the following.

4.1 Limited Memory SGM

One straightforward option is to use a fixed tree decomposition, run Algorithm 2 independently on the trees, and extract the desired label assignment using the obtained min-marginals. This approach can be seen as a limited-memory version of a semi-global method [8], which itself can be interpreted as one iteration of TRW message passing [5]. Let \(\mathcal{T}\) be a tree decomposition such that each \(T \in \mathcal{T}\) is a tree over \(\mathcal E\), and let \(\{\theta ^T\}\) be potentials such that \(\sum _{T \in \mathcal T} \theta ^T = \theta \). For each \(T\in \mathcal T\) let \(\hat{v}^T\) be the sequence of estimates for \(v^*(\theta ^T)\) maintained by Algorithm 2; then in light of Proposition 1 we conclude

Proposition 2

The quantity \(\sum _{T \in \mathcal T} \hat{v}^T\) is convergent.

In contrast to standard tree decompositions for MAP inference, the value of \(\sum _{T \in \mathcal T} \hat{v}^T\) is not necessarily a lower bound for \(v^*(\theta )\), since the attained surrogate potentials \(\phi ^T\) may be strictly larger than \(\theta ^T\). In the supplementary material [28] it is shown how to obtain a (rather loose) upper bound for \(v^*(\theta )\).
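For concreteness, the label extraction at a node can be sketched as follows (naming ours; the per-tree min-marginals are assumed to have been expanded to the full label space):

```cpp
#include <vector>
#include <limits>

// Sketch of the LM-SGM label extraction at one node: sum the min-marginals
// obtained by Algorithm 2 on each tree T and pick the label with the smallest
// aggregated cost, as in semi-global matching.
int extract_label(const std::vector<std::vector<double>>& min_marg_per_tree,
                  int num_labels) {
  int best = 0;
  double best_cost = std::numeric_limits<double>::infinity();
  for (int x = 0; x < num_labels; ++x) {
    double cost = 0.0;
    for (const auto& m_T : min_marg_per_tree) cost += m_T[x];  // sum over trees
    if (cost < best_cost) { best_cost = cost; best = x; }
  }
  return best;  // label assigned at this node
}
```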

Figure 2 illustrates results for dense stereo and optical flow using horizontal, vertical and diagonal chains as suggested in [8]. We refer to Sect. 5.1 for a description of the employed unary and pairwise potentials. SGM relies on informative data terms to reach agreement among the tree solutions in one step, which explains why, in our experiments, it performed poorly for image completion tasks with their weak data terms (discussed in Sect. 5.2). For dense correspondence problems with relatively informative unary potentials, the MAP solutions obtained by limited memory SGM are qualitatively similar to the results returned by the method described in the following.

Fig. 2. Limited memory semi-global method (LM-SGM) applied on \(900 \times 750\) stereo images (a, c) and on optical flow (e, f). We compare to SGM over the full label space (b, d). Best viewed on screen.

4.2 Loopy Message Passing

Often a static tree decomposition does not yield satisfactory results for MAP problems, and proper inference on graphs with cycles is required. In such scenarios computing \(\hat{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ) = v^*(\phi )\) and \(\breve{v} = \min _\tau E_{\text {MAP}}(\tau ; \phi ') = v^*(\phi ')\) exactly in general requires an iterative method that converges to the desired value, even for “efficient” potentials from \(\varPhi _K\). Many methods to (approximately) solve \(\min _\tau E_{\text {MAP}}(\tau ; \phi )\) strongly resemble min-sum belief propagation (e.g. [6, 7, 10, 25]), and therefore it is natural to consider respective adaptations of Algorithm 2: in short, the messages \(\mu \) in Algorithm 2 are replaced by the reparametrizations \(\lambda \) appearing in the dual program \(E_{\text {MAP}}^*(\lambda ; \theta )\) (or one of its variants), and the message passing steps (e.g. in line 11) are replaced by the respective message updates derived from the dual program. The modification of Algorithm 2 to loopy graphs is given in Algorithm 3 below.

Algorithm 3. Nested MAP inference on loopy graphs.

If we go back to Algorithm 1, the main difficulty in applying it to loopy graphs is that only lower and upper bounds for \(\hat{v}\) and \(\breve{v}\) are easily available via convex duality: the current dual objective is always a lower bound for the optimal value, and an upper bound is available by extracting a primal solution \(\tilde{{\mathbf x}}(\lambda )\) for given messages \(\lambda \), i.e. \(E_{\text {MAP}}(\tilde{{\mathbf x}}(\lambda ); \phi ) \ge v^*(\phi )\). Assume we have a lower bound \(\hat{v}_{\text {LB}}\) for \(\hat{v} = v^*(\phi )\) and an upper bound \(\breve{v}_{\text {UB}}\) for \(\breve{v} = v^*(\phi ')\). If \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\), then from the chain of inequalities \(v^*(\phi ') \le \breve{v}_{\text {UB}} < \hat{v}_{\text {LB}} \le v^*(\phi )\) we conclude that \(v^*(\phi ') < v^*(\phi )\). Thus, we obtain the following proposition:

Proposition 3

If \(\phi \leftarrow \phi '\) is only executed if \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\) in line 5 of Algorithm 1, then the sequence \(\hat{v}\) generated by the algorithm is non-increasing.

Unfortunately the condition \(\breve{v}_{\text {UB}} < \hat{v}_{\text {LB}}\) is a very strong one and requires the duality gap \(E_{\text {MAP}}(\tilde{{\mathbf x}}(\lambda ); \phi ) - E_{\text {MAP}}^*(\lambda ; \phi )\) to be extremely small in order to be satisfied. Consequently, these theoretically justified extensions of Algorithms 1 and 2 to loopy graphs do not perform satisfactorily in practice.

Hence, we propose to use the dual objectives \(E_{\text {MAP}}^*(\cdot ; \phi )\) and \(E_{\text {MAP}}^*(\cdot ; \phi ')\) as surrogates for \(v^*(\phi )\) and \(v^*(\phi ')\), respectively. Since we use lower bounds for both \(\hat{v}\) and \(\breve{v}\), monotonic behavior of \(\hat{v} = v^*(\phi )\) is not guaranteed. In practice, \(\hat{v}\) decreases largely monotonically, showing only minor oscillations (as shown in the supplementary material; Footnote 5). In some applications (most notably image completion in Sect. 5.2) we observed a rather large duality gap between the dual objective \(\hat{v}\) and the value of the extracted primal solution. In this case we run a standard TRW-S-like message passing algorithm to lower the primal-dual gap after Algorithm 2 has completed a full traversal of the nodes in the graph. In future work we aim to obtain stronger estimates for \(v^*( \phi ')\) via tree-block updates [23].
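The two acceptance tests can be summarized as follows (a sketch with our naming; the bounds are assumed to be computed from the current dual variables and extracted primal solutions):

```cpp
// Sketch of the acceptance tests on loopy graphs. The safe test from
// Proposition 3 compares an upper bound for v*(phi') against a lower bound
// for v*(phi); the relaxed test used in practice compares the dual
// objectives E*_MAP directly.
bool accept_safe(double v_breve_UB, double v_hat_LB) {
  return v_breve_UB < v_hat_LB;  // implies v*(phi') < v*(phi), but rarely fires
}
bool accept_relaxed(double dual_phi_new, double dual_phi_cur) {
  return dual_phi_new < dual_phi_cur;  // surrogate test; monotonicity not guaranteed
}
```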

5 Results

In this section we show approximate MAP solutions obtained by our method in two very different application scenarios: the first is dense correspondence estimation (including stereo disparity estimation and optical flow), and the second is image completion via a random field model. In both settings the underlying graph is a 4-connected grid. We use the loopy extension of Algorithm 2 in our experiments. We fix the resident set size K to 5 in all applications, hence the memory consumption required for the messages is \((5+1) \times 4 = 24\) floating point values per node (one \((K{+}1)\)-dimensional message per incident edge). The algorithm is implemented in straightforward C++ with OpenMP enabled, and the runtimes are reported for a dual Xeon E5-2690 system with 64 GB of main memory. GPU acceleration is not employed.

5.1 Dense Correspondence

We demonstrate results on dense correspondence problems from the Middlebury benchmark datasets [1, 19]. Note that the goal is not to propose a competing new model for dense correspondence estimation, but to demonstrate the effectiveness of our method for inferring MAP estimates in these problems.

Dense disparity estimation: For the chosen dense disparity instances the state space has 120 or 240 elements (corresponding to integral disparities). The data term (unary potential) attains values in [0, 1] and is given by \( \max \left\{ 0, \frac{1-\text {ZNCC}}{2} \right\} \), where ZNCC is the zero-mean NCC of \(5\times 5\) gray-scale image patches. We use the 3-way \(P_1\)-\(P_2\) pairwise potential from [8] (with \(P_1=1/2\) and \(P_2=1\)). The sampling function H to determine \(\tilde{C}_s\) is given as follows: from ground truth disparity maps we estimate the relative frequencies of the events \(x_s = x_t\) (\({\approx }94\%\)), \(x_s = x_t \pm 1\) (\({\approx }5.8\%\)), and \(|x_s-x_t|\ge 2\) (\({\approx }0.2\%\)) for neighboring pixels s, t; hence for a random variable \(u \sim U[0,1]\) we have

$$\begin{aligned} H(x;u)&= {\left\{ \begin{array}{ll} x &{} \text {if } u< 0.94 \\ x\pm 1 &{} \text {if } u < 0.998 \\ U[x_{min}, x_{max}] &{} \text {otherwise.} \end{array}\right. } \end{aligned}$$
(10)
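A sketch of this sampler follows; the 50/50 split between the +1 and −1 jumps is an assumption on our part, as Eq. 10 does not specify it:

```cpp
#include <random>
#include <algorithm>

// Sketch of the proposal function H from Eq. 10. The +1/-1 split is assumed
// to be 50/50, which the text does not specify.
int sample_H(int x, int x_min, int x_max, std::mt19937& rng) {
  std::uniform_real_distribution<double> U01(0.0, 1.0);
  double u = U01(rng);
  int y;
  if (u < 0.94)       y = x;                                   // keep the disparity
  else if (u < 0.998) y = x + (U01(rng) < 0.5 ? 1 : -1);       // jump by one
  else {                                                       // uniform re-draw
    std::uniform_int_distribution<int> U(x_min, x_max);
    y = U(rng);
  }
  return std::min(std::max(y, x_min), x_max);                  // clamp to label range
}
```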

Figure 3 shows the evolution of the attained primal objective with respect to wall clock time for a full scale convex BP implementation and two variants of the proposed method: the first uses a non-hierarchical (“flat”) approach and initializes \(R_s\) using uniform sampling, and the second uses a multi-scale framework (with 5 octaves), where one element of the initial \(R_s\) is fixed to the label propagated from the coarser level. Clearly, our method reaches a lower energy much faster than full-scale BP (which also requires about 20–40 times more memory). Figure 1(a)–(d) shows the corresponding labeling results returned by the hierarchical approach after 2, 4, 8, and 16 passes. Further results are available in the supplementary material [28].

Fig. 3. Evolution of primal E and dual \(E_{\text {MAP}}^*\) energies for dense disparity estimation with respect to wall time. \(E_{\text {MAP}}\) is the converged primal energy of full scale BP.

Fig. 4. Evolution of primal energies for dense optical flow problems.

Optical flow: We ran similar numerical experiments for optical flow instances [1]. The state space contains \(129^2\) flow vectors corresponding to a 16 pixel search range at quarter-pixel accuracy (\(2 \cdot 16 \cdot 4 + 1 = 129\) values per component). We upscale the original grayscale images to 400% and use the same ZNCC-based score as for dense stereo (but computed on \(7\times 7\) patches from the upscaled images). The pairwise smoothness term is the \(P_1\)-\(P_2\) model applied separately to the horizontal and vertical components of the motion vector. The decrease in primal energy for the solution returned after the respective number of passes with respect to wall clock time is shown in Fig. 4. In this case the memory consumption is \(6/129^2\), or less than \(0.04\%\), of that of running full inference (Footnote 6), and usable motion fields are obtained after about one second of CPU time. Visualizations of the corresponding flow fields are depicted in Fig. 1(f)–(i) and in the supplementary material [28].

Fig. 5. Masked source images and image completion results. [Best viewed on screen.]

5.2 Image Completion

Dense correspondence problems can usually rely on relatively strong unary potentials (image matching scores), and the role of the pairwise potentials is to provide a regularizer where the unary potential leads to ambiguous estimates. Hence, we show in this section results on the very complementary problem of image completion, where unary potentials are only available near the border of the missing region, and the labels assigned in the interior rely entirely on the pairwise potentials to avoid visual artefacts. We use a similar framework and settings as proposed in [11], where the image completion problem was formulated as a MAP inference problem with the state space being patches extracted from the known parts of the given image. Both the unary and pairwise potentials for this problem are non-parametric and instance dependent. For all results in Fig. 5 we used \(13 \times 13\) patches to compute the unary potentials (compatibility with known pixels) and pairwise potentials (compatibility where patches assigned to adjacent nodes overlap). As in [11] the region to be filled is represented by a subsampled grid in the inference problem such that adjacent patches have approximately 50% overlap. Depending on the provided image, the size of the state space varies between 10,000 and 50,000 patches available for image completion. The sampling function H proposes the appropriately shifted patch in the source image (60%), one of the 5 nearest neighbors in patch appearance (20%), and a random source image patch (20%). The results shown in Fig. 5 are obtained after running 100 passes of the loopy min-max inference method, which takes about 10 s using a straightforward OpenMP accelerated C++ implementation. We did not investigate further speedups such as the FFT-based acceleration suggested in [11]. The resulting images are generated by simply averaging pixels covered by multiple patches, and are qualitatively comparable to the ones reported in [11].
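For illustration, the patch proposal distribution can be sketched as follows (naming ours; `shifted` and `knn` are assumed precomputed lookup tables, which the text does not detail):

```cpp
#include <random>
#include <vector>

// Sketch of the patch proposal H for image completion: with probability 0.6
// propose the appropriately shifted source patch, with 0.2 one of the 5
// nearest neighbors in patch appearance, with 0.2 a random source patch.
int propose_patch(int current, const std::vector<int>& shifted,
                  const std::vector<std::vector<int>>& knn,
                  int num_patches, std::mt19937& rng) {
  std::uniform_real_distribution<double> U01(0.0, 1.0);
  double u = U01(rng);
  if (u < 0.6) return shifted[current];               // shifted source patch
  if (u < 0.8) {
    std::uniform_int_distribution<int> pick(0, 4);    // one of 5 nearest neighbors
    return knn[current][pick(rng)];
  }
  std::uniform_int_distribution<int> any(0, num_patches - 1);
  return any(rng);                                    // random source patch
}
```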

6 Conclusion and Future Work

In this work we formulate resource-efficient MAP inference as a joint optimization problem over messages and over upper bounding potentials that allow efficient inference. We show that this nested min-max problem can be efficiently minimized on trees, yielding a convergent algorithm, and that its loopy extension also returns convincing MAP solutions in practice. Future work will address (i) theoretical aspects such as a better understanding of the method on loopy graphs, and (ii) practical aspects such as the impact of diverse [16] and non-local [14] label proposals.