1 Introduction

Stereo matching has been one of the key problems in computer vision for years, and many recent publications [1,2,3,4] have focused on it. Segment-based methods [7,8,9] in particular have attracted increasing attention because of their good performance.

Most segment-based stereo matching algorithms follow a four-step pipeline [5]: first, matching cost computation; second, cost aggregation; third, disparity computation/optimization; fourth, disparity refinement. Traditional disparity refinement methods, such as left-right consistency checking [10], hole filling [11], and median filtering [12, 13], cannot provide highly accurate disparity estimates. Yoon et al. [14] adopted an adaptive supporting-weight approach for correspondence search to refine the local aggregation results. Yang [15] first proposed the non-local aggregation method and refined the non-local results with a minimum spanning tree (MST). Building on Yang’s method, Mei et al. [16] proposed a segment-tree (ST) structure for non-local cost aggregation and enhanced the disparity values with a depth-color segmentation method extended from a classic graph-based segmentation method [17]. Region-based methods [18, 19], introduced to further improve disparity estimation, obtain better results especially in low-textured regions.

In this paper, we propose a stereo refinement algorithm based on merging super-pixels (MSP). The algorithm consists of seven steps: first, estimating the initial disparity values with a local or non-local method and locating the super-pixels in the stereo images with a depth-color segmentation method; second, estimating robust information for each super-pixel by voting; third, searching for the supporting neighbors of each super-pixel; fourth, merging super-pixels into regions based on the correlation of adjacent super-pixels; fifth, updating the information of each region and identifying unreliable regions; sixth, correcting each unreliable region with its supporting region; seventh, assigning a disparity value to each pixel by considering the disparity of its correlative region.

This paper makes two main contributions: (1) we merge super-pixels into stable regions, so that the disparity of each pixel can be estimated under the smoothness constraint of its correlative region, which maintains continuity within areas of equal disparity; (2) we optimize the cooperative relations between adjacent regions to reduce unreliable disparity values and obtain high-quality depth boundaries.

2 Obtaining Raw Cost Aggregation and Initial Disparities

2.1 Obtaining Cost and Disparity in Pixel Domain

First of all, we employ a local or non-local algorithm to obtain the raw cost aggregation and initial disparity values. These algorithms typically select disparities from multiple candidates with a simple winner-take-all (WTA) strategy, so the resulting disparity estimates are not accurate enough. Our algorithm improves their accuracy in the later steps.

2.2 Over-Segmentation Based on Color-Depth

Segment-based algorithms usually assume that disparity values vary smoothly within each segment and that depth discontinuities occur only on segment boundaries. In practice this assumption is not always met, so over-segmentation based on color and depth is preferred. In this paper, we use efficient graph-based image segmentation [16, 17]. Figure 1 shows the disparity map of the Teddy stereo pair and the segmentation result of the left image produced by the method in [16]. We refer to each segment of the over-segmentation as a super-pixel.

Fig. 1. The segmentation result of the left image obtained with the color-depth based over-segmentation method, and the disparity map of the Teddy stereo pair obtained with the segment-tree stereo matching algorithm [16]. (Color figure online)

2.3 Cross-Checking Test

First, a local or non-local cost aggregation method is run with the left and the right image as the reference image in turn, which yields two corresponding disparity maps. To eliminate outliers in the disparity map and obtain a robust disparity estimate for each segment, the cross-checking test is applied. This test identifies the occlusions and matching errors in the disparity map; both are called unreliable pixels in this paper, and the remaining pixels are called stable. After cross-checking, the cost volume is refined according to [15]. Let \( D \) denote the disparity map; a new cost value is computed for each pixel \( p \) at each disparity level \( d \) as:

$$ C_{d}^{new} \left( p \right) = \left\{ {\begin{array}{*{20}c} {\left| {d - D\left( p \right)} \right|,} & {\text{if } p \text{ is stable and } D\left( p \right) > 0} \\ {0,} & {\text{otherwise}} \\ \end{array} } \right. $$
(1)
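
As a minimal sketch of the cross-checking test and of the cost refinement in Eq. (1), the following Python code assumes integer disparity maps stored as NumPy arrays and the usual convention that a left pixel at column x with disparity d matches the right pixel at column x - d. The function names, the tolerance parameter `tol` and the array layout are illustrative assumptions, not part of the original method description.

```python
import numpy as np


def cross_check(disp_left, disp_right, tol=0):
    """Return a boolean map that is True where the left disparity is stable."""
    h, w = disp_left.shape
    rows = np.arange(h)[:, None]
    # Column of the matching pixel in the right view (assumed convention: x - d).
    cols = np.arange(w)[None, :] - disp_left
    inside = (cols >= 0) & (cols < w)
    cols_safe = np.clip(cols, 0, w - 1)
    agree = np.abs(disp_left - disp_right[rows, cols_safe]) <= tol
    return inside & agree


def refined_cost(disp_left, stable, num_disp):
    """Eq. (1): |d - D(p)| for stable pixels with D(p) > 0, and 0 otherwise."""
    d = np.arange(num_disp)[:, None, None]
    cost = np.abs(d - disp_left[None, :, :]).astype(np.float32)
    keep = stable & (disp_left > 0)
    return cost * keep[None, :, :]
```

Pixels that fail the test (or whose match falls outside the image) receive a zero cost at every disparity level, so they do not favor any disparity in a subsequent aggregation.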

3 Robust Super-Pixels Merging

Super-pixels are sensitive to unreliable pixels, and they are correlated rather than independent. If each super-pixel is handled in isolation, the disparity values around the boundary between adjacent regions that share the same disparity may become discontinuous. In this paper, we propose an effective approach that merges super-pixels into stable regions to resolve this problem.

3.1 Voting the Information of Super-Pixel

Before merging, the information of each super-pixel is obtained by voting. The information comprises the RGB values, the disparity, and a flag that indicates whether the super-pixel is unreliable. The voting process is as follows:

First, the RGB values of a super-pixel are estimated from the RGB values of all pixels within it. Each channel is determined by voting in a one-dimensional histogram, where the x-coordinate is the value of that channel and the y-coordinate is the number of pixels with that value. After the histogram is sorted and smoothed with a Gaussian filter, the value of each channel is taken at the maximum of the corresponding histogram;

Second, the disparity of each super-pixel is estimated in the same way, after the unreliable pixels have been discarded;

Third, if the number of unreliable pixels in a super-pixel exceeds a given fraction \( W_{occ} \) of all pixels within the super-pixel, we regard it as an unreliable super-pixel and set its flag to true.
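
The voting procedure can be sketched as follows. This is an illustrative Python version under stated assumptions: the histogram has one bin per integer value, the Gaussian width `sigma` is a hypothetical parameter that the text does not specify, and the helper names `vote_mode` and `vote_superpixel` are ours.

```python
import numpy as np


def vote_mode(values, num_bins, sigma=1.0):
    """Histogram integer values, smooth with a Gaussian kernel, return the peak bin."""
    hist = np.bincount(values, minlength=num_bins).astype(np.float64)
    radius = max(1, int(3 * sigma))
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-0.5 * (x / sigma) ** 2)
    kernel /= kernel.sum()
    # Full convolution, cropped so that output bin i is centered on input bin i.
    smoothed = np.convolve(hist, kernel, mode="full")[radius:radius + hist.size]
    return int(np.argmax(smoothed))


def vote_superpixel(colors, disps, stable, num_disp, w_occ=0.4):
    """colors: (n, 3) uint8, disps: (n,) int, stable: (n,) bool, for one super-pixel."""
    rgb = tuple(vote_mode(colors[:, c].astype(np.int64), 256) for c in range(3))
    # Unreliable if the share of unreliable pixels exceeds the fraction w_occ.
    unreliable = bool((~stable).sum() > w_occ * stable.size)
    # The disparity vote uses stable pixels only.
    disparity = vote_mode(disps[stable], num_disp) if stable.any() else None
    return rgb, disparity, unreliable
```

Smoothing before taking the maximum makes the vote less sensitive to isolated histogram spikes than a plain mode would be.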

3.2 Supporting Neighbors Selection

To avoid a merely piecewise-smooth result, the super-pixels are merged into stable regions by considering the supporting neighbors of each super-pixel. Let \( W_{i} \left( {S_{p} } \right) \) denote the weight of the correlation between the super-pixel \( S_{p} \) and its neighboring super-pixel \( S_{i} \); it takes into account the differences in disparity and color between \( S_{p} \) and \( S_{i} \). The ratio \( \alpha \) of the common border length to the perimeter can be written as:

$$ \alpha = \frac{{N_{i} }}{{N_{all} }} $$
(2)

where \( N_{i} \) denotes the length of the boundary between super-pixels \( S_{p} \) and \( S_{i} \), and \( N_{all} \) denotes the perimeter of super-pixel \( S_{p} \). Thus, \( W_{i} \left( {S_{p} } \right) \) can be written as:

$$ W_{i} \left( {S_{p} } \right) = \alpha \frac{{\left| {D_{r} \left( {S_{p} } \right) - D_{r} \left( {S_{i} } \right)} \right| \cdot \sigma_{s} }}{{\sigma_{c} }} + \left( {1 - \alpha } \right)\frac{{I_{r} \left( {S_{p} } \right) - I_{r} \left( {S_{i} } \right)}}{{\sigma_{c} }} $$
(3)

where \( S_{i} \) ranges over all neighbors of super-pixel \( S_{p} \). \( \sigma_{s} \) and \( \sigma_{c} \) are two variables that adapt to the disparity range and the color range in order to normalize \( D_{r} \) and \( I_{r} \) to the range [0, 1]. \( D_{r} \) denotes the disparity of a super-pixel and \( I_{r} \) denotes its RGB values.

It is worth noting that the proposed approach depends only on the contextual information of the adjacent super-pixels, so that no ambiguity or artificial factor is involved.

The supporting neighbors are selected by minimizing \( W_{i} \left( {S_{p} } \right) \) over \( i = 1, 2, \ldots, n \). Because several neighbors may attain the same minimum, the supporting neighbors of super-pixel \( S_{p} \) consist of all neighboring super-pixels that minimize \( W_{i} \left( {S_{p} } \right) \).
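
The selection of supporting neighbors can be illustrated with a short Python sketch. It rests on assumptions that go beyond the text: the disparity difference is divided by \( \sigma_{s} \) and the color difference by \( \sigma_{c} \) so that both terms lie in [0, 1], the color difference is measured as the mean absolute difference over the three channels, and ties are detected with a small tolerance `eps`. All function and parameter names are hypothetical.

```python
def support_weight(disp_p, disp_i, rgb_p, rgb_i, border_len, perimeter,
                   sigma_s, sigma_c):
    """Blend normalized disparity and color differences with the border ratio alpha."""
    alpha = border_len / perimeter  # Eq. (2): shared border length over perimeter
    disp_term = abs(disp_p - disp_i) / sigma_s
    # Assumed color distance: mean absolute channel difference, normalized by sigma_c.
    color_term = sum(abs(a - b) for a, b in zip(rgb_p, rgb_i)) / (3.0 * sigma_c)
    return alpha * disp_term + (1.0 - alpha) * color_term


def supporting_neighbors(p, neighbors, disp, rgb, border_len, perimeter,
                         sigma_s, sigma_c, eps=1e-9):
    """Return all neighbors of p that attain the minimum weight, plus the weights."""
    weights = {
        i: support_weight(disp[p], disp[i], rgb[p], rgb[i],
                          border_len[(p, i)], perimeter[p], sigma_s, sigma_c)
        for i in neighbors
    }
    best = min(weights.values())
    support = sorted(i for i, w in weights.items() if w - best <= eps)
    return support, weights
```

Returning every minimizer, rather than a single one, mirrors the statement that several neighbors may share the same minimum.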

3.3 Merging Super-Pixels to Stable Region

This step aims to obtain stable regions by merging super-pixels. It distinguishes the following three cases:

  1. (a)

    If two neighboring super-pixels are both reliable and their disparities are equal, then merge the two super-pixels;

  2. (b)

    If at least one of the two super-pixels is unreliable and one is a supporting neighbor of the other, then merge the two super-pixels;

  3. (c)

    In all remaining cases, the super-pixels are not merged. A super-pixel that has not been merged with any other super-pixel is itself regarded as a stable region. We merge the super-pixels by using a forest structure. (The forest construction algorithm, which treats each super-pixel as a pixel, is similar to the ST structure algorithm in [16].)
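
The three merging cases can be sketched with a union-find (disjoint-set) forest. This is a simplified stand-in for the segment-tree style forest construction referenced above, not a reproduction of it, and it reads case (b) as "at least one super-pixel is unreliable and one supports the other". The names `merge_superpixels`, `edges` and `support` are illustrative.

```python
def find(parent, x):
    """Return the root of x, shortening the path along the way."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_superpixels(n, edges, disparity, unreliable, support):
    """
    n:          number of super-pixels, labeled 0..n-1
    edges:      iterable of (p, q) pairs of adjacent super-pixels
    disparity:  voted disparity per super-pixel
    unreliable: boolean flag per super-pixel
    support:    dict mapping each super-pixel to the set of its supporting neighbors
    Returns a region label for every super-pixel.
    """
    parent = list(range(n))
    for p, q in edges:
        both_reliable = not unreliable[p] and not unreliable[q]
        supported = q in support[p] or p in support[q]
        case_a = both_reliable and disparity[p] == disparity[q]
        case_b = (not both_reliable) and supported
        if case_a or case_b:
            parent[find(parent, p)] = find(parent, q)
    # Case (c): anything left unmerged keeps its own label, i.e. its own region.
    return [find(parent, x) for x in range(n)]
```

Super-pixels that end up with the same label form one stable region.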

Figure 2 shows the result of merging super-pixels for the left image and the disparity map after the first iteration. The experimental results show that the new segments are stable and that our method performs well in disparity estimation.

Fig. 2. The first iteration: merging the super-pixels and then estimating the disparity map based on the merged result.

4 The Principle of Unreliable Region Optimization

Unreliable pixels strongly affect disparity estimation. In this section, we propose a new method that deals with unreliable pixels by optimizing the unreliable regions. Before optimization, the information and the supporting neighbors of each region must be updated as described in Sect. 3.

The principles of unreliable region optimization are as follows:

  1. (a)

    For each unreliable region, examine its supporting neighbors: any supporting neighbor that is a reliable region, or an unreliable region that has already been optimized, is regarded as a supporting region;

  2. (b)

    If an unreliable region \( S_{u} \) has no supporting region, we select from all neighbors of \( S_{u} \) the one that minimizes \( W_{i} \left( {S_{u} } \right) \) as its supporting region;

  3. (c)

    If an unreliable region has more than one supporting region, we select the one with the minimum disparity as the final supporting region;

  4. (d)

    Assign the disparity of the final supporting region to the unreliable region, and then label the unreliable region as optimized. These four steps are applied to all unreliable regions until every one of them has been labeled as optimized.
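
The four rules above can be sketched as a simple fixed-point loop. The Python code below is one possible reading, with assumptions stated explicitly: in rule (b) the fallback neighbor is chosen only among neighbors that are reliable or already optimized (so that a valid disparity is always copied), and the loop stops if a full pass makes no progress. The data layout (dictionaries keyed by region id) and the function name are hypothetical.

```python
def optimize_unreliable(disparity, unreliable, support, weights):
    """
    disparity:  dict region -> disparity (may be None for unreliable regions)
    unreliable: iterable of unreliable region ids
    support:    dict region -> list of supporting neighbors
    weights:    dict region -> dict neighbor -> W value, covering all neighbors
    Returns a new dict with the disparities of unreliable regions replaced.
    """
    disparity = dict(disparity)
    pending = set(unreliable)  # unreliable regions not yet labeled as optimized
    progressed = True
    while pending and progressed:
        progressed = False
        for u in sorted(pending):
            # Rule (a): supporting neighbors that are reliable or already optimized.
            candidates = [s for s in support[u] if s not in pending]
            if not candidates:
                # Rule (b), as assumed here: fall back to the minimum-weight
                # neighbor among those that already carry a valid disparity.
                settled = [n for n in weights[u] if n not in pending]
                if settled:
                    candidates = [min(settled, key=weights[u].get)]
            if candidates:
                # Rules (c) and (d): copy the minimum disparity, mark as optimized.
                disparity[u] = min(disparity[s] for s in candidates)
                pending.discard(u)
                progressed = True
    return disparity
```

Because a region is removed from `pending` as soon as it is corrected, regions processed later in the same pass can already use it as a supporting region.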

5 Depth Hypotheses Generation

In this section, we obtain the accurate disparity map in two steps. First, we adopt a smoothness constraint to reduce the effect of spurious disparity estimates. Second, iterative refinement is employed to enhance the accuracy of the disparity map.

5.1 The Constraint on Smoothness of Region

To reduce the effect of spurious disparity estimates, we consider the smoothness of each stable region. Depth discontinuities usually occur around the boundaries of regions. The method therefore assigns to each pixel the disparity that minimizes the aggregated cost among the candidates close to the disparity of its correlative stable region. The optimal disparity value of pixel \( p \) within super-pixel \( S_{p} \) can be written as:

$$ \varphi_{d} \left( p \right) = \hbox{min} \left( {D_{{d_{i} }}^{A} \left( p \right)} \right),d_{i} \in \left[ {D_{r} \left( {S_{p} } \right) - \Delta d,D_{r} \left( {S_{p} } \right) + \Delta d} \right] $$
(4)

where \( D_{{d_{i} }}^{A} \left( p \right) \) is the aggregated cost of pixel \( p \) at disparity \( d_{i} \), and \( \Delta d \) is a variable that determines the range of disparities considered around the stable region disparity. If \( \Delta d \) is too small, the correct cost value may be excluded; if it is too large, the effect of spurious cost values may not be reduced. We therefore compute \( \Delta d \) adaptively as:

$$ \Delta d = \frac{R}{\gamma } $$
(5)

where \( R \) denotes the disparity range of the image and \( \gamma \) is a constant, set to 6 in all of our experiments. According to Eq. (4), the disparity value of pixel \( p \) is the \( d_{i} \) that minimizes \( D_{{d_{i} }}^{A} \left( p \right) \).
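
Equations (4) and (5) amount to a winner-take-all selection restricted to a window around the region disparity. The following Python sketch assumes that the aggregated cost is stored as a NumPy volume of shape (number of disparity levels, height, width), that \( R \) equals the number of disparity levels, and that every pixel carries the disparity of its stable region in `region_disp`; these names and the layout are illustrative assumptions.

```python
import numpy as np


def constrained_wta(cost, region_disp, gamma=6):
    """
    cost:        (num_disp, H, W) aggregated cost volume
    region_disp: (H, W) integer map holding the disparity of each pixel's stable region
    Returns the (H, W) disparity map of Eq. (4).
    """
    num_disp = cost.shape[0]
    delta = num_disp / gamma  # Eq. (5), with R taken as the number of levels
    d = np.arange(num_disp)[:, None, None]
    # Candidates farther than delta from the region disparity are excluded.
    outside = np.abs(d - region_disp[None, :, :]) > delta
    masked = np.where(outside, np.inf, cost)
    return np.argmin(masked, axis=0)
```

Masking with infinity keeps the selection a plain arg-min while guaranteeing that a spurious global minimum outside the window cannot win.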

5.2 Enhancement with Iteration

After estimating the disparity values, we use iterative refinement to enhance the disparity estimation. As shown in Fig. 3, in the first iteration the disparity value with the best cost is selected for each pixel, and a robust representative disparity value is then voted for each stable region. In the next iteration, the disparity values are refined by repeating steps 2 to 7 on the disparity map of the previous iteration: new stable regions are determined and their information is updated, and the best disparity value of each pixel is selected only among the representative disparity values of the correlative stable regions. The final disparity values are assigned after two iterations.

Fig. 3. The second iteration: merge super-pixels and then estimate the disparity map based on the merged result.

Figure 3 shows the segmentation result of the left image in the second iteration. The result is visibly better than that of the first iteration (Fig. 2). In addition, to verify the robustness of the proposed algorithm, Fig. 4 shows the merged results for the remaining stereo image pairs in the Middlebury data sets [6].

Fig. 4. From top to bottom: the merged super-pixel results of Tsukuba, Venus and Cones.

6 Experimental Results

The local algorithm [14] and the non-local algorithm [16] have been among the top performers on the Middlebury benchmark [6], but the results of this paper demonstrate that the disparity maps estimated by these algorithms can be quantitatively improved by the proposed algorithm (MSP).

All experiments in this paper strictly follow a local stereo matching pipeline [5]. The specific descriptions are as follows:

  1. (a)

    Cost computation: The same cost as in the local method [14] and the non-local method [16] is adopted in all our experiments. It is a blend of truncated color difference and truncated gradient difference.

  2. (b)

    Cost aggregation: Two cost aggregation methods are evaluated on various stereo data sets: local aggregation with adaptive supporting-weight (AW) [14] and non-local aggregation with the enhanced segment-tree (ST) [16].

  3. (c)

    Disparity optimization: The WTA (Winner-Take-All) operation is adopted in all experiments. It simply chooses for each pixel the disparity with the minimal aggregated cost.

  4. (d)

    Disparity refinement: Based on the result of (c), the merged super-pixel (MSP) refinement algorithm is applied to enhance the performance. Two parameters need to be set in this method: \( k \) is set to 0.03 and \( W_{occ} \) is set to 0.4. The final disparity map is obtained after only two iterations of the proposed algorithm.

The disparity maps of all four stereo pairs in the Middlebury data sets computed by the local method [14] are presented in Fig. 5(a). The disparity maps obtained by applying the proposed algorithm to the maps in Fig. 5(a), with different numbers of iterations, are presented in Fig. 5(b)–(c). The results in Fig. 5(b)–(c) are more accurate than the result in Fig. 5(a), which shows that the proposed method (MSP) can enhance the performance of local methods. Similarly, the proposed method is effective in improving the performance of non-local methods. Visual comparisons in Fig. 5 show that the proposed refinement method performs better within low-textured regions. For instance, the region near the hand of the teddy bear (the third row of Fig. 5) is estimated inaccurately by the cost computation method (the first step of the stereo matching pipeline). Neither the local nor the non-local cost aggregation method can correct these errors, but the proposed method obtains accurate disparity values by optimizing the unreliable region with its supporting region. Moreover, the method is more accurate around depth boundaries, such as the boundaries of the newspaper in the Venus data set (the second row of Fig. 5). Errors around depth boundaries are mostly due to noise and cause inconsistency; the method corrects them by merging super-pixels into stable regions and assigning the disparity value of each pixel under the smoothness constraint of its stable region. More details are presented in Figs. 6 and 7. According to the comparisons within the zoom-in regions in Figs. 6 and 7, MSP-2 performs better than the local and non-local methods, with more accurate estimation both in low-textured regions (Fig. 6) and around depth boundaries (Fig. 7).

Fig. 5. Experimental results using the Middlebury data sets [6]: Tsukuba, Venus, Teddy and Cones. (a) The disparity maps obtained with the local cost aggregation algorithm [14]. (b)–(c) The refined results of (a) after applying the MSP-1 and MSP-2 refinement method proposed in Sect. 2, respectively. (d) The disparity maps obtained with the non-local cost aggregation [16]. (e)–(f) The refined results of (d) after applying the MSP-1 and MSP-2 refinement method, respectively. The bold numbers under the images are the average errors (percentages of bad pixels), which show the significant quantitative improvement obtained by applying the proposed refinement method to the local and non-local stereo matching methods. The corresponding quantitative evaluation is summarized in Table 1. Visual comparison of the disparity maps obtained by the local or non-local cost aggregation method with and without MSP shows that the proposed refinement method performs better around depth boundaries. For instance, the disparity estimates around the boundaries of the newspaper (the second row) in (b)–(c) or (e)–(f) are more accurate than in (a) or (d). Moreover, the proposed refinement also enhances the performance in low-textured regions. For example, the disparity estimates within the low-textured region near the hand of the teddy bear (the third row) in (b)–(c) or (e)–(f) are more accurate than in (a) or (d).

Fig. 6. (a) The left image of the Teddy stereo pair from the Middlebury data sets [6]. (b) The zoom-in region of the yellow box. (c) The result of the local cost aggregation [14]. (d) The refined result of (c) by employing MSP once. (e) The refined result of (c) by employing MSP twice. (f) The result of the non-local cost aggregation [16]. (g) The refined result of (f) by employing MSP once. (h) The refined result of (f) by employing MSP twice. Visual comparison of the results in the low-textured region, where (d)–(e) are more accurate than (c) and (g)–(h) are more accurate than (f), shows that the proposed refinement method significantly reduces the effects of spurious disparity values estimated by the local or non-local method. (Color figure online)

Fig. 7. (a) The left image of the Venus stereo pair from the Middlebury data sets [6]. (b) The zoom-in region of the yellow box. (c) The result of the local cost aggregation [14]. (d) The refined result of (c) by employing MSP once. (e) The refined result of (c) by employing MSP twice. (f) The zoom-in region of the red box. (g) The result of the non-local cost aggregation [16]. (h) The refined result of (g) by employing MSP once. (i) The refined result of (g) by employing MSP twice. Visual comparison of the results around depth boundaries, where (d)–(e) are more accurate than (c) and (h)–(i) are more accurate than (g), shows that the proposed refinement method significantly improves the accuracy of the results estimated by the local or non-local method. (Color figure online)

The running time of the algorithm depends on the number of iterations. On a PC with a PM 2.5G CPU, the total time for processing the Tsukuba stereo pair is about 2 s. Here, the number of iterations is 2, and the time for image segmentation is about 1 s. The comparisons between the proposed refinement method and the local method [14] or the non-local method [16] are shown in Table 1. Applying the proposed method reduces the average error of the local method by 0.38 percentage points (from 6.67% to 6.29%) and improves its rank by 18.3 (from 79.5 to 61.2). For the non-local method [16], the average error is reduced by 0.61 percentage points (from 5.35% to 4.74%) and the rank is improved by 13.4 (from 37.7 to 24.3). The quantitative evaluation thus improves clearly when the local and non-local stereo matching methods are combined with our refinement method.

Table 1. Quantitative evaluation of six methods (AW [14], AW [14] + MSP-1, AW [14] + MSP-2, ST [16], ST [16] + MSP-1, ST [16] + MSP-2) on the standard Middlebury benchmark [6] with error threshold 1. The percentages of erroneous pixels in the nonocc./all/disc. regions are used to evaluate the performance of each method. The disparity estimation using the local [14] or non-local [16] method with MSP-1 is slightly more accurate than using [14] or [16] alone, while MSP-2 outperforms the other methods with the best overall accuracy.
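
The error measure named in the caption of Table 1, the percentage of erroneous pixels at error threshold 1, can be computed as in the following Python sketch. It assumes that a pixel counts as erroneous when its absolute disparity error exceeds the threshold, and that the evaluation region (nonocc., all or disc.) is supplied as a boolean mask; the function name and arguments are illustrative.

```python
import numpy as np


def bad_pixel_rate(disp, truth, mask=None, threshold=1.0):
    """Percentage of pixels in the mask whose absolute disparity error exceeds threshold."""
    if mask is None:
        mask = np.ones(truth.shape, dtype=bool)
    bad = np.abs(disp.astype(np.float64) - truth) > threshold
    return 100.0 * bad[mask].mean()
```

Calling the function once per region mask gives the three per-region percentages for a stereo pair.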

7 Conclusion

This paper proposed a novel refinement algorithm for stereo matching that permits us to obtain a high-quality, high-density disparity map of a scene from its initial disparity estimation. Its novelty lies in two aspects: first, merging super-pixels into stable regions; second, dealing with unreliable pixels by optimizing the unreliable regions.

The advantage of this algorithm is that it can restrain and correct errors both in low-textured regions and around depth boundaries, which yields a high-quality, high-density disparity map.

In the near future, we will focus on testing the algorithm with more challenging stereo data sets and various local or non-local cost aggregation methods.