
1 Introduction

In the last two decades, a huge amount of multimedia data has been stored and made available through the Internet, which has drawn the research community's attention to multimedia processing and analysis and, more specifically, to computer vision. Recent results indicate that scale-awareness helps improve final results in many computer vision tasks [6, 11, 12, 13]. Even though Deep Convolutional Neural Networks (DCNNs) have improved the performance of computer vision systems, they still face challenges, including the existence of objects at multiple scales [5].

The adoption of a hierarchical approach that incorporates information from multiple scales is one way to deal with this issue. Since segmentation is one of the first steps involved in almost every computer vision task, the use of hierarchical segmentation methods has helped improve results for different tasks [3, 15, 16]. Specifically, in the image context, a hierarchical image segmentation is a set of image segmentations at different detail levels in which the segmentations at coarser levels can be produced from simple merges of regions from segmentations at finer levels [10].

Hierarchical segmentation methods [2, 10] have been successfully used. These methods create a hierarchy of partitions that can be represented as a tree (Fig. 1). Moreover, the final result can be represented as an Ultrametric Contour Map (UCM) [1], which allows a particular segmentation (at a given observation scale) to be obtained through a simple thresholding (Fig. 2). The hierarchies are typically computed by an unsupervised process that is susceptible to under-segmentation at coarse levels and over-segmentation at fine levels. Thus, objects (or even parts of the same object) may appear at different scales due to their size or their distance from the camera. To cope with that, one may explore the use of non-horizontal cuts [8, 9]. Another possible solution is to flatten the hierarchy into a single non-trivial (or non-horizontal) segmentation, such as in [17].
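As an illustration of this thresholding, the sketch below (a minimal sketch, assuming the UCM is stored as a 2-D array of contour strengths in [0, 1]) extracts the segmentation at a given observation scale by suppressing all boundaries weaker than the threshold and labeling the remaining connected regions. Real UCMs are often defined on a doubled grid, which this sketch ignores.

```python
import numpy as np
from scipy import ndimage


def segmentation_from_ucm(ucm, threshold):
    """Extract a single segmentation from an Ultrametric Contour Map.

    Pixels whose contour strength is below `threshold` are treated as
    region interiors; connected components of that mask become segments.
    """
    interior = ucm < threshold                 # suppress weak boundaries
    labels, n_segments = ndimage.label(interior)
    return labels, n_segments
```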

Fig. 1. Example of a result obtained by a hierarchical image segmentation method.

In [7], the authors proposed to modify the final result of a hierarchical algorithm by improving its alignment, i.e., by trying to modify the depth of the regions in the tree to better couple depth and scale and, therefore, put (almost) all objects (and their parts) at the same level (or scale). To do that, they first train a regressor to predict the scale of regions from mid-level features. Then, they create a set of regions that better balances over- and under-segmentation, named the anchor slice. Finally, the original hierarchy is realigned using the anchor slice, i.e., the hierarchy is adjusted so that every region in the anchor slice lies at the same level (or scale) – see Fig. 3.

Fig. 2. Example of a result obtained by a hierarchical image segmentation method: (left) original image; and (right) Ultrametric Contour Map (UCM).

In this work, we explore the use of regression to predict the best scale value for a given region, which is then used to realign the entire hierarchy. In our assessment, we use two different hierarchical image segmentation methods: gPb-owt-ucm [2] and hGB [10]. The main contributions of this work are: (i) an analysis of the impact of a learning strategy for predicting scale values on distinct hierarchical methods; and (ii) an evaluation of different combinations of mid-level features to describe regions.

The paper is organized as follows. Section 2 describes the hierarchical segmentation methods used. In Sect. 3, we present the realignment approach for hierarchies. Section 4 presents experimental results. Finally, we draw some conclusions in Sect. 5.

Fig. 3. Example from [7] showing the realignment of a hierarchical image segmentation: (a) original hierarchy; and (b) realigned hierarchy.

2 Hierarchical Image Segmentation

There is a rich literature on hierarchical image segmentation; in the following, we describe only the two hierarchical methods used in this work: gPb-owt-ucm [2] and hGB [10].

Method gPb-owt-ucm. A widely used state-of-the-art hierarchical segmentation method proposed in [2]. Discriminative features are learned for local boundary detection, and spectral clustering is applied to the resulting boundary signal for globalization. Afterwards, a hierarchical segmentation is built by exploiting the information in the contour signal. The authors of [2] proposed a variant of the watershed transform, named the Oriented Watershed Transform (OWT), to produce a set of initial regions from the contour detection output. Then, a UCM is generated from the boundaries of these initial regions.

Method hGB. According to [10], a hierarchical segmentation should maintain spatial and neighborhood information between segments, even when the scale changes. Thanks to that, hierarchical observation scales can be computed for any graph, in which adjacent regions are evaluated according to the order in which they are merged in the fusion tree. The core of hGB [10] is the identification of the smallest scale value at which the largest region can be merged with another one while guaranteeing that the internal differences of the merged regions are greater than the values computed for smaller scales.

Starting with simple regions representing single image pixels, hGB is able to produce a hierarchy of partitions for the entire image. It has been successfully applied not only to image segmentation [10], but also to several other tasks, such as: video segmentation [16]; video summarization [3]; and video cosegmentation [15].

Fig. 4. Overview of the realignment approach.

3 Realignment Approach

The realignment approach used in this work is illustrated in Fig. 4. First, a set of training images (Fig. 4a) is used to produce the corresponding set of hierarchies (Fig. 4b) using gPb-owt-ucm [2] or hGB [10]. Then, all regions belonging to the training hierarchies are described with a set of features (see Sect. 3.1) and have their \(S_i\) scores (Eq. 1) calculated (Fig. 4c). These data are used to train a regression method (Fig. 4d). During testing, for each test image (Fig. 4e), the corresponding hierarchy is produced (Fig. 4f), and features are computed for every region (Fig. 4g). These features are fed to the trained regressor to predict the best scale value (in terms of the \(S_i\) score) for each region (Fig. 4h). Finally, the predicted scores/scales are used to realign (see Sect. 3.3) the original hierarchy (Fig. 4i), which is then used to produce a final segmentation (Fig. 4j).

3.1 Feature Extraction

The mid-level features were extracted from all regions of each hierarchy. Similar to [7], the chosen features were the following:

  • Graph partition properties: cut, ratio cut, normalized cut, unbalanced normalized cut;

  • Region properties: area, perimeter, bounding box size, major and minor axis lengths of the equivalent ellipse, eccentricity, orientation, convex area, Euler number;

  • Gestalt properties: inter- and intra-region texton similarity, inter- and intra-region brightness similarity, inter- and intra-region contour energy, curvilinear continuity, convexity.

We have also explored features that encode color properties, such as the color mean and the color histogram. Color-related features are calculated per channel in the RGB color space, and histograms are generated with 4 bins per channel. More details about these features can be found in [4].
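The sketch below illustrates, under our own simplifying assumptions (an RGB image with values in [0, 255] and a boolean region mask), how the color mean and the 4-bin per-channel histograms can be computed for one region; the names `color_features`, `image`, and `region_mask` are ours, not from [4] or [7].

```python
import numpy as np


def color_features(image, region_mask, bins=4):
    """Color mean and per-channel histogram (4 bins) for one region.

    `image` is an H x W x 3 RGB array in [0, 255]; `region_mask` is a
    boolean H x W array selecting the region's pixels.
    """
    pixels = image[region_mask].astype(np.float64)        # N x 3 region pixels
    mean = pixels.mean(axis=0)                             # one mean per channel
    hists = [np.histogram(pixels[:, c], bins=bins, range=(0, 255), density=True)[0]
             for c in range(3)]                             # 4-bin histogram per channel
    return np.concatenate([mean, *hists])                   # 3 + 3*bins values
```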

3.2 Training and Predicting

A regression forest (with 100 trees) was trained using all regions \(R_i\) from each hierarchy. For each region \(R_i\), its corresponding ground-truth region \(G_i\) is identified and used to calculate its \(S_i\) score using Eq. 1.

$$\begin{aligned} S_i = \frac{|\,G_i\,|-|\,R_i\,|}{\max \left( |\,G_i\,|,|\,R_i\,|\right) } \end{aligned}$$
(1)

in which \(|\,R_i\,|\) and \(|\,G_i\,|\) denote the sizes of region \(R_i\) and of its corresponding ground-truth region \(G_i\), respectively. As in [7], the most-overlapping human-annotated segment is taken as the corresponding ground truth. A negative \(S_i\) score indicates that region \(R_i\) is under-segmented, a positive value indicates over-segmentation, and 0 indicates a properly segmented region. To describe the region properties, all features described before were extracted. In this step, regions whose area is smaller than 50 pixels were excluded.
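A minimal sketch of this step is given below, assuming the region and the human annotation are available as a boolean mask and an integer label map over the same pixel grid; the helper name `scale_score` is ours.

```python
import numpy as np


def scale_score(region_mask, gt_labels):
    """S_i score (Eq. 1) of one region against a human annotation.

    The most-overlapping ground-truth segment is taken as G_i; sizes are
    pixel counts. Negative = under-segmented, positive = over-segmented,
    0 = properly segmented.
    """
    # pick the ground-truth label with the largest overlap with the region
    labels, counts = np.unique(gt_labels[region_mask], return_counts=True)
    g_label = labels[np.argmax(counts)]
    g_size = int((gt_labels == g_label).sum())
    r_size = int(region_mask.sum())
    return (g_size - r_size) / max(g_size, r_size)
```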

After training, the regressor is used to make predictions. For that, the same set of features is extracted from all regions belonging to each hierarchy generated for the test subset and used to predict the best scale value.
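A possible realization of this step with scikit-learn is sketched below. The 100-tree forest matches the text, but the remaining hyper-parameters (e.g., the fixed random seed) are our own choices, and `features`/`scores` are assumed to be the per-region feature matrix and \(S_i\) scores built as described above.

```python
from sklearn.ensemble import RandomForestRegressor


def train_scale_regressor(features, scores):
    """Fit a 100-tree regression forest mapping region features to S_i scores."""
    regressor = RandomForestRegressor(n_estimators=100, random_state=0)
    regressor.fit(features, scores)
    return regressor


# At test time, one predicted scale score per region of the test hierarchy:
# predicted_scores = train_scale_regressor(X_train, y_train).predict(X_test)
```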

3.3 Alignment of Hierarchical Image Segmentation

Following [7], each node of a hierarchy should be labeled as −1, 0, or +1, indicating under-, properly- and over-segmented, respectively. This can be done by solving a problem that penalizes two cases: (i) segments labeled as under-segmented that have positive scores; and (ii) segments labeled as over-segmented that have negative scores. The resulting problem can be solved via dynamic programming over the tree. The anchor slice consists of the regions labeled as 0.
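The sketch below shows one plausible formulation of this dynamic program, under our own assumptions: the hierarchy is a tree whose nodes carry predicted scores, labels may only increase from the root (under-segmented) towards the leaves (over-segmented), exactly one node per root-to-leaf path is labeled 0, and the only penalties are the two cases listed above. The exact cost function of [7] may differ; `Node`, `anchor_slice`, and `subtree_over_cost` are hypothetical names.

```python
class Node:
    """Hypothetical hierarchy node: a predicted S_i score and child nodes."""

    def __init__(self, score, children=()):
        self.score = score            # predicted scale score of the region
        self.children = list(children)


def subtree_over_cost(node):
    """Cost of labeling every strict descendant of `node` as over-segmented (+1):
    each descendant with a negative score pays |score|."""
    return sum(max(0.0, -c.score) + subtree_over_cost(c) for c in node.children)


def anchor_slice(node):
    """Return (cost, anchor_regions) for the subtree rooted at `node`.

    Along each root-to-leaf path exactly one node is labeled 0 (the anchor);
    nodes above it are labeled -1 (penalized when their score is positive),
    nodes below it +1 (penalized when their score is negative). The recursion
    picks, for every subtree, the cheaper of "stop here" and "recurse below".
    """
    stop_cost = subtree_over_cost(node)            # place the anchor at this node
    if not node.children:
        return 0.0, [node]                          # a leaf must be its own anchor
    child_results = [anchor_slice(c) for c in node.children]
    recurse_cost = max(0.0, node.score) + sum(c for c, _ in child_results)
    if stop_cost <= recurse_cost:
        return stop_cost, [node]
    return recurse_cost, [r for _, regions in child_results for r in regions]
```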

After that, a local linear transform (the same one used in [7]) is applied to the UCM corresponding to the hierarchy, so that the anchor slice is aligned to the scale value 0.5 (for convenience and later use).
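As a rough illustration, the sketch below applies a global (not local, as in [7]) piecewise-linear rescaling of UCM values so that a given anchor scale is mapped to 0.5; `anchor_scale` is assumed to be the UCM value of the anchor slice.

```python
import numpy as np


def align_ucm(ucm, anchor_scale):
    """Piecewise-linear rescaling of UCM values putting the anchor slice at 0.5.

    Values below `anchor_scale` are mapped linearly onto [0, 0.5] and values
    above it onto [0.5, 1]. This is a global simplification of the local
    transform used in [7].
    """
    ucm = np.asarray(ucm, dtype=np.float64)
    low = ucm <= anchor_scale
    out = np.empty_like(ucm)
    out[low] = 0.5 * ucm[low] / max(anchor_scale, 1e-12)
    out[~low] = 0.5 + 0.5 * (ucm[~low] - anchor_scale) / max(1.0 - anchor_scale, 1e-12)
    return out
```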

4 Experimental Results

In order to evaluate the realignment of hierarchies generated by gPb-owt-ucm [2] and hGB [10], we use the BSDS500 dataset [2], which includes 500 images (200 for training, 100 for validation, and 200 for testing). As segmentation evaluation measures, we adopted: (i) Segmentation Covering (SC); (ii) Probabilistic Rand Index (PRI); (iii) Variation of Information (VI); and (iv) boundary F-measure (\(F_b\)); all four computed at Optimal Dataset Scale (ODS) and Optimal Image Scale (OIS) – see [14] for a review of these measures and scales. Note that for all measures a larger value is better, except for VI, for which lower is better.
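For reference, the sketch below computes the Segmentation Covering of a ground-truth label map by a machine segmentation, following the standard definition (best Jaccard overlap per ground-truth region, weighted by region size); the BSDS tooling additionally averages over multiple human annotations, which this sketch omits.

```python
import numpy as np


def segmentation_covering(seg, gt):
    """Covering of the ground-truth partition `gt` by the partition `seg`.

    Both inputs are integer label maps of the same shape. Each ground-truth
    region is matched to its best-overlapping (Jaccard) region in `seg`, and
    the overlaps are averaged weighted by region size.
    """
    n_pixels = gt.size
    covering = 0.0
    for g_label in np.unique(gt):
        g_mask = gt == g_label
        best = 0.0
        for s_label in np.unique(seg[g_mask]):     # only regions touching g_mask
            s_mask = seg == s_label
            inter = np.logical_and(g_mask, s_mask).sum()
            union = np.logical_or(g_mask, s_mask).sum()
            best = max(best, inter / union)
        covering += g_mask.sum() * best
    return covering / n_pixels
```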

Average results obtained for regression with a random forest are shown in Table 1. In that table, ‘c’ stands for color-based features (such as color mean and color histogram), ‘s’ represents region shape features (such as area, perimeter, etc.), ‘gr’ stands for graph features (such as cut, ratio cut, etc.), and ‘ge’ represents gestalt features (such as texton similarities, brightness similarities, etc.). For gPb-owt-ucm, the realignment exhibits an improvement in the average VI score when a combination of color, graph and shape features is used, while for hGB there is no difference in any metric (on average). A closer look at some specific final results may help us better understand these numbers.

Table 1. Average results for regression with random forest.
Fig. 5. Examples of segmentation results before and after realignment, using gPb-owt-ucm to compute the hierarchy.

In Figs. 5 and 6, we illustrate some examples in which the realignment of the original hierarchies produces quite interesting results when the scale is set to 0.5 (which corresponds to the anchor slice). For gPb-owt-ucm, Fig. 5(a) shows the results without realignment, while Fig. 5(b) illustrates the realigned results. One can easily see the improvements in the SC, PRI, and VI measures. However, this has some negative impact on the \(F_b\) scores obtained by gPb-owt-ucm, especially for the second example – see Fig. 5(b).

Similarly, for hGB, Fig. 6(a) shows the results without realignment, while Fig. 6(b) illustrates the realigned results. Again, it is easy to verify the improvements in the SC, PRI, and VI measures. The main difference is that the \(F_b\) scores obtained by hGB are not affected in this case. This could explain the decrease in the average \(F_b\) scores for the realigned gPb-owt-ucm results shown in Table 1.

Fig. 6. Examples of segmentation results before and after realignment, using hGB to compute the hierarchy.

5 Conclusion

In this work, we explored the use of regression to predict the best scale value for a given region, which is then used to realign the entire hierarchy.

Experimental results were presented for two different hierarchical segmentation methods, along with an analysis of the use of different combinations of mid-level features to describe regions.

For gPb-owt-ucm, the realignment exhibits an improvement in the average VI score when a combination of color, graph and shape features is used, while for hGB there is no difference in any metric (on average). A closer look at some specific final results indicates that the realignment of hierarchies generated by gPb-owt-ucm has some negative impact on \(F_b\) scores, while this is not observed for hierarchies produced by hGB.

To improve and better understand our results, future work includes training different regression approaches, adopting other segmentation methods, and applying our proposal to other datasets.