1 Introduction

Image understanding and analysis is one of the central tasks in image processing. The imaging process is influenced by many factors, such as the characteristics of the object, the shooting environment, and the condition of the acquisition equipment. Image processing must therefore cope with many sources of interference, such as shadows, discontinuities of color, and variation in the target's pose. These interference factors pose great challenges to image processing algorithms and substantially degrade the performance of existing image analysis algorithms in complex environments. How to improve the robustness of image analysis algorithms in complex environments has therefore become a hot research topic in recent years.

The appearance of an image depends on many features, such as the illumination, the shape of the surface, and the reflectance of each surface. Each of these features contains useful information about the objects in the image. Extracting these features from an image can effectively eliminate environmental effects and make image understanding more accurate. In 1978, Barrow and Tenenbaum [4] called these feature images "intrinsic images". The goal of intrinsic image decomposition is to separate an image into two layers, i.e., a reflectance image and a shading image, which multiply to form the original image. The reflectance image contains the intrinsic color, or albedo, of surface points, independent of the illumination environment. The shading image consists of various lighting effects, including shadows and specular highlights in addition to shading. The intrinsic image decomposition is expressed as

$$\begin{aligned} I(x) = R(x) \times S(x) \end{aligned}$$
(1)

where I(x) is the observed intensity at pixel x, R(x) is the reflectance, and S(x) is the shading. This decomposition is important in both computer vision and computer graphics applications. First, the intrinsic decomposition facilitates advanced image editing in graphics applications such as re-texturing, re-coloring, and re-lighting. Second, the extracted intrinsic images benefit many computer vision algorithms. Shading images are the preferred input to algorithms such as shape from shading, while reflectance images can be used for tasks such as segmentation and image white balance. Furthermore, most vision algorithms, from low-level image analysis to high-level object recognition, implicitly assume that their input images are reflectance images.
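
Because Eq. (1) is multiplicative, most algorithms discussed below operate in the log domain, where the model becomes additive and the two unknown layers enter linearly:

$$\begin{aligned} \log I(x) = \log R(x) + \log S(x) \end{aligned}$$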

Humans began grappling with the underconstrained problem of perceiving shape, reflectance, and illumination from a single image a thousand years ago. Alhazen, the famous optical scientist, noted that "Nothing of what is visible, apart from light and color, can be perceived by pure sensation, but only by discernment, inference, and recognition, in addition to sensation" [3]. When humans view a flat surface with patches of varying reflectance under spatially varying illumination, they form a reasonably veridical percept of the reflectance, despite the fact that a darker patch under brighter illumination may send more light to the eyes than a lighter patch that is less well illuminated. Land and McCann [21] proposed the Retinex theory in 1971, which provided a computational approach to the problem in the "Mondrian World". Retinex theory was later made practical by Horn [12], who proposed decomposing an image into its shading and reflectance components using the prior belief that sharp edges tend to be reflectance, while smooth variation tends to be shading. Since then, researchers have proposed various algorithms for intrinsic image decomposition, but the problem remains a largely unsolved challenge in computer vision. Estimating two intrinsic components from a single input image is a fundamentally ill-posed problem: given an input image composed of its reflectance and shading components, the number of unknowns is twice the number of equations. To solve this problem, further constraints are needed, and various approaches have been employed for intrinsic image decomposition.
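
Horn's formulation admits a compact sketch: threshold the log-image gradients, keep only the strong ones as reflectance gradients, and reintegrate by solving a Poisson equation. The Python sketch below illustrates this classic pipeline; the threshold value and the FFT-based periodic-boundary solver are simplifying assumptions for illustration, not details of any published implementation.

```python
import numpy as np

def poisson_reintegrate(gx, gy):
    """Least-squares reintegration of a gradient field via FFT
    (assumes periodic boundaries; adequate for an illustration)."""
    h, w = gx.shape
    # Divergence via backward differences (adjoint of the forward gradient)
    div = (gx - np.roll(gx, 1, axis=1)) + (gy - np.roll(gy, 1, axis=0))
    fy = np.fft.fftfreq(h).reshape(-1, 1)
    fx = np.fft.fftfreq(w).reshape(1, -1)
    denom = 2.0 * np.cos(2 * np.pi * fx) + 2.0 * np.cos(2 * np.pi * fy) - 4.0
    denom[0, 0] = 1.0                      # the DC term is unconstrained
    return np.fft.ifft2(np.fft.fft2(div) / denom).real

def retinex_decompose(image, threshold=0.1):
    """Attribute large log-domain gradients to reflectance and smooth
    variation to shading (Horn's prior), then reintegrate."""
    log_i = np.log(np.maximum(image, 1e-6))
    gx = np.diff(log_i, axis=1, append=log_i[:, -1:])
    gy = np.diff(log_i, axis=0, append=log_i[-1:, :])
    mask = (np.abs(gx) > threshold) | (np.abs(gy) > threshold)
    log_r = poisson_reintegrate(gx * mask, gy * mask)
    log_r += log_i.mean() - log_r.mean()   # resolve the global scale ambiguity
    reflectance = np.exp(log_r)
    shading = image / np.maximum(reflectance, 1e-6)
    return reflectance, shading
```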

This paper proceeds as follows. In Sect. 2, we introduce the intrinsic image datasets. In Sect. 3, we survey intrinsic image decomposition algorithms from two perspectives: reflectance priors and shape priors. In Sect. 4, we analyze the performance, advantages, and disadvantages of these algorithms on the MIT and IIW intrinsic image datasets. In Sect. 5 we conclude.

2 Intrinsic Image Dataset

Intrinsic image decomposition is a longstanding problem with many applications in computer vision and computer graphics. The goal is to separate an image into a reflectance image and a shading image. The idea was proposed in 1978 [4]; however, due to limited computing power, immature algorithmic theory, and the lack of datasets, progress on intrinsic image decomposition was slow for a long time. There has been significant recent progress on the problem, aided by the release of the MIT Intrinsic Images dataset [10], which contains carefully constructed ground truth for images of objects. Bell et al. [5] proposed Intrinsic Images in the Wild, a large-scale public dataset for evaluating intrinsic image decompositions of indoor scenes. Laffont et al. [18, 20] proposed the first synthetic dataset depicting a scene with complex geometry under multiple physically-based lighting conditions for each viewpoint, with ground-truth reflectance and shading images. Jiang et al. [14] proposed BOLD (Birmingham Object Lighting Dataset), which contains 10 surfaces photographed under 33 lighting conditions.

2.1 MIT Intrinsic Image Dataset

In the MIT Intrinsic Image Dataset, Grosse et al. [10] focus on one particular case: they decompose an image into three components, namely shading, reflectance, and a specular term. This decomposition is expressed as

$$\begin{aligned} I(x) = R(x) \times S(x) + C(x) \end{aligned}$$
(2)

where I(x) is the observed intensity at pixel x, R(x) is the reflectance, S(x) is the shading, and C(x) is the specular component.

Fig. 1. The MIT ground-truth intrinsic images. From left to right: original image, diffuse image, shading image, reflectance image, specular image.

The MIT Intrinsic Image Dataset's contribution is a set of images of real objects decomposed into Lambertian shading, reflectance, and specularities, as shown in Fig. 1. First, Grosse et al. separated the diffuse and specular components using a cross-polarization approach in which a polarizing filter is placed over both the light and the camera. Second, they developed two different methods for separating the diffuse component into shading and reflectance. Third, they captured diffuse images with ten more light positions using a handheld lamp with a polarizing filter. The dataset contains 20 sets of images, each taken of a single object and including a diffuse image, a specular image, a reflectance image, a shading image, and images under 10 lighting conditions. The diffuse image corresponds to a Lambertian surface and can be further decomposed into a shading image and a reflectance image. The specular image accounts for light rays that reflect directly off the surface, creating visible highlights in the image. Quantitatively comparing algorithms requires choosing a meaningful error metric. Grosse et al. defined the LMSE (local mean squared error) instead of the MSE (mean squared error), because the MSE is too strict for most algorithms on the MIT Intrinsic Image Dataset. They define the scale-invariant MSE for a true vector \(x\) and an estimate \(\hat{x}\):

$$\begin{aligned} MSE(x,\hat{x}) = {\left\| {x - \hat{a}\hat{x}} \right\| ^2} \end{aligned}$$
(3)
$$\begin{aligned} \hat{a} = \arg {\min _a}{\left\| {x - a\hat{x}} \right\| ^2} \end{aligned}$$
(4)

Given the true and estimated shading images \(S\) and \(\hat{S}\), LMSE is defined as the scale-invariant MSE summed over all local windows of size \(k\times k\), spaced in steps of \(k/2\):

$$\begin{aligned} LMS{E_k}\left( {S,\hat{S}} \right) = \sum \nolimits _{\omega \in W} {MSE\left( {{S_\omega },{{\hat{S}}_\omega }} \right) } \end{aligned}$$
(5)
$$\begin{aligned} LMSE = \frac{1}{2}\frac{{LMS{E_k}\left( {S,\hat{S}} \right) }}{{LMS{E_k}\left( {S,0} \right) }} + \frac{1}{2}\frac{{LMS{E_k}\left( {R,\hat{R}} \right) }}{{LMS{E_k}\left( {R,0} \right) }} \end{aligned}$$
(6)
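
For concreteness, Eqs. (3)-(6) can be sketched in a few lines of Python. The optimal scale in Eq. (4) has the closed form \(\hat{a} = x^{\top }\hat{x} / \hat{x}^{\top }\hat{x}\); the default window size \(k=20\) below is an assumption for illustration rather than a value fixed by the equations.

```python
import numpy as np

def si_mse(x, x_hat):
    """Eqs. (3)-(4): MSE after the optimal scalar fit a = <x,x_hat>/<x_hat,x_hat>."""
    x, x_hat = x.ravel(), x_hat.ravel()
    a = (x @ x_hat) / max(x_hat @ x_hat, 1e-12)
    return np.sum((x - a * x_hat) ** 2)

def lmse_k(true_img, est_img, k=20):
    """Eq. (5): scale-invariant MSE summed over k x k windows, step k/2."""
    h, w = true_img.shape[:2]
    total = 0.0
    for i in range(0, max(h - k, 0) + 1, k // 2):
        for j in range(0, max(w - k, 0) + 1, k // 2):
            total += si_mse(true_img[i:i + k, j:j + k],
                            est_img[i:i + k, j:j + k])
    return total

def lmse(S, S_hat, R, R_hat, k=20):
    """Eq. (6): each term is normalized by LMSE_k(truth, 0), the energy
    of the ground-truth layer, so shading and reflectance contribute equally."""
    return (0.5 * lmse_k(S, S_hat, k) / lmse_k(S, np.zeros_like(S), k)
            + 0.5 * lmse_k(R, R_hat, k) / lmse_k(R, np.zeros_like(R), k))
```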

Barron and Malik [1,2,3] created the MIT-Berkeley Intrinsic Images dataset, an augmented version of the MIT Intrinsic Images dataset. In the MIT-Berkeley dataset, they used photometric stereo on the additional images of each object to estimate the shape of each object and the spherical harmonic illumination for each image. For the MIT-Berkeley dataset and the SIRFS algorithm, Barron and Malik [1] defined six error metrics, one for each intrinsic scene property: Z-MAE is the shift-invariant absolute error between the estimated and ground-truth shape. N-MAE is the mean error between the estimated and ground-truth normal fields, in radians. S-MSE and R-MSE are the scale-invariant mean squared errors of the recovered shading and reflectance, respectively. RS-MSE measures a locally scale-invariant error for both reflectance and shading. L-MSE is the scale-invariant MSE of a rendering of the recovered illumination on a sphere, relative to a rendering of the ground-truth illumination.

2.2 Intrinsic Images in the Wild (IIW) Dataset

Although the MIT dataset was the first public dataset with ground-truth data, the limited scalability of its collection method means it contains only 20 different objects, while real-world scenes contain a rich range of shapes and materials lit by complex illumination. Bell et al. [5] presented a new, large-scale dataset, Intrinsic Images in the Wild (IIW): real-world photos of indoor scenes with crowdsourced annotations of reflectance comparisons between points in a scene. Instead of creating per-pixel annotations, they designed a scalable approach to human annotation in which humans reason about the relative reflectance of pairs of pixels in each image. There are 4416 query points and 10645 query pairs between these points per image, over a total of 5230 images. As shown in Fig. 2, each query to a human subject took the form "which point has a darker surface color?", with three possible answers: Point 1, Point 2, or About the same. Users were also asked to specify their confidence in each assessment as Guessing, Probably, or Definitely. On AMT (Amazon Mechanical Turk), they obtained 4,880,372 responses from 1381 workers, which were aggregated into 875,833 comparisons across 5,230 photos. The IIW dataset contains over 5000 images featuring a wide variety of scenes and has been annotated with millions of individual reflectance comparisons, making it several orders of magnitude larger than previous intrinsic image datasets.

Fig. 2. Human judgement for an example scene.

To use these judgements to evaluate intrinsic image decompositions, Bell et al. proposed a new metric, the "weighted human disagreement rate" (WHDR), which measures the percentage of human judgements that an algorithm disagrees with, weighted by the confidence of each judgement:

$$\begin{aligned} WHD{R_\delta }\left( {J,R} \right) = \frac{{\sum \nolimits _i {{\omega _i}1} \left( {{J_i} \ne {{\hat{J}}_{i,\delta }}\left( R \right) } \right) }}{{\sum \nolimits _i {{\omega _i}} }} \end{aligned}$$
(7)

where \(J_i\) is the human judgement for pair i, \(\omega_i\) is its confidence weight, R is the reflectance layer output by the algorithm, \({{\hat{J}}_{i,\delta }}\) is the judgement predicted by the algorithm being evaluated, and \(\delta \) is the relative difference between two surface reflectances at which people begin to switch from saying "they are about the same" (E) to "one point is darker" (1 or 2).
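
As a concrete illustration, the sketch below computes WHDR from an algorithm's reflectance layer. The data layout (parallel lists of labels, confidence weights, and pixel-coordinate pairs) and the relative-difference test are simplifying assumptions; the released IIW benchmark defines its own file format and evaluation code.

```python
import numpy as np

def predicted_judgement(r1, r2, delta=0.1):
    """Predict E / 1 / 2 from the reflectance at two points: if the
    relative difference is below delta the pair counts as 'about the
    same' (E), otherwise the darker point is named."""
    if max(r1, r2) / max(min(r1, r2), 1e-12) < 1.0 + delta:
        return "E"
    return "1" if r1 < r2 else "2"

def whdr(judgements, weights, pairs, reflectance, delta=0.1):
    """Eq. (7): confidence-weighted disagreement rate.
    judgements:  human labels ("E", "1", "2"), one per pair
    weights:     confidence weight per pair
    pairs:       (x1, y1, x2, y2) pixel coordinates per pair
    reflectance: 2-D array, the algorithm's reflectance layer"""
    err, total = 0.0, 0.0
    for j, w, (x1, y1, x2, y2) in zip(judgements, weights, pairs):
        pred = predicted_judgement(reflectance[y1, x1],
                                   reflectance[y2, x2], delta)
        err += w * (pred != j)
        total += w
    return err / max(total, 1e-12)
```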

2.3 MPI Sintel Dataset

Butler et al. [6] presented the MPI Sintel Dataset, shown in Fig. 3: a set of complex computer-generated images found to have statistics similar to natural images. The MPI Sintel Dataset was not intended for evaluating intrinsic image algorithms, but some papers use it for lack of a readily apparent alternative that reproduces many of the challenges of real-world scenes, such as complex object shapes, occlusion, and complex lighting, while being accompanied by the requisite ground-truth data. It consists of 890 images from 18 scenes with 50 frames each. The dataset contains final images, clean images, albedo images, and depth images. The ground-truth shading images were created by rendering all scenes with a uniform grey albedo on every object. Chen and Koltun [7] and Narihira et al. [23] used the clean images as input, which, compared to the final images, have infinite depth of field, no motion blur, and no atmospheric effects such as fog. Fan et al. [8] regenerated the clean images by multiplying the albedo and shading images to serve as ground truth.

Fig. 3. The MPI Sintel Dataset.

3 Existing Algorithms

In complex environments, one of the key technologies for robust image processing systems is extracting the intrinsic features of target objects. Barrow and Tenenbaum [4] defined the intrinsic image decomposition problem, and Land and McCann's [21] Retinex theory gave an early solution. Since then, many researchers have worked on the problem. Estimating two intrinsic components from a single input image is fundamentally ill-posed: given an input image composed of its reflectance and shading components, the number of unknowns is twice the number of equations. To solve this problem, we can use prior knowledge and constraints on the target object to estimate the reflectance and shading images separately. We can also learn multi-modal patterns of target objects with pattern recognition algorithms and then use the learned models to estimate the intrinsic images.

3.1 Image Sequence and Multiple Views

Weiss [30] focused on a slightly easier problem: given a sequence of images in which the reflectance is constant and only the illumination changes, recover the shading images and a single reflectance image. Following work on the statistics of natural images, he used a prior assuming that illumination images give rise to sparse filter outputs, which leads to a simple, novel algorithm for recovering reflectance images. Laffont et al. [19] proposed a method to estimate intrinsic images from multiple views of an outdoor scene; in addition to reflectance, the method also generates separate images for sun, sky, and indirect illumination.
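
Under that sparse (Laplacian) prior on illumination derivatives, Weiss's maximum-likelihood estimate has a strikingly simple form: the reflectance derivative at each pixel is the temporal median of the log-image derivatives. Below is a minimal sketch, reusing the `poisson_reintegrate` helper from the Retinex sketch in the introduction:

```python
import numpy as np

def weiss_reflectance(log_images):
    """log_images: (T, H, W) stack of log-intensity frames of a static
    scene under changing illumination. The ML reflectance gradient under
    a sparse illumination prior is the per-pixel temporal median of the
    image gradients; reintegration then recovers log-reflectance."""
    gx = np.diff(log_images, axis=2, append=log_images[:, :, -1:])
    gy = np.diff(log_images, axis=1, append=log_images[:, -1:, :])
    return poisson_reintegrate(np.median(gx, axis=0), np.median(gy, axis=0))
```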

3.2 Priors and Constraints Based on Reflectance

The reflectance represents how the material of an object reflects light, independent of viewpoint and illumination; it is related to the material, color, texture, and other properties of the object. Based on the assumption that hue is illumination-invariant, Pan et al. [24] and Shi et al. [28] proposed that hue can be used to separate illumination information from reflectance features. However, the illumination invariance of hue holds only under weak illumination variation; under strong variation, hue is still affected by illumination changes. To address this instability, Tappen et al. [29] trained a classifier for hue variation, separating illumination changes from reflectance features. Finlayson et al. [9] used a hue-calibrated camera to capture images and derived a new hue space with illumination invariance. Shen et al. [26], Serra et al. [25], and Kang et al. [15] used local continuity constraints on the hue of an object's surface to estimate its reflectance. These algorithms mainly rely on priors and constraints on object reflectance to estimate the reflectance image, and they apply to single objects; in complex scenes, occlusion, shadows, and variation of the target's pose degrade their performance (a simplified sketch of this chromaticity-continuity cue appears at the end of this subsection).

In recent years, deep learning has been widely used in signal processing and has achieved remarkable results. The publication of the IIW (Intrinsic Images in the Wild) dataset, with its large amount of ground-truth data, made it possible to apply deep convolutional neural networks to intrinsic image decomposition. Narihira et al. used complex deep features with a simple local classification rule for lightness prediction in natural images. Zhou et al. [32] proposed a data-driven approach for intrinsic image decomposition, training a model to predict the relative reflectance ordering between image patches from large-scale human annotations. Zoran et al. [33] proposed a framework that infers mid-level visual properties of an image by learning about ordinal relationships, instead of estimating metric quantities directly.

Humans can distinguish illumination changes even when observing a grayscale image [17]. In fact, besides hue, texture also has a certain illumination invariance. Shen and Yeo [27] used the local continuity of texture to separate the original image into shading and reflectance images: the reflectance of pixels within the same texture class is gradually adjusted toward consistency, thereby separating out the reflectance features. However, the computational complexity of this algorithm is high.
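
To make the hue-continuity idea concrete: if shading scales all color channels approximately equally, the normalized color (chromaticity) of a pixel is insensitive to it, so pixels with similar chromaticity are candidates for equal reflectance. The sketch below is a simplified illustration of this cue, not the method of any single cited paper; the threshold `tau` and the pair format are assumptions.

```python
import numpy as np

def chromaticity(image_rgb):
    """Per-pixel chromaticity, a common illumination-insensitive cue.
    If I = R * S with a scalar shading S per pixel, then I / sum(I)
    cancels S and depends only on the reflectance color."""
    s = image_rgb.sum(axis=2, keepdims=True)
    return image_rgb / np.maximum(s, 1e-6)

def same_reflectance_pairs(image_rgb, pairs, tau=0.02):
    """Flag pixel pairs whose chromaticity distance is below tau as
    'probably same reflectance', the kind of local continuity
    constraint used by methods such as [15, 25, 26]."""
    c = chromaticity(image_rgb)
    out = []
    for (x1, y1, x2, y2) in pairs:
        d = np.linalg.norm(c[y1, x1] - c[y2, x2])
        out.append(d < tau)
    return out
```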

3.3 Priors and Constraints Based on Shape and Illumination

The Retinex theory assumes that illumination varies continuously and that strong gradient changes in the image are caused by reflectance. Illumination information and reflectance can therefore be separated by thresholding the image gradients. However, the assumption that illumination change is low-frequency is very rough: when there is occlusion or object deformation, the illumination changes abruptly, producing high-frequency illumination variation. We therefore need to exploit object shape characteristics as further constraints. Building on the Retinex theory, shape priors are added as additional constraints for decomposing intrinsic images. Barron and Malik [1, 3] presented three priors on shape: (1) a crude prior on flatness, to address the bas-relief ambiguity; (2) a prior on the orientation of the surface normal near the occluding contour; and (3) a prior on smoothness in world coordinates, based on the variation of mean curvature. Barron and Malik [3] used these three shape priors and two reflectance priors, given known illumination, to decompose intrinsic images. This leads to impressive results on the MIT Intrinsic image dataset, but the method is limited to single masked objects in a scene, and problems with complex illumination remain, so many images cannot satisfy the method's requirements.

In recent years, with the popularity of depth sensors, depth information has also been used as a shape prior for extracting intrinsic features. Barron and Malik [2], Chen and Koltun [7], Lee et al. [22], and Jeon et al. [13] all used depth information as a prior to extract intrinsic images. Due to the constraints of Kinect-style devices, depth information often cannot be obtained in natural scenes, which limits the application of algorithms based on depth priors. Barron and Malik [2] estimated depth maps with a fully convolutional network and then jointly optimized the intrinsic factorization to reconstruct the input image. Kim et al. [16] presented a method for jointly predicting a depth map and intrinsic images from a single input image. The model, called JCNF (joint convolutional neural field), combines a conditional random field (CRF) with a convolutional neural network (CNN). Its architecture differs from previous CNNs in several ways. One is the sharing of convolutional activations and layers between the networks for each task, which allows each network to account for inferences made by the others. Another is that learning is performed in the gradient domain, where the correlations between depth and intrinsic images are stronger than in the image value domain.
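
The spherical harmonic illumination used by these shape-based methods has a compact forward model: shading is a linear function of a low-order polynomial basis in the surface normal. Below is a sketch under the assumption of a grayscale, second-order model; the basis is unnormalized, which suffices for illustration.

```python
import numpy as np

def sh_shading(normals, L):
    """Shading from a 2nd-order spherical-harmonic illumination model,
    the kind of forward model used in SIRFS-style methods.
    normals: (H, W, 3) unit surface normals; L: 9 SH coefficients."""
    nx, ny, nz = normals[..., 0], normals[..., 1], normals[..., 2]
    basis = np.stack([np.ones_like(nx),            # constant term
                      ny, nz, nx,                  # linear terms
                      nx * ny, ny * nz,            # quadratic terms
                      3.0 * nz ** 2 - 1.0,
                      nx * nz, nx ** 2 - ny ** 2], axis=-1)
    return basis @ L                               # (H, W) shading image

# With a reflectance image R and shape-derived normals, the forward
# model of Eq. (1) is then simply I = R * sh_shading(normals, L).
```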

4 Experiments and Results

4.1 Experiments on MIT Dataset

Quantitatively evaluating the accuracy of intrinsic image decomposition algorithms is challenging. Thankfully, the MIT Intrinsic Image dataset provides ground-truth shading and reflectance images for 20 objects (one object per image) and includes many additional images of each object under different illumination conditions. We selected several representative algorithms for experimental comparison. Some algorithms need depth maps as input, such as Barron's Scene-SIRFS [2] and Chen's algorithm [7]; however, the MIT dataset does not contain depth maps. Barron and Malik [1, 3] created the MIT-Berkeley Intrinsic Images dataset, an augmented version of the MIT Intrinsic Images dataset, in which they used photometric stereo on the additional images of each object to estimate the shape of each object and the spherical harmonic illumination for each image. The MIT and MIT-Berkeley datasets have the same objects and image categories, including each object's original image, diffuse image, shading image, reflectance image, specular image, and images under different illumination conditions. In addition, the MIT-Berkeley dataset contains each object's depth map (Table 1).

Table 1. Experimental results on the MIT dataset.

The other difficulty in evaluation is choosing a meaningful error metric. The mean squared error (MSE) is too strict for most algorithms on the MIT dataset: incorrectly classifying a single edge can ruin the relative shading and reflectance for different parts of the image, and this often dominates the error scores. We therefore use the local mean squared error (LMSE) instead of the MSE. Barron and Malik [1] presented six error metrics designed to capture different kinds of important errors for each intrinsic scene property: Z-MAE, N-MAE, S-MSE, R-MSE, RS-MSE, and L-MSE. Since these metrics are only applicable to Barron's algorithms, we do not use them here.

4.2 Experiments on IIW Dataset

In this section, we evaluate the performance of several algorithms [4, 5, 8, 23, 31] on the IIW dataset. These algorithms are based on deep convolutional neural networks. The IIW dataset provides 875,833 comparisons across 5,230 images; in each image, there are 4416 query points and 10645 query pairs. This large set of pairwise comparisons has been used to benchmark several reflectance models. The weighted human disagreement rate (WHDR) [5] measures the percentage of human judgements that an algorithm disagrees with (Table 2).

Table 2. Experimental results on the IIW dataset.

5 Conclusions

In this paper, we have introduced the intrinsic image datasets and decomposition algorithms. Existing traditional algorithms are based on the Retinex theory and use object features, such as priors on shape, hue, and texture, as further constraints to extract the shading and reflectance images from a single image. Simply using local object features as priors achieves good results but limits the applicability of the algorithms. For example, Barron's SIRFS algorithm uses priors on shape and albedo to recover shape, reflectance, and a spherical harmonic model of illumination. SIRFS is severely limited by its assumption that input images are segmented images of single objects illuminated under a single global model of illumination. Natural images, in contrast, contain many shapes which may occlude or support one another, as well as complicated, spatially-varying illumination in the form of shadows, attenuation, and inter-reflection; SIRFS cannot handle such images. Although Barron improved SIRFS by using a depth map as an additional prior for natural scene images, the need for this additional input makes the improved algorithm inconvenient to use. Traditional algorithms based on the Retinex theory thus have various limitations and disadvantages.

In recent years, deep learning has been widely used in signal processing and has achieved remarkable results, and some researchers have applied it to intrinsic image decomposition with similar success. The neural network obtained by deep learning extracts features from the images themselves, instead of relying on traditional fixed, hand-crafted features.