1 Introduction

In real-world applications, image quality may vary drastically depending on factors such as the capture sensor and the lighting conditions. These scenarios need to be taken into account when performing image classification, since shifts in quality directly influence its results.

Lately, deep convolutional neural networks have obtained outstanding results in image classification problems. Nonetheless, little has been done to understand the impact of image quality on the classification results of such networks. In most studies, networks are only tested on images whose quality is similar to that of the training set (i.e. similar noise/blur levels). The lack of research on this topic is not exclusive to deep learning applications: most image classification systems neglect preprocessing [12] and assume that image quality does not vary [5].

Given that image quality may vary in real-world applications, we evaluate the classification performance of classic deep convolutional neural networks when dealing with different types and levels of noise. In addition, we investigate whether denoising methods can help mitigate this problem.

1.1 Related Work

We devote our efforts to investigating the effects of noisy images when using deep convolutional neural networks in classification tasks. There are papers investigating the effect of label noise on the learning capability and performance of convolutional neural networks [2]; however, this problem is not addressed in this paper. Thus, in our experiments, we assume that all labels are correct.

Some studies have already identified that image quality can hinder classification performance in systems that employ neural networks [5] and in systems that use hand-crafted features [4, 9]. Recently, the development of noise-robust neural networks has been investigated. For instance, [6] presented a network architecture that can cope with some types of noisy images, while [13] designed a network that is capable of dealing with noise in speech recognition.

Dodge and Karam [5] showed that state-of-the-art deep neural networks are affected when classifying images of lower quality. In their experiments, each network was trained on images from the original dataset (with a negligible amount of noise due to the image formation process) and then used to classify images from the same dataset in their original state, degraded by noise, and affected by blur. Their results show that classification performance is hampered when classifying images of lower quality. However, their experiments do not cover the presence of low-quality images in the training set or their impact on the learned model.

Paranhos da Costa et al. [4] extended the methodology of [5] by considering that low-quality images can also appear in the training set. In their setup, several noisy versions of a dataset are created: each version has the same images as the original dataset, with all images affected by one type of noise at a fixed level. They also evaluated the effects of denoising techniques by studying the restoration of noisy images. Hand-crafted features (LBP and HOG) were extracted, SVM classifiers were trained with each version of the training set and then used to classify all versions of the test set. Even so, their study only considered these two hand-crafted features.

We believe that noise makes classification more difficult, since models trained with a particular noisy/restored training set version – and tested on images with the same noise configuration – usually perform worse than a model trained and tested on the original data. Our empirical evaluation is based on [4], but there are two main differences. First, our experiments target deep neural networks, which are able to learn from data, even low-quality data, whereas the previous study considered hand-crafted features and SVM classifiers. Second, we investigate whether training models with a specific noise or image restoration setup can help to build models that are more resilient to changes in image quality in future data, i.e., in the test set.

2 Experiments

2.1 Experimental Setup

The first step in the experimental setup used in this study is to create noisy and restored versions of each of the three publicly-available datasets selected for our experiments: MNIST, CIFAR-10 and SVHN (further information on these datasets is presented in Sect. 2.2). To do that, five copies of the original dataset are degraded by Gaussian noise with standard deviation \(\sigma = \{10, 20, 30, 40, 50\}\), and another five copies are corrupted by salt & pepper noise with \(p = \{0.1, 0.2, 0.3, 0.4, 0.5\}\), where p is the probability of a pixel being affected by the noise. Next, denoising methods are applied to each of these versions, generating 10 restored versions of each dataset, one for each noisy version.
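The noise injection step can be summarized by the sketch below, a minimal NumPy illustration rather than our released code; the array `train_images` and the 8-bit pixel range are assumptions made only for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Placeholder dataset; in practice this would hold the actual training images.
train_images = np.zeros((100, 28, 28), dtype=np.uint8)

def add_gaussian_noise(images, sigma):
    """Additive Gaussian noise with standard deviation sigma (in pixel units)."""
    noisy = images.astype(np.float64) + rng.normal(0.0, sigma, images.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def add_salt_and_pepper_noise(images, p):
    """With probability p a pixel is replaced by 0 or 255 (half chance each)."""
    noisy = images.copy()
    affected = rng.random(images.shape) < p
    salt = rng.random(images.shape) < 0.5
    noisy[affected & salt] = 255
    noisy[affected & ~salt] = 0
    return noisy

# Ten noisy versions per dataset: five Gaussian and five salt & pepper levels.
gaussian_versions = {s: add_gaussian_noise(train_images, s) for s in (10, 20, 30, 40, 50)}
sp_versions = {p: add_salt_and_pepper_noise(train_images, p) for p in (0.1, 0.2, 0.3, 0.4, 0.5)}
```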

The restored versions of the datasets affected by Gaussian noise are obtained by filtering the images with the Non-Local Means (NLM) algorithm [3]. To perform the NLM denoising, we used a \(7 \times 7\) patch, an \(11 \times 11\) window and set the parameter h equal to the standard deviation of the Gaussian noise used to corrupt the dataset being restored. Regarding the salt & pepper noise, all restored versions were generated by filtering the noisy images with a \(3 \times 3\) median filter. Hence, we have 21 different versions of each dataset (the original dataset, 10 noisy versions and 10 restored versions). Since all versions contain the same images, we always use the training-test split presented in the original dataset paper.
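A possible implementation of this restoration step uses scikit-image and SciPy; the paper does not state which implementation was used, so the library choice, the rescaling of h to the [0, 1] intensity range, and the single-channel (grayscale) images are assumptions of this sketch.

```python
import numpy as np
from scipy.ndimage import median_filter
from skimage.restoration import denoise_nl_means

def nlm_restore(image_u8, sigma):
    """Restore one Gaussian-noisy image with NLM: 7x7 patch, 11x11 window, h = sigma."""
    img = image_u8.astype(np.float64) / 255.0
    restored = denoise_nl_means(img,
                                patch_size=7,       # 7x7 patch
                                patch_distance=5,   # 2*5 + 1 = 11x11 search window
                                h=sigma / 255.0)    # h = sigma, rescaled to [0, 1]
    return np.clip(restored * 255.0, 0, 255).astype(np.uint8)

def median_restore(image_u8):
    """Restore one salt & pepper noisy image with a 3x3 median filter."""
    return median_filter(image_u8, size=3)

# Applied image by image to build a restored dataset version, e.g.:
# restored = np.stack([nlm_restore(img, 30) for img in gaussian_versions[30]])
```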

The second step is to learn a classifier for each training set version. This means that a single network architecture was trained with each dataset version, creating 21 different classifiers. Then, each classifier was tested on all versions of the test set. This process is illustrated by the diagram shown in Fig. 1. A convolutional neural network architecture was selected for each dataset: for MNIST we used an architecture similar to LeNet-5 [10], while for CIFAR-10 and SVHN an architecture similar to base model C of [15] was used. These architectures were implemented using the Keras library, and our implementation was based on the code available in [1].
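For illustration, a LeNet-5-like Keras model in the spirit of the MNIST architecture could be defined roughly as follows; the layer sizes, activations and optimizer are illustrative assumptions, not the exact configuration of [1].

```python
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

def build_lenet5_like(input_shape=(28, 28, 1), num_classes=10):
    """A LeNet-5-style convolutional network sketch (illustrative hyper-parameters)."""
    model = Sequential([
        Conv2D(6, (5, 5), activation='relu', padding='same', input_shape=input_shape),
        MaxPooling2D((2, 2)),
        Conv2D(16, (5, 5), activation='relu'),
        MaxPooling2D((2, 2)),
        Flatten(),
        Dense(120, activation='relu'),
        Dense(84, activation='relu'),
        Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam',
                  loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model

# One such model is trained per dataset version (21 in total) and then
# evaluated on every version of the test set.
```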

Fig. 1. Experimental setup diagram. In our experiments a different model is trained for each noise configuration, then these models are used to classify all versions of the test set. This figure was based on Fig. 2 of [4].

In the third part of our experimental setup, we compare the learned models. We begin our analysis by comparing classification accuracies when both the training and test sets have the same type and level of noise. By doing so, we want to measure, for that network architecture, how much harder classifying these datasets becomes once the noise occurs at a particular level in the entire dataset (training and test sets).

Additionally, we compare the results on the noisy versions with their restored counterparts, which allows us to measure how much the use of denoising techniques can help to improve accuracy. Afterwards, we visualize the classification results of all trained models on all versions of the test set using heatmaps. Such visualization shows how performance varies for each model. Lastly, we compute the mean (and standard deviation) of the accuracies obtained by each classifier over all test set versions. Using these values we can quantify the overall performance difference among models and compare models with regard to their resilience to different types of noisy images. This setup is illustrated by the diagram shown in Fig. 1.
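The evaluation protocol can be sketched as follows; this is a minimal illustration in which `build_model`, the version dictionaries, the labels and the training hyper-parameters are placeholders rather than our exact settings.

```python
import numpy as np

def evaluate_all_versions(build_model, train_versions, test_versions, y_train, y_test):
    """Train one model per training-set version and test it on every test-set version.

    Returns the accuracy matrix (rows: training versions, columns: test versions)
    together with the per-row mean and standard deviation reported in Table 3.
    """
    acc = np.zeros((len(train_versions), len(test_versions)))
    for i, train_name in enumerate(train_versions):
        model = build_model()
        # Illustrative hyper-parameters, not the ones used in the paper.
        model.fit(train_versions[train_name], y_train, epochs=20, batch_size=128, verbose=0)
        for j, test_name in enumerate(test_versions):
            _, acc[i, j] = model.evaluate(test_versions[test_name], y_test, verbose=0)
    return acc, acc.mean(axis=1), acc.std(axis=1)
```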

2.2 Datasets

 

MNIST: handwritten digits [10], broadly used in deep learning experiments because it provides real-world data that requires minimal pre-processing/formatting.

CIFAR-10: 60,000 color \(32 \times 32\) images equally split into 10 classes [8]. This dataset is subdivided into training and test sets, which include 50,000 and 10,000 images, respectively.

SVHN: house numbers from Google Street View images [11], defining a real-world problem of recognizing digits in natural images. It is composed of 73,257 images in the training set and 26,032 images in the test set.

3 Results and Discussion

To contrast the impact of noise and denoising methods on image quality, the average Peak Signal-to-Noise Ratio (PSNR) values for each dataset version are shown in Table 1. By comparing the results when training and testing are performed on the same dataset version, it is possible to analyse how noise affects classification, in particular whether it makes the task more difficult by changing the parameter space learned by the network. These results are shown in Table 2, in which it is possible to notice that, in general, the presence of noise, even when the images are restored with a denoising algorithm, increases the complexity of the classification task.

Table 1. Average PSNR for each noise level.
Table 2. Accuracy of each network when training and testing were conducted on the same dataset version.
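For reference, PSNR is computed as \(10 \log_{10}(\mathrm{MAX}^2/\mathrm{MSE})\). The averages in Table 1 could be obtained with a sketch like the one below, assuming 8-bit images (MAX = 255); the paper does not specify the implementation actually used.

```python
import numpy as np

def psnr(original, degraded, max_val=255.0):
    """Peak Signal-to-Noise Ratio (in dB) between two images of the same shape."""
    mse = np.mean((original.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def average_psnr(original_images, degraded_images):
    """Mean PSNR over all image pairs of a dataset version (one cell of Table 1)."""
    return float(np.mean([psnr(o, d) for o, d in zip(original_images, degraded_images)]))
```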

To better understand the effects of using denoising methods, we plot some of the results presented in Table 2 in Fig. 3, showing the accuracies of the models trained with images affected by different levels of Gaussian noise as well as salt & pepper noise, and their restored counterparts. Neither of these two plots presents results for the MNIST dataset because, as can be seen in Table 2, the differences for this dataset were too small.

Despite the increase in PSNR when employing NLM for denoising, accuracy decreased when classifying data restored by this method. This is probably due to the fact that such a denoising procedure generates blurry images, removing relevant information, as can be seen in Fig. 2.

Fig. 2. Examples of noisy images for each dataset. The first row shows the original images. The second row depicts images with Gaussian noise (\(\sigma = 30\)) and their restored versions. Finally, the third row shows images affected by salt & pepper noise (\(p = 0.3\)) and denoised by a median filter.

Fig. 3. Comparison of the accuracy of each network for different noise parameters: (a) Gaussian noise standard deviation \(\sigma\); (b) salt & pepper probability p, with and without the use of denoising algorithms (DNo) for restoration.

Fig. 4. Heatmaps representing the results obtained on MNIST (left), CIFAR-10 (center) and SVHN (right).

Next, results comparing the models when classifying all dataset versions are presented using the heatmaps of Fig. 4. Each row in a heatmap represents a version of the training set, while each column displays the results for a version of the test set. As demonstrated in [4], models tend to achieve their best accuracy when classifying data that has the same quality as the data used to train them. Nevertheless, depending on the training set, different generalization capabilities are achieved. This is demonstrated by models whose results remain close to their best even when classifying data affected by other types of noise. To compare the noise resilience of these models, Table 3 shows the mean and standard deviation of the accuracies obtained by each classifier over all test set versions (the mean and standard deviation of each row of each heatmap).
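A heatmap like those in Fig. 4 can be rendered directly from the accuracy matrix, for instance with matplotlib; the plotting tool is an assumption here, since the paper does not name the one used.

```python
import matplotlib.pyplot as plt

def plot_heatmap(acc, version_names, title):
    """Rows: training-set versions; columns: test-set versions."""
    fig, ax = plt.subplots(figsize=(7, 7))
    im = ax.imshow(acc, vmin=0.0, vmax=1.0, cmap='viridis')
    ax.set_xticks(range(len(version_names)))
    ax.set_xticklabels(version_names, rotation=90)
    ax.set_yticks(range(len(version_names)))
    ax.set_yticklabels(version_names)
    ax.set_xlabel('Test set version')
    ax.set_ylabel('Training set version')
    ax.set_title(title)
    fig.colorbar(im, ax=ax)
    fig.tight_layout()
    return fig
```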

Table 3. Average accuracy and standard deviation (in percentages) for each model in all test set versions.

In this comparison, the network trained with the original dataset is used as a baseline, given that this network has no previous knowledge of any type of noise, while the others have already seen noisy images at some level. Therefore, networks trained with noisy images have an advantage when dealing with noise in future data, even when it occurs at a different level. For the MNIST dataset, the model trained on the original data obtained an average accuracy of \(93.13\%\) with a standard deviation of \(9.68\%\), while the best overall result of \(97.27 \pm 1.74\%\) was obtained by the model trained using images affected by salt & pepper noise (with \(p = 0.3\)). On the CIFAR-10 dataset, the best average result (\(57.83 \pm 13.71\%\)) was obtained by the model trained with data corrupted by salt & pepper noise with \(p = 0.4\) and restored using the median filter; the model trained using the original data obtained \(43.50 \pm 21.84\%\). Lastly, on the SVHN dataset, the model trained on the original dataset obtained an overall \(68.21 \pm 24.64\%\) accuracy, against \(83.41 \pm 11.47\%\) obtained by the model trained with images affected by salt & pepper noise (\(p = 0.2\)).

Nevertheless, it is possible to notice that some models are better at generalising to other types of noise. For instance, on the MNIST dataset, most models trained with salt & pepper noise were able to achieve results of around 0.6 or higher, while the other models did not.

To facilitate reproducibility of our experiments, our code is publicly available at http://github.com/tiagosn/dnnnoise2017.

4 Conclusions

We analysed the behaviour of deep convolutional neural networks when dealing with different types of image quality. Our study covered images affected by salt & pepper and Gaussian noise, as well as their restored versions. Although noise injection in the training data is a common practice, our systematic methodology provides a better understanding of the behaviour of the models under noise conditions. The results indicate that training networks using data affected by some types of noise can be beneficial for applications that need to deal with images of varying quality, given that it seems to improve the resilience of the network to other types of noise and noise levels.

Concerning denoising methods, images restored with the median filter, when compared against images with salt & pepper noise, improved the accuracy on data of the same quality. Nevertheless, models trained with salt & pepper noise usually obtained better noise resilience. Restoring images with NLM resulted, for the most part, in a decrease in performance, probably due to the removal of relevant information caused by NLM smoothing; better results might therefore be achieved with a different parameter choice.

As future work we intend to explore deeper models such as VGG [14] and ResNets [7]. These experiments should also include neural networks designed to be robust to noise, such as the one in [6]. Moreover, we aim to conduct experiments on larger datasets such as ImageNet.