
1 Introduction

Artificial Neural Networks (ANNs) have been used for the classification of objects in images for many years [1]. The networks are trained in an extensive learning phase, generally with a huge set of training images. The networks thus created can classify objects in images in real time. One problem for neural networks is that an object can occur at different locations inside the image. Even if the images have the same size, the objects differ in scale and orientation and are slightly distorted. Since the invention of Convolutional Neural Networks (CNNs) by LeCun et al. [2], most systems for object recognition in images now use CNNs. One of the advantages of CNNs is invariance to translations and, to a certain degree, to other transformations. With the help of visualization it was possible to explain how CNNs work [3, 4]. The problem is that the more layers a network has, the more complex the software architecture grows, and the computing time grows with the number of neurons.

Usually, object recognition takes place in the spatial domain. Image information can also be represented in the frequency domain by using the Fourier transform; this mathematical operation makes translation invariance achievable. The Fourier transform has been used in signal processing and communications technology for the transmission and compression of image data for many years.

The aim of this work is to reduce the computational cost of artificial neural networks by applying methods borrowed from signal processing.

2 Related Work

Translation invariance can be achieved by Wavelet Scattering Networks as proposed by Bruna and Mallat [5]. Rippel et al. also propose wavelets for use in ANNs because they work in the spectral and the spatial domain at the same time [6]. These wavelet approaches do not rely purely on the frequency domain; they still use the spatial domain as well.

Due to the property that a convolution in the spatial domain is equivalent to a multiplication in the frequency domain [7], the Fast Fourier Transform (FFT) is already used in CNNs [8]. This differs from our research: there, the transformation is applied to both the images and the filters, only to substitute the convolution by a multiplication, and the detection is still done in the spatial domain.

In the Frequency Sensitive Hash Nets (Freshnets) proposed by Chen et al. [9], the Fourier transform is used to compress Convolutional Neural Networks. Higher frequencies, which are generally less important, get fewer weights than the more important lower frequencies.

3 Method and Architecture

Images can be transformed into the frequency domain by the Fourier transform.

$$\begin{aligned} F(y) = \int _{\mathbb {R}^{n}} f(x) e^{-iy\cdot x} dx. \end{aligned}$$
(1)

With the inverse Fourier transform it is possible to reconstruct the image.

$$\begin{aligned} f(x) = \frac{1}{(2 \pi )^{n}} \int _{\mathbb {R}^{n}} F(y) e^{iy\cdot x} dy. \end{aligned}$$
(2)

This is important for our research because it shows that the full information of the image is preserved in the Fourier transform. With the help of Euler's identity \( e^{ix} = \cos (x) + i \sin (x) \) one can see that the Fourier transform is based on the periodic functions \(\sin \) and \(\cos \). This is the reason for its invariance to translations.
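The translation invariance can be illustrated numerically. The following sketch (our own illustration, not part of the software described later) computes a naive DFT magnitude spectrum in Java and checks that a circular shift of the signal leaves the magnitudes unchanged:

```java
// Assumption for illustration: a naive O(N^2) DFT on a small real signal.
public class ShiftInvarianceDemo {
    // Magnitude spectrum |F(k)| computed directly from the DFT definition.
    static double[] magnitudeSpectrum(double[] x) {
        int n = x.length;
        double[] mag = new double[n];
        for (int k = 0; k < n; k++) {
            double re = 0, im = 0;
            for (int j = 0; j < n; j++) {
                double angle = -2 * Math.PI * k * j / n;
                re += x[j] * Math.cos(angle);
                im += x[j] * Math.sin(angle);
            }
            mag[k] = Math.hypot(re, im);
        }
        return mag;
    }

    public static void main(String[] args) {
        double[] x = {1, 2, 0, 5, 3, 0, 0, 1};
        double[] shifted = new double[x.length];
        // Circular shift by 3 samples: the "object" moves, the content stays the same.
        for (int j = 0; j < x.length; j++) shifted[(j + 3) % x.length] = x[j];
        double[] a = magnitudeSpectrum(x), b = magnitudeSpectrum(shifted);
        for (int k = 0; k < x.length; k++)
            if (Math.abs(a[k] - b[k]) > 1e-9)
                throw new AssertionError("magnitudes differ at k=" + k);
        System.out.println("magnitude spectrum is shift-invariant");
    }
}
```

Note that it is the magnitude (absolute value) of the spectrum that is invariant under a shift; the phase does change.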

There are discrete variants of the Fourier transform, e.g. the Discrete Cosine Transform (DCT), which we use in our experiments.

$$\begin{aligned} X_{ k } = \sum _{n=0 }^{ N-1 }{ x_{n} \cos {\left[ \frac{\pi }{N}\left( n+\frac{1}{2} \right) k \right] }} ~~~~~~~ k = 0,...,N-1. \end{aligned}$$
(3)
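Eq. (3) can be implemented directly. A minimal Java sketch (our own illustration; production code would use a fast O(N log N) variant):

```java
public class Dct1D {
    // DCT-II of Eq. (3): X_k = sum_{n=0}^{N-1} x_n cos[pi/N (n + 1/2) k], unnormalized.
    static double[] dct(double[] x) {
        int n = x.length;
        double[] X = new double[n];
        for (int k = 0; k < n; k++) {
            double sum = 0;
            for (int i = 0; i < n; i++)
                sum += x[i] * Math.cos(Math.PI / n * (i + 0.5) * k);
            X[k] = sum;
        }
        return X;
    }

    public static void main(String[] args) {
        // Sanity check: for a constant signal only the DC coefficient is non-zero, X_0 = N * c.
        double[] X = dct(new double[] {2, 2, 2, 2});
        if (Math.abs(X[0] - 8) > 1e-9 || Math.abs(X[1]) > 1e-9
                || Math.abs(X[2]) > 1e-9 || Math.abs(X[3]) > 1e-9)
            throw new AssertionError("unexpected DCT result");
        System.out.println("X0 = " + Math.round(X[0]));
    }
}
```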

For 2D images the DCT can be calculated in the horizontal and vertical direction of the image.

$$\begin{aligned} X_{ k_{1},k_{2} } = \sum _{n_{1}=0 }^{ N_{1}-1 } \sum _{n_{2}=0 }^{ N_{2}-1 } { x_{n_{1},n_{2}} \cos {\left[ \frac{\pi }{N_{1}}\left( n_{1}+\frac{1}{2} \right) k_{1} \right] } \cos {\left[ \frac{\pi }{N_{2}}\left( n_{2}+\frac{1}{2} \right) k_{2} \right] }}. \end{aligned}$$
(4)
Fig. 1. Image of one handwritten digit 7 of the MNIST dataset [2] and its DCT calculated by Eq. (4).

The DCT of a 2D image presents horizontal frequencies in the direction from left to right and vertical frequencies from top to bottom. The low frequencies are in the upper left corner, the high frequencies in the lower right corner. The DCT in Fig. 1 has been calculated by Eq. (4). The plain DCT of an image has negative and positive values. For printing reasons the values were normalized, so that the DCT minimum value is black, the DCT maximum value is white and the DCT value 0 becomes gray, as you can see in Fig. 1.
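This display normalization can be sketched as a simple linear gray-level mapping (our own illustration; the actual rendering code may differ):

```java
public class GrayNormalize {
    // Linear map from [min, max] to [0, 255]: the minimum becomes black (0),
    // the maximum white (255); when min < 0 < max, the value 0 falls on a mid gray.
    static int[] toGray(double[] values) {
        double min = Double.POSITIVE_INFINITY, max = Double.NEGATIVE_INFINITY;
        for (double v : values) { min = Math.min(min, v); max = Math.max(max, v); }
        int[] gray = new int[values.length];
        for (int i = 0; i < values.length; i++)
            gray[i] = (int) Math.round(255 * (values[i] - min) / (max - min));
        return gray;
    }

    public static void main(String[] args) {
        // Symmetric example: -4 -> black, 0 -> mid gray, 4 -> white.
        int[] g = toGray(new double[] {-4.0, 0.0, 4.0});
        System.out.println(g[0] + " " + g[1] + " " + g[2]);
    }
}
```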

For classification, only the horizontal and vertical position in the DCT and the absolute value of the coefficients are important. Strong positive and strong negative values in the DCT both indicate activity at one frequency. In the learning phase of a neural network, positive and negative values at the same frequency for one class disturb the learning: the weights of a neuron are strengthened by the positive and weakened by the negative input, which has extinguishing effects. For this reason, all values of the DCTs in our network are turned into absolute values.

$$\begin{aligned} Y_{ k_{1},k_{2} }= \sum _{n_{1}=0 }^{ N_{1}-1 } \sum _{n_{2}=0 }^{ N_{2}-1 } { x_{n_{1},n_{2}} \left| \cos {\left[ \frac{\pi }{N_{1}}\left( n_{1}+\frac{1}{2} \right) k_{1} \right] } \cos {\left[ \frac{\pi }{N_{2}}\left( n_{2}+\frac{1}{2} \right) k_{2} \right] } \right| }. \end{aligned}$$
(5)
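A direct, unoptimized Java sketch of Eq. (5) (our own illustration, assuming a small gray-level image given as a 2-D double array):

```java
public class AbsDct2D {
    // Eq. (5): 2-D DCT with the absolute value taken of the cosine product inside the sum.
    static double[][] absDct(double[][] x) {
        int n1 = x.length, n2 = x[0].length;
        double[][] Y = new double[n1][n2];
        for (int k1 = 0; k1 < n1; k1++)
            for (int k2 = 0; k2 < n2; k2++) {
                double sum = 0;
                for (int i = 0; i < n1; i++)
                    for (int j = 0; j < n2; j++)
                        sum += x[i][j] * Math.abs(
                            Math.cos(Math.PI / n1 * (i + 0.5) * k1)
                          * Math.cos(Math.PI / n2 * (j + 0.5) * k2));
                Y[k1][k2] = sum;
            }
        return Y;
    }

    public static void main(String[] args) {
        // Tiny 2x2 "image"; Y[0][0] sums all pixels since |cos(0)*cos(0)| = 1.
        double[][] Y = absDct(new double[][] {{1, 0}, {0, 1}});
        System.out.println("Y00 = " + Math.round(Y[0][0]));
    }
}
```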
Fig. 2. Image of one handwritten digit 7 of the MNIST dataset [2], its DCT calculated by Eq. (4) and its absolute value DCT calculated by Eq. (5).

The absolute value DCT (third image in Fig. 2) has been calculated by Eq. (5). For printing reasons the image values were normalized, so that the DCT minimum value is black and the DCT maximum value is white. In the following experiments we used the absolute value DCTs; the original input images were used only for comparison.

For the execution of the experiments, software has been written in Java with optional use of the library deeplearning4j. It transforms the images into the Fourier domain using the absolute value DCT and then feeds them into a neural network. The architecture is kept simple. The software has modules to visualize the weights and the activity of the neurons during learning. It is possible to connect the image directly to the input layer, or to first transform it into a DCT and connect this to the input layer; so working in the spatial or in the frequency domain is optional. The latter option is the focus of this research. After training, a test dataset is used to evaluate the quality of the classification. As a result the software calculates the accuracy as a value between 0 and 1, where accuracy 1 denotes that all images are classified correctly.
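The evaluation step can be sketched as follows (method and variable names are our own, not those of the actual software):

```java
public class Accuracy {
    // Accuracy = fraction of test images whose predicted class equals the label,
    // a value between 0 and 1 where 1 means every image is classified correctly.
    static double accuracy(int[] predicted, int[] labels) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++)
            if (predicted[i] == labels[i]) correct++;
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] pred  = {7, 2, 1, 0, 4};
        int[] truth = {7, 2, 1, 0, 9};
        System.out.println(accuracy(pred, truth)); // 4 of 5 correct
    }
}
```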

Fig. 3. Architecture of the neural network.

4 Experiments

4.1 Experiments with the MNIST Dataset

The MNIST dataset [2] consists of 70,000 handwritten digits, divided into a training set of 60,000 and a test set of 10,000. The images have the size \(28\,\times \,28\) pixels. The object classes to be detected are the digits 0 to 9.

We built the DCT of each image and converted it to absolute values. To get information about the patterns to be learned by the neural network, we visualized them: we grouped the DCT images by class and built the average for each class. From this per-class average we subtracted the average over all classes, because the neural network performs a similar operation during backpropagation. Afterwards we increased the contrast manually to make interpretation easier for the human visual system.
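The statistics behind this visualization can be sketched as follows (a hedged illustration with our own names; the paper's code is not available to us):

```java
public class ClassAverages {
    // dcts[i] is a flattened absolute value DCT, labels[i] its class in 0..numClasses-1.
    // Returns, per class, the class average minus the average over all images.
    static double[][] classMinusGlobal(double[][] dcts, int[] labels, int numClasses) {
        int dim = dcts[0].length, n = dcts.length;
        double[][] classSum = new double[numClasses][dim];
        int[] count = new int[numClasses];
        double[] globalSum = new double[dim];
        for (int i = 0; i < n; i++) {
            count[labels[i]]++;
            for (int d = 0; d < dim; d++) {
                classSum[labels[i]][d] += dcts[i][d];
                globalSum[d] += dcts[i][d];
            }
        }
        for (int c = 0; c < numClasses; c++)
            for (int d = 0; d < dim; d++)
                classSum[c][d] = classSum[c][d] / count[c] - globalSum[d] / n;
        return classSum;
    }

    public static void main(String[] args) {
        // Toy data: two "DCTs" of class 0 and one of class 1, each with 2 coefficients.
        double[][] dcts = {{2, 0}, {4, 0}, {0, 6}};
        int[] labels = {0, 0, 1};
        double[][] diff = classMinusGlobal(dcts, labels, 2);
        System.out.println(diff[0][0] + " " + diff[1][1]);
    }
}
```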

Fig. 4. Absolute value DCTs of the MNIST dataset grouped by class. For visualization reasons the images are contrasted. The images are only created for visualization of the patterns; they are not used in the network.

So image 0 in Fig. 4 is the average of all DCTs of digit 0 in the MNIST database minus the average DCT of all images; the image for digit 1 is the average of all DCTs of digit 1 minus the average of all DCTs, and so on. We can now interpret the DCTs: Digit 1 is similar to a vertical bar, so it mainly has horizontal frequencies, visible as a white band at the top of the DCT. Digit 0 is a circle with frequencies in all directions, so the DCT is concentrated in the upper left corner. Digit 8 consists of two circles, so its DCT is similar to that of the 0; because the two circles of the 8 are smaller in diameter than the circle of the 0, the frequencies in the DCT of the 8 are pushed towards the middle. In the DCT of the 7 you can see the horizontal bar of the digit as a white band of frequencies at the left side of the image; the diagonal part of the 7 is the small white band running from the upper left corner towards the center. The average DCTs are not used in the neural network. They are only produced to visualize the patterns and thus to explain the method. As input for our network we only used the absolute value DCTs as calculated by Eq. (5).

In our program we have the option to trace back the activity of the neurons, and we can visualize the weights of the neurons.

Fig. 5. Visualization of the weights of one neuron in the hidden layer after 400, 500, 600 and 1000 iterations.

In Fig. 5 this is shown for the digit 7. The neuron of the output layer where digit 7 is detected has its maximum weight to one neuron of the hidden layer. The weights of this neuron are shown during training after 400, 500, 600 and 1000 iterations.

Comparing the average absolute value DCT (image 7 in Fig. 4) to the last image of Fig. 5, one can see that they are similar. Note that the first is built manually by statistics, whereas the weights are built automatically by our network. This underlines that our method of building average DCTs is a valid way to visualize the process of image classification in the Fourier domain.

We trained our network with a configuration of 1400 neurons in the hidden layer and 10 epochs, using the training set of the MNIST database converted to absolute value DCTs. In the test of the network trained this way, we used the test set of the MNIST database converted to absolute value DCTs and reached an accuracy of 0.9805, which means that 98% of the digits were classified correctly.

4.2 Experiments with CIFAR-10 Dataset

The CIFAR-10 dataset [10] consists of 60,000 small color images, divided into a training set of 50,000 and a test set of 10,000. The images have the size \(32\,\times \,32\) pixels. The ten classes are airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck. Because our implementation of the DCT network does not work with color information yet, we transformed the CIFAR-10 dataset into gray-level images with 256 steps between black and white (bw) (Fig. 6).

Fig. 6. Example images of the CIFAR-10(bw) dataset [10]. The dataset consists of 60,000 images of the ten classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship and truck.

Fig. 7. Absolute value DCTs of the CIFAR-10(bw) dataset grouped by class. For visualization reasons the images are contrasted. The images are only created for visualization of the patterns; they are not used in the network.

In the same way as in the previous experiment, we built the absolute value DCTs and grouped them by class to reveal the patterns in the average images. The results are shown in Fig. 7. Again we can interpret the DCTs: Average horse pictures have strong vertical structures caused by the legs, so we get the upper white band in the DCT image of the horse. The average frog has neither vertical nor horizontal structures. Automobiles and trucks are similar to each other; the latter have more vertical structures, as you can see by the small white bar at the top of the average DCT. So also for real images, patterns within a class can be found in the Fourier domain. The network still has to adapt to different patterns within one class, but in comparison to the spatial domain the DCT images are more similar. Again, the average DCTs are not used in the neural network; they are presented here only to show the patterns which the neural network learns in the frequency domain.

In the experiments with the CIFAR-10(bw) dataset we compared classification in the frequency domain to classification in the spatial domain. We tried different network parameters. After a test with the CIFAR-10(bw) dataset we kept the network parameters unchanged and repeated the classification with the CIFAR-10(bw) absolute value DCT dataset. In this way it is possible to compare the accuracy of the two methods. In the following figures the accuracy on the CIFAR-10(bw) dataset is shown by the line with squares, on the CIFAR-10(bw) absolute value DCTs by the line with triangles. First we took the architecture with one hidden layer (Fig. 3) and a varying number of epochs. In one epoch, each image of the training set is used once to train the weights of the neurons. As you can see in Fig. 8, the accuracy on both datasets is not very high, but the accuracy on the CIFAR-10(bw) absolute value DCTs is generally higher than the accuracy on the CIFAR-10(bw) dataset.

Fig. 8. Accuracy of CIFAR-10(bw) marked by squares and CIFAR-10(bw) absolute value DCT marked by triangles in an ANN with one hidden layer, tested with 2 to 20 epochs.

In the next experiment we tested the influence of the number of neurons. As you can see in Fig. 9, the accuracy on the CIFAR-10(bw) absolute value DCTs is again much higher than the accuracy on the CIFAR-10(bw) dataset.

Fig. 9. Accuracy of CIFAR-10(bw) marked by squares and CIFAR-10(bw) absolute value DCT marked by triangles in an ANN with one hidden layer, tested with 50 to 2000 neurons.

For our last experiment we changed the architecture of our network: it is similar to the architecture in Fig. 3, but in front of the hidden layer we put one convolution layer and one subsampling layer. We chose this configuration to reduce the effects of rotation and deformation, which are still present in the DCTs. So we combined the Fourier transform with a convolutional process (Fig. 10).

Fig. 10. Accuracy of CIFAR-10(bw) marked by squares and CIFAR-10(bw) absolute value DCT marked by triangles in an ANN with convolution, subsampling and hidden layer, tested with 25 to 1500 neurons.

In this last experiment, an accuracy of 0.6112 was reached for the classification of the CIFAR-10(bw) absolute value DCT dataset with 1000 neurons in the hidden layer. With the CIFAR-10(bw) dataset the maximal accuracy is only 0.5418. In most cases the accuracy achieved with the DCTs is better than the accuracy reached with the original images.

5 Conclusion

Our experiment with the MNIST dataset showed that a classification of handwritten digits is possible in the frequency domain with a very good accuracy of 0.9805. Our experiments with the CIFAR-10(bw) dataset showed that classification in the Fourier domain also works with real images. Since we only worked in black and white whereas the CIFAR-10 dataset includes color information, it is not intended to compare the results to other methods, which work with hundreds of layers, several million neurons and a lot of computing power. The experiments are a comparison of two methods, and it was shown that classification in the Fourier domain outperforms classification in the spatial domain, especially for handwritten digits.

We also revealed the characteristic patterns of the classes in the Fourier domain, on which the neural network is trained. With the successful use of a convolution layer in the frequency domain we showed that it is possible to combine convolution and a Fourier domain representation. In future work it is planned to include the results of this work in larger and more complex neural networks.