1 Introduction

The aurora is a beautiful natural phenomenon generated by the interaction between energetic charged particles from outer space and Earth's upper atmosphere. Different types of aurora are correlated with specific dynamic activities [1]. Aurora image classification is the basis for aurora semantic analysis. Aurora images can be divided into three types: arc aurora, drapery corona aurora and radiation corona aurora. Figure 1 gives examples of these three kinds of aurora images.

Fig. 1. Examples of three kinds of aurora images. (a) arc aurora; (b) drapery corona aurora; (c) radiation corona aurora.

Since Syrjasuo et al. first introduced computer vision techniques to aurora image analysis in 2004 [2], many automatic algorithms for aurora image classification have appeared [3,4,5,6,7,8,9,10]. Most available methods consist of two steps [10]: features are extracted from the raw input in the first step, and a classifier is learned from the obtained features in the second step. However, aurora images are gray-scale and have nonrigid structure, so most hand-crafted features designed for natural images cannot represent them properly.

Recently, the convolutional neural network (CNN) has been demonstrated to be an effective model in many computer vision fields such as image classification [11], image segmentation [12] and object detection [13]. A CNN can automatically learn complex patterns and representative features from data in a hierarchical stream. This motivates us to design an automatic aurora classification method based on CNN.

In most CNN models, initialization is crucial, since poorly initialized networks are likely to converge to poor local minima [14]. Usually, networks are initialized with parameters learned from a large-scale natural image dataset. But, as shown in Fig. 1, aurora images are essentially different from natural images, so it is not reasonable to pretrain a network for aurora image classification on natural images. To obtain a proper initialization for aurora image classification, a task-specific initialization scheme is designed in our method: the kernels of the first convolutional layer are initialized with features learned from aurora image patches by an auto-encoder.

An aurora image contains black areas in its four corners as well as interferences such as cosmic-ray tracks, dayglow contamination and system noise from the imaging equipment, which can lead to mistakes in extracting patches and labeling aurora types [15, 16]. It is therefore unreasonable to extract patches from aurora images randomly, or with a sliding window in zigzag order from the upper-left to the lower-right corner, as is usually done. Humans have the ability to choose visual information of interest while looking at a static or dynamic scene [17]. This remarkable ability helps us interact with complex environments by selecting relevant and important information to be processed in the brain. It has been claimed that visual attention is attracted to the most informative region [18]; that is, the areas humans attend to are usually the most informative areas of an image. We therefore employ a selection strategy that obtains valuable patches at a set of candidate locations derived from human fixation points captured by an eye tracker.

Besides, most previous patch-based CNN methods used patches of the same size extracted from the image dataset [19]. Either global information or local details will be lost when the patch size is inappropriate. To address this issue, we use patches of different sizes to learn features for initialization, and a bio-inspired strategy based on eye movement information is applied to determine the patch size. Correspondingly, a multi-size kernels CNN is constructed: a multi-scale architecture with three parallel streams. The streams differ in the kernel size of their first convolutional layers, so they have different receptive fields on the input image and learn multi-scale features. A stream with a larger receptive field captures more global information, while a stream with a smaller receptive field captures more local details.

In conclusion, this paper makes three major contributions: (1) a patch selection strategy guided by eye movement information is introduced for patch extraction, and a bio-inspired strategy that uses gaze duration recorded by an eye tracker is applied to determine the patch size; (2) a task-specific initialization scheme is designed that initializes convolutional kernels with features learned from the patches; (3) a new multi-size kernels CNN (MSKCNN) is designed to utilize multi-scale information.

Fig. 2. Flow chart of the proposed model.

2 The Proposed Model

The structure of the proposed model is shown in Fig. 2: a multi-scale architecture containing parallel single-layer auto-encoders and a multi-size kernels CNN. The details of our model are specified as follows.

2.1 Eye Tracking Data Collection

In this paper, an EyeLink 1000 eye tracker with a 2000 Hz sampling rate is used to record eye tracking data. A bite bar is used to minimize head movement. Subjects view the aurora images on a 19-inch monitor at a distance of 70 cm.

Five subjects took part in the eye tracking experiment. All are space physics experts who are very familiar with aurora images. They were informed that they would be shown a series of aurora images and asked to give the type of each image. Responses were recorded by button press: buttons '1', '2' and '3' represent arc aurora, drapery corona aurora and radiation corona aurora, respectively.

At the beginning of the experiment, the eye tracker is calibrated using a 9-point calibration and validation procedure. As can be seen in Fig. 3, a fixation cross is first shown for 200 ms before each aurora image. Each aurora image is then displayed on the screen for 4 s, followed by the fixation-cross picture again; to initiate the next aurora image, the subject has to fixate the cross centered on the screen for 200 ms. According to our study purpose, two measures, fixation point and gaze duration, are used in this paper. As shown in Fig. 3, the center of each red circle indicates the location of a fixation point, and the radius of the circle is directly proportional to gaze duration.

Fig. 3. Procedure of the eye tracking data collection.

2.2 Patches Extraction Guided by Eye Movement Information

Attention selection and information abstraction are cognitive mechanisms employed by humans for parsing, gazing at, structuring and organizing perceptual stimuli [20]. A cognitive technique is adopted to obtain valuable patches at a set of candidate locations derived from the fixation points. Psychological research finds that gaze duration corresponds to the duration of cognitive processing of the material located at the fixation [21]. This means that an area usually receives a long gaze duration when the subject observes its details carefully, so there should be an inverse correlation between gaze duration and patch size.

We apply a linear function with negative slope to map the gaze duration in the eye movement information to a patch size. For an aurora image, the patch size k is assigned according to the information of the fixation points as given in Eq. (1),

$$\begin{aligned} k=a-b*[floor(3*\frac{t-t_{avemin}}{t_{avemax}-t_{avemin}})-1],\quad t_{avemin}<t<t_{avemax}, \end{aligned}$$
(1)

where t is the gaze duration of a specific eye-fixation point, and \(t_{avemin}\) and \(t_{avemax}\) are the average minimum and maximum gaze durations over all aurora images, respectively; the distribution of gaze duration is shown in Fig. 4. Fixation points with gaze duration less than \(t_{avemin}\) or more than \(t_{avemax}\) are excluded as outliers. The patches are extracted at the valid fixation points of all subjects, which guarantees that the features learned from these valuable patches reflect the generality of all subjects. The constant a is determined by experiments; b is the size difference between patch sizes, empirically chosen as 2. There are thus three kinds of patches, with sizes \(a-b\), a and \(a+b\), and each kind of patch is used to train a single-layer auto-encoder.
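
As a concrete illustration, a minimal sketch of this mapping follows (the default values a = 11 and b = 2 are taken from Sect. 3; the function and variable names are our own):

```python
import math

def patch_size(t, t_avemin, t_avemax, a=11, b=2):
    """Map gaze duration t to a patch size in {a-b, a, a+b} (Eq. (1)).

    Longer gaze durations give smaller patches; durations outside
    (t_avemin, t_avemax) are treated as outliers and rejected.
    """
    if not (t_avemin < t < t_avemax):
        return None  # outlier fixation point, excluded
    # For valid t the bin index falls in {0, 1, 2}.
    bin_index = math.floor(3 * (t - t_avemin) / (t_avemax - t_avemin))
    return a - b * (bin_index - 1)
```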

Fig. 4. Distribution of gaze duration.

2.3 Task-Specific Initialization

This section elaborates our task-specific initialization scheme, which consists of three steps: data pre-processing, auto-encoder training, and convolutional kernel construction for the CNN.

Data Pre-processing. We reshape each patch of size k (i.e., \(k \times k\) pixels) into a column vector and combine all these column vectors into a matrix,

$$\begin{aligned} X_{original}^{k}\in \mathbb {R}^{k^{2}*M}, \end{aligned}$$
(2)

where M is the number of patches. Because the learning capability of the auto-encoder depends on the quality of the input data [22], \(X_{original}^{k}\) is preprocessed to reduce the redundancy of the input. Firstly, we subtract the mean value of each column (each column has \(k^{2}\) entries) and get \(X_{zero\_mean}^{k}\),

$$\begin{aligned} X_{zero\_mean}^{k}= X_{original}^{k}-\frac{1}{k^{2}} \begin{bmatrix} sum(x_{original}^{k(1)})&\cdots&sum(x_{original}^{k(M)}) \\ \vdots&\ddots&\vdots \\ sum(x_{original}^{k(1)})&\cdots&sum(x_{original}^{k(M)}) \end{bmatrix}. \end{aligned}$$
(3)

Then singular value decomposition (SVD) is performed on the covariance matrix \(\varSigma \) of \(X_{zero\_mean}^{k}\):

$$\begin{aligned} U\varLambda U^{T}=svd(\varSigma ),\quad U,\varLambda ,\varSigma \in \mathbb {R}^{k^{2}*k^{2}}. \end{aligned}$$
(4)

The matrix U contains the eigenvectors of \(\varSigma \), and \(\varLambda =diag(\lambda _{1},\lambda _{2},\cdots ,\lambda _{k^{2}})\) contains the corresponding eigenvalues. Finally, we get the preprocessed version of \(X_{original}^{k}\) as:

$$\begin{aligned} X^{k}=U*diag(\frac{1}{\sqrt{\lambda _{1}+\varepsilon }},\frac{1}{\sqrt{\lambda _{2}+\varepsilon }},\cdots ,\frac{1}{\sqrt{\lambda _{k^{2}}+\varepsilon }})*U^{T}*X_{zero\_mean}^{k}, \end{aligned}$$
(5)

where \(\varepsilon \) is a constant, empirically chosen as \(\varepsilon = 10^{-5}\).
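
The transform in Eqs. (3)–(5) amounts to ZCA-style whitening; a minimal NumPy sketch (variable names are our own) might look as follows:

```python
import numpy as np

def whiten_patches(X_original, eps=1e-5):
    """ZCA-style whitening of the k^2 x M patch matrix, Eqs. (3)-(5)."""
    # Eq. (3): subtract the mean value of each column (each patch).
    X = X_original - X_original.mean(axis=0, keepdims=True)
    # Eq. (4): eigendecomposition of the covariance matrix via SVD.
    cov = X @ X.T / X.shape[1]
    U, lam, _ = np.linalg.svd(cov)
    # Eq. (5): rescale along the eigen-directions and rotate back.
    return U @ np.diag(1.0 / np.sqrt(lam + eps)) @ U.T @ X
```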

Auto-Encoder Training. The auto-encoder is an unsupervised learning algorithm that applies back-propagation to learn a function

$$\begin{aligned} y=h_{W,b}(x)\approx x. \end{aligned}$$
(6)

The architecture of the single-layer auto-encoder used in this work is shown in Fig. 5. By learning an approximation to the identity function, the auto-encoder automatically learns representative features from unlabeled data. Fed with patches extracted from aurora images, it learns task-specific features for aurora imagery, and these learned features are used to initialize the convolutional kernels.

Fig. 5. Architecture of the single-layer auto-encoder.

Suppose there is a training set:

$$\begin{aligned} \{(x^{k(1)},y^{1}),(x^{k(2)},y^{2}),\cdots ,(x^{k(M)},y^{M})\}, \end{aligned}$$
(7)

the cost function of the single-layer auto-encoder can be defined as,

$$\begin{aligned} J(W,b)=\frac{1}{M}\sum _{m=1}^{M}\frac{1}{2}||h_{W,b}(x^{k(m)})-y^{m}||^2+ \frac{\lambda }{2}\sum _{l=1}^{2}\sum _{j=1}^{s_{l}}\sum _{i=1}^{s_{l+1}}(W_{ij}^{l})^{2}, \end{aligned}$$
(8)

where \(h_{W,b}(x)\) is the output of the network for input x; \(s_{l}\) is the number of units in the lth layer and \(W_{ij}^{l}\) is the weight between the jth unit in the lth layer and the ith unit in the \((l+1)\)th layer (for the single-layer auto-encoder, l runs over the two weight layers). The first term in J(W,b) is an average sum-of-squares error; the second is a regularization term against over-fitting.

A sparsity constraint is imposed on the hidden layer to find representative features efficiently. Let \(\hat{\rho }_{j}\) be the average activation of hidden unit j:

$$\begin{aligned} \hat{\rho }_{j}=\frac{1}{M}\sum _{m=1}^{M}[a_{j}(x^{k(m)})]. \end{aligned}$$
(9)

An extra penalty term is added to J(W,b) to ensure that \(\hat{\rho }_{j}\) stays close to the sparsity parameter \(\rho \). In this paper, the penalty term is set as follows:

$$\begin{aligned} \sum _{j=1}^{H}KL(\rho |\hat{\rho }_{j}), \end{aligned}$$
(10)

where H is the number of hidden units and \(KL(\rho |\hat{\rho }_{j})\) is the Kullback-Leibler (KL) divergence between a Bernoulli random variable with mean \(\rho \) and a Bernoulli random variable with mean \(\hat{\rho }_{j}\). The overall cost function of the sparse auto-encoder then becomes:

$$\begin{aligned} J_{sparse}=J(W,b)+\beta \sum _{j=1}^{H}KL(\rho |\hat{\rho }_{j}), \end{aligned}$$
(11)

where \(\beta \) controls the weight of the sparsity penalty. To train the auto-encoder, we repeatedly take gradient descent steps to reduce the cost function.
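
Under the conventions above, the overall cost can be computed as in the following NumPy sketch (the hyper-parameter values lam, rho and beta shown here are illustrative assumptions, not the paper's settings):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_ae_cost(W1, b1, W2, b2, X, lam=1e-4, rho=0.05, beta=3.0):
    """Cost of a single-layer sparse auto-encoder, Eqs. (8)-(11).

    X is the k^2 x M matrix of whitened patches; the target y equals X.
    """
    M = X.shape[1]
    A1 = sigmoid(W1 @ X + b1)   # hidden activations, H x M
    Y = sigmoid(W2 @ A1 + b2)   # reconstruction h_{W,b}(x)
    # Eq. (8): reconstruction error plus weight decay.
    J = 0.5 / M * np.sum((Y - X) ** 2) \
        + 0.5 * lam * (np.sum(W1 ** 2) + np.sum(W2 ** 2))
    # Eq. (9): average activation of each hidden unit.
    rho_hat = A1.mean(axis=1)
    # Eqs. (10)-(11): Bernoulli KL-divergence sparsity penalty.
    kl = rho * np.log(rho / rho_hat) \
        + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    return J + beta * np.sum(kl)
```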

Convolutional Kernel Construction for CNN. Patches that maximally activate the hidden units of an auto-encoder are treated as the features learned by that auto-encoder. These learned features are used to initialize the convolution filters.

It has been shown that the input which maximally activates hidden unit i is given by setting each pixel value to [23]:

$$\begin{aligned} C_{i}^{k}(j)=\frac{W_{ij}^{l}}{\sqrt{\sum _{j=1}^{k^{2}}(W_{ij}^{l})^{2}}} (i=1,2,\cdots ,H;j=1,2,\cdots ,k^{2}), \end{aligned}$$
(12)

where H is the number of hidden units in the auto-encoder, which is also the number of feature maps in the first convolutional layer. The kernels of the first convolutional layer in our MSKCNN model are then initialized with the patches \(C_{i}^{k} (i=1,2,\cdots ,H)\) formed by these pixel values \(C_{i}^{k}(j) (j=1,2,\cdots ,k^{2})\). As a result, the characteristic features of aurora images are captured by the first convolutional layer, so the parameters after the first convolutional layer can be initialized randomly.
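
Equivalently, each kernel is the row-normalized weight vector of one hidden unit, as in this short NumPy sketch (naming is our own):

```python
import numpy as np

def kernels_from_encoder(W1, k):
    """Eq. (12): turn the H x k^2 encoder weight matrix W1 into
    H convolution kernels of size k x k by row-normalization."""
    C = W1 / np.sqrt((W1 ** 2).sum(axis=1, keepdims=True))
    return C.reshape(-1, k, k)
```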

2.4 Net Architecture of the Multi-size Kernels CNN

Our network starts with three streams. Each stream has the same architecture as the network designed by Krizhevsky et al. [11]; the streams differ only in the kernel size of their first convolutional layers, so they have different receptive fields on the input image. As a result, multi-scale features can be learned by the proposed MSKCNN.

With our task-specific initialization, the feature maps of the first convolutional layer in each stream are computed as,

$$\begin{aligned} I\_conved(row,col)=\sum _{\delta _{i}=1}^{k}\sum _{\delta _{j}=1}^{k}C_{i}^{k}(\delta _{i},\delta _{j})* X^k(s_c*row+\delta _{i},s_c*col+\delta _{j}), \end{aligned}$$
(13)

where \(s_c\) is the stride of convolution kernels.
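
For illustration only, Eq. (13) can be rendered as a naive loop-based correlation; a real implementation would rely on the framework's optimized convolution:

```python
import numpy as np

def conv2d_valid(X, C, stride):
    """Naive strided 'valid' correlation implementing Eq. (13) for one
    k x k kernel C over a single-channel image X."""
    k = C.shape[0]
    rows = (X.shape[0] - k) // stride + 1
    cols = (X.shape[1] - k) // stride + 1
    out = np.zeros((rows, cols))
    for r in range(rows):
        for c in range(cols):
            window = X[r * stride:r * stride + k,
                       c * stride:c * stride + k]
            out[r, c] = np.sum(C * window)
    return out
```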

The streams are fused by concatenating their last fully connected layers end-to-end. Because each last fully connected layer has dimension 1024, the fused layer is a 1024*3 = 3072-dimensional feature vector. It is followed by an additional 1024-dimensional fully connected layer that reduces the dimension.
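
For concreteness, the three-stream fusion can be sketched as below. This is written in PyTorch purely for illustration (the paper's experiments use Caffe [24]), and the stream body is a simplified stand-in for the AlexNet-style streams, not the actual layer configuration:

```python
import torch
import torch.nn as nn

def make_stream(ks, out_dim=1024):
    # Simplified stand-in for an AlexNet-style stream; only the first
    # convolution's kernel size ks differs between the three streams.
    return nn.Sequential(
        nn.Conv2d(1, 96, kernel_size=ks, stride=4), nn.ReLU(),
        nn.AdaptiveAvgPool2d(6), nn.Flatten(),
        nn.Linear(96 * 6 * 6, out_dim), nn.ReLU(),
    )

class MSKCNN(nn.Module):
    def __init__(self, kernel_sizes=(9, 11, 13), num_classes=3):
        super().__init__()
        self.streams = nn.ModuleList(make_stream(ks) for ks in kernel_sizes)
        self.fuse = nn.Linear(1024 * 3, 1024)   # reduce 3072 -> 1024
        self.classify = nn.Linear(1024, num_classes)

    def forward(self, x):                       # x: N x 1 x 256 x 256
        feats = [s(x) for s in self.streams]    # three 1024-d vectors
        fused = torch.cat(feats, dim=1)         # concatenated 3072-d vector
        return self.classify(torch.relu(self.fuse(fused)))
```

For a batch of gray-scale 256*256 images, `MSKCNN()(torch.randn(2, 1, 256, 256))` returns a 2*3 tensor of class scores.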

3 Experiments and Analysis

3.1 Datasets and Experimental Setup

The experimental data used in this paper were obtained by the all-sky imager at the Arctic Yellow River Station. 9000 manually labeled images are used as our aurora dataset, with 3000 images per category. All images have an original size of 512*512 and are resized to 256*256. We train our model with a batch size of 100. Training and testing are performed on a single GTX TITAN GPU with the Caffe deep learning framework [24].

3.2 Experiment

Evaluation on Task-Specific Initialization. In this section, two comparison experiments are conducted to demonstrate the effectiveness of the proposed task-specific initialization.

In the first experiment, we investigate the performance of the single-stream model with different kernel sizes in the first convolutional layer under different kinds of initialization. One-tenth of the images in our aurora dataset are chosen randomly as training data, and the rest as testing data. The results are given in Table 1: the first row shows the performance with random initialization, the second row with the proposed task-specific initialization, and the third row with the eye-movement-guided task-specific initialization. The average classification accuracy with our task-specific initialization is 61.3%, compared with 56.1% for random initialization.

Table 1. Classification accuracy under different patch sizes
Fig. 6. Classification accuracy under different proportions of training and testing data.

In the second experiment, we test the performance of task-specific initialization under different proportions of training and testing data. The experimental results are shown in Fig. 6; our task-specific initialization consistently outperforms random initialization.

State-of-the-art Comparison. Performance under different values of the parameter a in Eq. (1) is tested to determine the kernel sizes of the first convolutional layers in our model. As shown in Table 2, the best classification accuracy, 96.2%, is obtained when a is 11. The kernel sizes of the first convolutional layers in the three streams of the proposed MSKCNN are therefore set to 9, 11 and 13, respectively.

Table 2. Classification accuracy under different a (train:test = 6:4)
Fig. 7. Performance of different methods on our aurora image dataset (train:test = 6:4).

We compare the proposed MSKCNN model with three convolutional neural networks for image classification, LeNet [25], GoogLeNet [26] and AlexNet [11], and with the latest deep-learning-based aurora image classification method, 2DPCANet [10]. The results are presented in Fig. 7: the classification accuracy increases from 88.7% with the original AlexNet to 96.2% with the proposed method, and our method outperforms the other deep-learning-based classification methods.

4 Conclusion

In this paper, we present an approach for aurora image classification. The proposed MSKCNN contains three streams with different receptive fields to utilize information flows from small to large contexts. Patch extraction is guided by the eye fixation locations and gaze durations of space physics experts, which makes it more consistent with human visual and cognitive mechanisms. A task-specific initialization is then designed: it learns features from aurora image patches and initializes the kernels of the first convolutional layers with these learned features. Experimental results demonstrate the effectiveness of the proposed method on our aurora dataset.

In the future, we will extend the proposed work to other datasets.