1 Introduction

Deep learning techniques have been the key to major improvements in machine learning in various domains such as image and speech recognition and machine translation. Besides more affordable computational power, the proposal of new kinds of architectures such as ResNet [8] and DenseNet [9] helped to increase the accuracy. However, deciding which architecture to choose and how to wire different layers for a particular data set is not trivial and demands domain expertise and time from the human practitioner.

Within the last one or two years we have observed an increase in research efforts by the machine learning community to automate the search for neural network architectures. Researchers showed that both reinforcement learning [31] and neuro-evolution [20] are capable of finding network architectures that are competitive with the state of the art. Since these methods still require GPU years to reach this performance, further work has been proposed to significantly decrease the run time [1, 3, 15, 30, 32].

In this paper, we present a simple evolutionary algorithm which reduces the search time to just hours. This is an important step since, similar to hyperparameter optimization for other machine learning models, optimizing the network architecture now becomes affordable for everyone. Our approach starts from a very simple network template which contains a sequence of neuro-cells. These neuro-cells are architecture patterns, and the optimal pattern is automatically detected by our proposed algorithm. The algorithm assumes that the cell initially contains only a single convolutional layer and then keeps changing it by function-preserving mutations. These mutations change the structure of the architecture without changing the network’s predictions. This can be considered a special initialization such that the network requires less computational effort for training.

Our contributions in this paper are three-fold:

  1. We are the first to propose an evolutionary algorithm which optimizes neuro-cells with function-preserving mutations.

  2. We expand the set of function-preserving operations proposed by Chen et al. [4] to depthwise separable convolutions, kernel widening, skip connections and layers with multiple in- and outputs.

  3. We provide empirical evidence that our method outperforms many competitors within only hours of search time. We analyze our proposed method and the transferability of neuro-cells in detail.

2 Related Work

Evolutionary algorithms and reinforcement learning are currently the two state-of-the-art techniques used by neural network architecture search algorithms. With Neural Architecture Search [31], Zoph et al. demonstrated in an experiment over 28 days and with 800 GPUs that neural network architectures with performance close to state-of-the-art architectures can be found. In parallel or inspired by this work, others proposed to use reinforcement learning to detect sequential architectures [1], reduce the search space to repeating cells [30, 32] or apply function-preserving actions to accelerate the search [3].

Neuro-evolution dates back three decades. In the beginning it focused only on evolving weights [18], but it turned out to be effective to evolve the architecture as well [23]. Neuro-evolutionary algorithms gained new momentum due to the work by Real et al. [20]. In an extraordinary experiment that used 250 GPUs for almost 11 days, they showed that architectures can be found which provide results similar to human-crafted image classification network architectures. Very recently, the idea of learning cells instead of the full network has also been adopted for evolutionary algorithms [15]. Miikkulainen et al. even propose to coevolve a set of cells and their wiring [17].

Other methods that try to optimize neural network architectures or their hyperparameters are based on model-based optimization [7, 14, 22, 26], random search [2] and Monte-Carlo Tree Search [19, 27].

3 Function-Preserving Knowledge Transfer

Chen et al. [4] proposed a family of function-preserving network manipulations in order to transfer knowledge from one network to another. Suppose a teacher network is represented by a function \(f\left( \mathbf {x}\ |\ \varvec{\theta }^{\left( f\right) }\right) \) where \(\mathbf {x}\) is the input of the network and \(\varvec{\theta }^{\left( f\right) }\) are its parameters. Then an operation changing the network f to a student network g is called function-preserving if and only if the output for any given input remains unchanged:

$$\begin{aligned} \forall \mathbf {x}:\ f\left( \mathbf {x}\ |\ \varvec{\theta }^{\left( f\right) }\right) =g\left( \mathbf {x}\ |\ \varvec{\theta }^{\left( g\right) }\right) \!. \end{aligned}$$
(1)

Note that typically the number of parameters of f and g differs. We will use this approach to initialize our mutated network architectures. Then, the network is trained for some additional epochs with gradient-based optimization techniques. With this initialization, the network requires only a few epochs before it provides decent predictions. We briefly explain the previously proposed manipulations and our novel contributions to this set. Please note that a fully connected layer is a special case of a convolutional layer.

3.1 Convolutions in Deep Learning

Convolutional layers are a common layer type used in neural networks for visual tasks. We denote the convolution operation between the layer input \(X\in \mathbb {R}^{w\,\times \,h\,\times \,i}\) and a layer with parameters \(W\in \mathbb {R}^{k_{1}\,\times \,k_{2}\times i\,\times \,o}\) by \(X*W\). Here, i is the number of input channels, \(w\,\times \,h\) the input dimension, \(k_{1}\,\times \,k_{2}\) the kernel size and o the number of output feature maps. Depthwise separable convolutions, or for short just separable convolutions, are a special kind of convolution factored into two operations. During the depthwise convolution, a spatial convolution with parameters \(W_{d}\in \mathbb {R}^{k_{1}\,\times \,k_{2}\times i}\) is applied to each channel separately; we denote this operation by \(\circledast \). This is in contrast to the typical convolution, which is applied across all channels. In the next step the pointwise convolution, i.e. a convolution with a \(1\,\times \,1\) kernel and parameters \(W_{p}\in \mathbb {R}^{1\,\times \,1\,\times \,i\,\times \,o}\), traverses the feature maps resulting from the first operation. Comparing the normal convolution \(X*W\) with the separable convolution \(\left( X\circledast W_{d}\right) *W_{p}\), we immediately notice that the former requires \(k_{1}k_{2}io\) parameters whereas the latter only needs \(k_{1}k_{2}i+io\). Figure 1 provides a graphical representation of these operations. If \(X^{\left( l\right) }\) is the input for an operation in layer \(l+1\), e.g. a convolution, then we represent each channel \(X_{\cdot ,\cdot ,i}^{\left( l\right) }\) by a circle. Arrows represent a spatial convolution parameterized by some parameters indicated by a character (in our example characters a to i). We clearly see that the depthwise convolution within the depthwise separable convolution operates on channels separately, whereas normal convolutions operate across channels.

Fig. 1. Comparison of a standard convolution to a separable convolution. The separable convolution first applies a spatial convolution for each channel separately. Afterwards, a convolution with a \(1\,\times \,1\) kernel is applied. Circles represent one channel of the feature map in the network, arrows a spatial convolution.
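The parameter comparison above is easy to verify numerically. The following minimal sketch (ours, not part of the original paper) computes both counts for an example layer configuration that we chose for illustration:

```python
def conv_params(k1, k2, i, o):
    """Parameters of a standard k1 x k2 convolution with i input and o output channels."""
    return k1 * k2 * i * o

def separable_conv_params(k1, k2, i, o):
    """Depthwise part (k1 * k2 * i) plus pointwise part (i * o)."""
    return k1 * k2 * i + i * o

# Assumed example: 3x3 kernel, 128 input and 128 output channels.
print(conv_params(3, 3, 128, 128))            # 147456
print(separable_conv_params(3, 3, 128, 128))  # 17536
```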

3.2 Layer Widening

Assume the teacher network f contains a convolutional layer with a \(k_{1}\,\times \,k_{2}\) kernel which is represented by a tensor \(W^{\left( l\right) }\in \mathbb {R}^{k_{1}\,\times \,k_{2}\times i\,\times \,o}\) where i is the number of input feature maps and o is the number of output feature maps or filters. Widening this layer means that we increase the number of filters to \(o'>o\). Chen et al. [4] proposed to extend \(W^{\left( l\right) }\) by replicating the parameters along the last axis at random. This means the widened layer of the student network uses the parameters

$$\begin{aligned} V_{\cdot ,\cdot ,\cdot ,j}^{\left( l\right) }= {\left\{ \begin{array}{ll} W_{\cdot ,\cdot ,\cdot ,j}^{\left( l\right) } &{} j\le o\\ W_{\cdot ,\cdot ,\cdot ,r}^{\left( l\right) } &{} r\text { uniformly sampled from }\left\{ 1,\ldots ,o\right\} \end{array}\right. }. \end{aligned}$$
(2)

In order to achieve the function-preserving property, the replication of filters needs to be taken into account in the next layer \(V^{\left( l+1\right) }\). This is achieved by dividing the parameters \(W_{\cdot ,\cdot ,j,\cdot }^{\left( l+1\right) }\) by the number of copies of the j-th filter in the widened layer. If \(n_{j}\) is this number of copies (the original filter plus its replications), the weights of the next layer for the student network are defined by

$$\begin{aligned} V_{\cdot ,\cdot ,j,\cdot }^{\left( l+1\right) }=\frac{1}{n_{j}}W_{\cdot ,\cdot ,j,\cdot }^{\left( l+1\right) }. \end{aligned}$$
(3)
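The sketch below is our own minimal numpy illustration of Eqs. (2) and (3), not code from the paper. To keep it short, both layers are reduced to \(1\,\times \,1\) convolutions so that every spatial position becomes a matrix product; the variable names and the small example sizes are our choices.

```python
import numpy as np

rng = np.random.default_rng(0)
i, o, o_new, p = 4, 6, 9, 5            # input channels, filters, widened filters, next-layer filters
W1 = rng.normal(size=(i, o))           # layer l   (1x1 kernel, spatial axes dropped)
W2 = rng.normal(size=(o, p))           # layer l+1

# Eq. (2): replicate filters (columns of W1) chosen uniformly at random until o_new is reached.
mapping = np.concatenate([np.arange(o), rng.integers(0, o, size=o_new - o)])
V1 = W1[:, mapping]

# Eq. (3): divide the incoming weights of layer l+1 by n_j, the number of copies of filter j.
n = np.bincount(mapping, minlength=o)
V2 = W2[mapping, :] / n[mapping, None]

relu = lambda z: np.maximum(z, 0.0)
x = rng.normal(size=(10, i))           # ten "pixels" as rows
assert np.allclose(relu(relu(x @ W1) @ W2),    # teacher
                   relu(relu(x @ V1) @ V2))    # widened student: function-preserving
```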

We extended this mechanism to depthwise separable convolutional layers. A depthwise separable convolutional layer at depth l is widened as visualized in Fig. 2a. The pointwise convolution for the student is estimated according to Eq. 2. This results in replicated output feature maps, indicated by two green colored circles in the figure. The depthwise convolution is identical to the one of the teacher network, i.e. the operations with parameters a and b. Independently of whether we used a depthwise separable or normal convolution in layer l, widening it requires adaptations in a following depthwise separable convolutional layer as visualized in Fig. 2a. The parameters of the depthwise convolution are replicated according to the replication of parameters in the previous layer, similar to Eq. 2. In our example we replicated the operation with parameters f in the previous layer; therefore, we now have replicated spatial convolutions with parameters h. Furthermore, the parameters of the pointwise convolution (in the example parameterized by i, j, k and l) depend on the replications in the previous layers analogously to Eq. 3. In our example we did not replicate the blue feature map, so the weights for this channel remain unchanged. However, we duplicated the green feature map, which is transformed into the purple feature map by the depthwise convolution. Taking into account that this channel now contributes twice to the pointwise convolution, all corresponding weights (in the example k and l) are divided by two.

Widening a separable layer that is followed by another separable layer is the most complicated case. The other cases can be derived by dropping the depthwise convolutions from Fig. 2a.

Fig. 2. Visualization of different function-preserving operations. Same colored circles represent identical feature maps. Circles without filling can have any value and are not important for the visualization. Activation functions are omitted to avoid clutter. (Color figure online)

3.3 Layer Deepening

Chen et al. [4] proposed a way to deepen a network by inserting an additional convolutional or fully connected layer. We complete this definition by extending it to depthwise separable convolutions.

A layer can be considered a function which takes as input the output of the previous layer and provides the input for the next layer. A simple function-preserving operation is to set the weights of a new layer such that the input of the layer is equal to its output. If we assume i incoming channels and an odd kernel height and width for the new convolutional layer, we achieve this by setting the weights of the layer with a \(k_{1}\,\times \,k_{2}\) kernel to the identity matrix:

$$\begin{aligned} V_{j,h}^{\left( l\right) }={\left\{ \begin{array}{ll} I_{i,i} &{} j=\frac{k_{1}+1}{2}\wedge h=\frac{k_{2}+1}{2}\\ \mathbf {0} &{} \text {otherwise} \end{array}\right. }. \end{aligned}$$
(4)

This operation is function-preserving and the number of filters is equal to the number of input channels. More filters can be added by layer widening; however, it is not possible to use fewer than i filters for the new layer. Another restriction is that this operation is only possible for activation functions \(\sigma \) with

$$\begin{aligned} \sigma \left( \mathbf {x}\right) =\sigma \left( I\sigma \left( \mathbf {x}\right) \right) \ \forall \mathbf {x}. \end{aligned}$$
(5)

The ReLU activation function \(\text {ReLU}\left( \mathbf {x}\right) =\max \left\{ \mathbf {x},\mathbf {0}\right\} \) fulfills this requirement.
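As an illustration, the sketch below builds the identity-initialized kernel of Eq. (4) and checks with a naive convolution that the new layer copies a non-negative (post-ReLU) input. This is our own minimal example, not the authors' implementation; `conv2d_same` is a deliberately simple reference convolution.

```python
import numpy as np

def conv2d_same(x, W):
    """Naive 'same'-padded convolution; x: (h, w, c_in), W: (k1, k2, c_in, c_out)."""
    k1, k2, c_in, c_out = W.shape
    p1, p2 = k1 // 2, k2 // 2
    xp = np.pad(x, ((p1, p1), (p2, p2), (0, 0)))
    h, w, _ = x.shape
    y = np.zeros((h, w, c_out))
    for r in range(h):
        for c in range(w):
            y[r, c] = np.einsum('abi,abio->o', xp[r:r + k1, c:c + k2, :], W)
    return y

# Eq. (4): identity-initialized k1 x k2 kernel for a new layer with i inputs and i outputs.
i, k1, k2 = 3, 3, 3
V = np.zeros((k1, k2, i, i))
V[k1 // 2, k2 // 2] = np.eye(i)                 # identity at the spatial centre, zeros elsewhere

rng = np.random.default_rng(0)
x = np.maximum(rng.normal(size=(8, 8, i)), 0)   # a non-negative (post-ReLU) feature map
assert np.allclose(conv2d_same(x, V), x)        # the new layer simply copies its input
```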

We extend this operation to depthwise separable convolutions and visualize it in Fig. 2b. The parameters of the pointwise convolution \(V_{p}\) are initialized analogously to Eq. 4 and the depthwise convolution \(V_{d}\) is set to one:

$$\begin{aligned} V_{p}&=I_{i,i}\end{aligned}$$
(6)
$$\begin{aligned} V_{d}&=\mathbf {1}. \end{aligned}$$
(7)

As we see in Fig. 2b, this initialization ensures that both the depthwise and the pointwise convolution just copy the input. New layers can be inserted at arbitrary positions, with one exception: under certain conditions an insertion right after the input layer is not function-preserving. For example, if a ReLU activation is used, there exists no identity function for inputs with negative entries.

3.4 Kernel Widening

Increasing the kernel size of a convolutional layer is achieved by padding the weight tensor with zeros until it matches the desired size. The same idea can be applied to increase the kernel size of a depthwise separable convolution by padding the depthwise convolution with zeros.
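A minimal sketch of this zero-padding, with shapes chosen by us for illustration: the added taps are zero and therefore contribute nothing, so with unchanged 'same' input padding the layer computes the same function.

```python
import numpy as np

rng = np.random.default_rng(0)
W3 = rng.normal(size=(3, 3, 16, 32))     # existing 3x3 kernel: 16 input, 32 output channels
# Grow the kernel to 5x5 by zero-padding the two spatial axes only.
W5 = np.pad(W3, ((1, 1), (1, 1), (0, 0), (0, 0)))
assert W5.shape == (5, 5, 16, 32)
assert np.allclose(W5[1:4, 1:4], W3)     # the original taps sit in the centre, the rest is zero
```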

Fig. 3. Visualization of different function-preserving operations. Same colored circles represent identical feature maps. Circles without filling can have any value and are not important for the visualization. Activation functions are omitted to avoid clutter. (Color figure online)

3.5 Insert Skip Connections

Many modern neural network architectures rely on skip connections [8]. The idea is to add the output of the current layer to the output of a previous layer. One simple example is

$$\begin{aligned} X^{\left( l+1\right) }=\sigma \left( X^{\left( l\right) }*V^{\left( l+1\right) }+X^{\left( l\right) }\right) \!. \end{aligned}$$
(8)

Therefore, we propose a function-preserving operation which allows inserting skip connections. We add one or more layers and initialize them such that their output is 0 independent of the input. This allows adding a skip connection because adding zero to the output of the previous layer is an identity operation. We visualize a simple example in Fig. 3a based on Eq. 8. A new operation is added with its parameters set to zero, \(V^{\left( l+1\right) }=\mathbf {0}\), achieving a zero output. Now, adding this output to the input is an identity operation.
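The zero-initialization argument can be checked in a few lines. The sketch below is ours, with the convolution again reduced to a \(1\,\times \,1\) kernel, and instantiates Eq. (8) with \(V^{\left( l+1\right) }=\mathbf {0}\).

```python
import numpy as np

rng = np.random.default_rng(0)
i = 8
x = rng.normal(size=(10, i))        # X^(l): ten spatial positions, 1x1-conv view
V = np.zeros((i, i))                # newly inserted convolution, initialised to zero
relu = lambda z: np.maximum(z, 0.0)

# Eq. (8): sigma(X * V + X). With V = 0 the convolution branch vanishes,
# so the block with the skip connection still computes sigma(X).
assert np.allclose(relu(x @ V + x), relu(x))
```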

3.6 Branch Layers

We also propose to branch layers. A convolutional layer \(X^{\left( l\right) }*W^{\left( l+1\right) }\) can be reformulated as

$$\begin{aligned} \text {merge}\left( X^{\left( l\right) }*V_{1}^{\left( l+1\right) },\ X^{\left( l\right) }*V_{2}^{\left( l+1\right) }\right) \!, \end{aligned}$$
(9)

where merge concatenates the resulting outputs. The student network’s parameters are defined as

$$\begin{aligned} V_{1}^{\left( l+1\right) }&=W_{\cdot ,\cdot ,\cdot ,1:\left\lfloor o/2\right\rfloor }^{\left( l+1\right) }\\ V_{2}^{\left( l+1\right) }&=W_{\cdot ,\cdot ,\cdot ,\left( \left\lfloor o/2\right\rfloor +1\right) :o}^{\left( l+1\right) }. \end{aligned}$$

This operation is not only function-preserving, it also does not add any further parameters; in fact it is the very same operation. However, combining it with other function-preserving operations allows extending networks with parallel convolutional operations or adding new convolutional layers with smaller filter sizes. In Fig. 3b we demonstrate how to achieve this. The colored layer is first branched and then a new convolutional layer is added to the left branch. In contrast to only adding a new layer as described in Sect. 3.3, the new layer has only two output channels instead of three.
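Below is a small numeric check of the branching identity in Eq. (9), again written by us with \(1\,\times \,1\) convolutions for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)
i, o = 8, 10
W = rng.normal(size=(i, o))                      # weights of the layer to be branched
x = rng.normal(size=(10, i))                     # 1x1-conv view of X^(l)

# Eq. (9): split the o filters into two branches and concatenate the branch outputs.
V1, V2 = W[:, : o // 2], W[:, o // 2:]
branched = np.concatenate([x @ V1, x @ V2], axis=-1)
assert np.allclose(branched, x @ W)              # the very same operation, written as two branches
```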

3.7 Multiple In- or Outputs

All the presented operations remain possible for networks in which a layer receives inputs from several layers or provides its output to several layers. In that case only the affected weights need to be adapted according to the aforementioned equations.

4 Evolution of Neuro-Cells

The basic idea of our proposed cell-based neuro-evolution is the following. Given is a very simple neural network architecture which contains multiple neuro-cells (see Fig. 4). The cells themselves share their structure, and the task is to find a structure that improves the overall neural network architecture for a given data set and machine learning task. In the beginning, a cell is identical to a convolutional layer and is changed during the evolutionary optimization process. Our evolutionary algorithm uses tournament selection to select an individual from the population: a fraction k of individuals is selected from the population at random, and from this set the individual with the highest fitness is selected for mutation. We define the fitness as the accuracy achieved by the individual on a hold-out data set. A mutation is selected at random and applied to all neuro-cells such that they remain identical. The network is trained for some epochs on the training set and is then added to the population. Finally, the process starts all over again. After meeting some stopping criterion, the individual with the highest fitness is returned.

Fig. 4. Neural network template as used in our experiments.
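The control flow just described can be summarized in a short Python sketch. It is our own paraphrase, not the authors' code: `train` and `fitness` are stubs standing in for training a network for a few epochs and for the hold-out accuracy, and the single dummy mutation only keeps the loop runnable.

```python
import copy
import random

def train(net, epochs=15):          # stub: would run SGD for a few epochs
    pass

def fitness(net):                   # stub: would return hold-out accuracy
    return net.get("fitness", 0.0)

def dummy_mutation(net):            # stand-in for a function-preserving mutation (Sect. 4.1)
    net["fitness"] = net.get("fitness", 0.0) + 0.01 * random.random()
    return net

MUTATIONS = [dummy_mutation]

def evolve(initial_net, generations=50, tournament_frac=0.15, init_size=16):
    population = [initial_net]
    # Initialization step: 15 mutated copies of the single starting individual.
    while len(population) < init_size:
        child = random.choice(MUTATIONS)(copy.deepcopy(initial_net))
        train(child)
        population.append(child)
    for _ in range(generations):
        # Tournament selection: fittest member of a random 15% subset (at least two).
        k = max(2, int(tournament_frac * len(population)))
        parent = max(random.sample(population, k), key=fitness)
        child = random.choice(MUTATIONS)(copy.deepcopy(parent))   # applied to all neuro-cells
        train(child)
        population.append(child)
    return max(population, key=fitness)

best = evolve({"fitness": 0.5})
```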

4.1 Mutations

All mutations used are based on the function-preserving operations introduced in the last section. This means that a mutation does not change the fitness of an individual; however, it increases its complexity. The advantage over creating the same network structure with randomly initialized weights is obviously that we start with a partially pretrained network. This enables us to train the network in fewer epochs. Unless otherwise mentioned, all mutations are applied only to the structure within a neuro-cell. Our neuro-evolutionary algorithm considers the following mutations.

Insert Convolution. A convolution is added at a random position. Its kernel size is \(3\,\times \,3\) and the number of filters is equal to its input dimension. It is randomly decided whether a separable convolution is used instead.

Branch and Insert Convolution. A convolution is selected at random and branched according to Sect. 3.6. A new convolution is added according to the “Insert Convolution” mutation in one of the branches. For an example see Fig. 3b.

Insert Skip. A convolution is selected at random. Its output is added to the output of a newly added convolution (see “Insert Convolution”) and is the input for the following layers. For an example see Fig. 3a.

Alter Number of Filters. A convolution is selected at random and widened by a factor sampled uniformly at random from \(\left[ 1.2,2\right] \). This mutation might also be applied to convolutions outside of a neuro-cell.

Alter Number of Units. Similar to the previous one but alters the number of units of fully connected layers. This mutation is only applied outside the neuro-cells.

Alter Kernel Size. Selects a convolution at random and increases its kernel size by two along each axis.

Branch Convolution. Selects a convolution at random and branches it according to Sect. 3.6.

The motivation for selecting this set of mutations is to enable the neuro-evolutionary algorithm to discover architectures similar to those proposed by human experts. Adding convolutions allows reaching popular architectures such as VGG16 [21]; combining the addition of skips and convolutions allows discovering residual networks [8]. Finally, the combination of branching, changing kernel sizes and adding (separable) convolutions allows discovering architectures similar to Inception [25], Xception [5] or FractalNet [13].

The optimization starts with only a single individual. We enrich the population with an initialization step which creates 15 mutated versions of this first individual. Then, individuals are selected via the previously described tournament selection process.

5 Experiments

In the experimental section we run our proposed method for the task of image classification on the two data sets CIFAR-10 and CIFAR-100. We conduct the following experiments. First, we analyze the performance of our neuro-evolutionary approach with respect to classification error and compare it to various competitor approaches. We show that we achieve a significant search time improvement at the cost of a slightly larger error. Furthermore, we give insights into how the evolution and the neuro-cells progress and develop during the optimization process. Additionally, we discuss the possibility of transferring detected cells to novel data sets. Finally, we compare the performance of two different random approaches to ours in order to demonstrate our method's benefit.

5.1 Experimental Setup

The network template used in our experiments is sketched in Fig. 4. It starts with a small convolution, followed twice by a neuro-cell and a max pooling layer. Then, another neuro-cell is added, followed by a larger convolution, a fully connected layer and the final softmax layer. Each max pooling layer has a stride of two and is followed by a drop-out layer with drop-out rate 70%. The fully connected layer is followed by a drop-out layer with rate 50%. In this section, whenever we sketch or mention a convolutional layer, we actually mean a convolutional layer followed by batch normalization [11] and a ReLU activation. The neuro-cell is initialized with a single convolution with 128 filters and a kernel size of \(3\,\times \,3\). A weight decay of 0.0001 is used.

We evaluate our method and compare it to competitor methods on CIFAR-10 and CIFAR-100 [12]. We use standard preprocessing and data augmentation. All images are preprocessed by subtracting from each channel its mean and dividing it by its standard deviation. The data augmentation involves padding the image to size \(40\,\times \,40\), randomly cropping it back to dimension \(32\,\times \,32\) and flipping images horizontally at random. We split the official training partitions into a partition which we use to train the networks and a hold-out partition to evaluate the fitness of the individuals.
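For concreteness, a minimal numpy sketch of this preprocessing and augmentation (our own, assuming images of shape 32 x 32 x 3 and the usual random crop position):

```python
import numpy as np

def standardize(images):
    """Subtract the per-channel mean and divide by the per-channel standard deviation."""
    mean = images.mean(axis=(0, 1, 2), keepdims=True)
    std = images.std(axis=(0, 1, 2), keepdims=True)
    return (images - mean) / std

def augment(img, rng):
    """Pad a 32x32x3 image to 40x40, take a random 32x32 crop, flip horizontally at random."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))
    r, c = rng.integers(0, 9, size=2)            # 40 - 32 + 1 = 9 possible offsets per axis
    crop = padded[r:r + 32, c:c + 32]
    return crop[:, ::-1] if rng.random() < 0.5 else crop
```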

For the neuro-evolutionary algorithm we select a tournament size equal to 15% of the population, but at least two. The initial network is trained for 63 epochs; every subsequent network is trained for 15 epochs with Nesterov momentum and a cosine learning rate schedule with initial learning rate 0.05, \(T_{0}=1\) and \(T_{\text {mul}}=2\) [16]. We define the fitness of an individual as the accuracy of the corresponding network on the hold-out partition. After the search budget is exhausted, the individual with the highest fitness is trained on the full training split until convergence using CutOut [6]. Finally, the error on the test set is reported.
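For reference, the sketch below reproduces the cosine schedule with warm restarts from [16] for these settings; writing it as a closed form and assuming a minimum learning rate of zero are our choices.

```python
import math

def sgdr_lr(epoch, lr_max=0.05, lr_min=0.0, T0=1, T_mul=2):
    """Cosine annealing with warm restarts: restart cycles of length T0, T0*T_mul, ..."""
    t, T_i = epoch, T0
    while t >= T_i:                  # locate the current restart cycle
        t -= T_i
        T_i *= T_mul
    return lr_min + 0.5 * (lr_max - lr_min) * (1.0 + math.cos(math.pi * t / T_i))

lrs = [round(sgdr_lr(e), 4) for e in range(15)]   # schedule over the 15 epochs per individual
```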

Table 1. Classification error on CIFAR-10 and CIFAR-100 including the spent search time in GPU days. The first block presents the performance of state-of-the-art human-designed architectures. The second block contains results of various automated architecture search methods based on reinforcement learning. The third block contains results for automated methods based on evolutionary algorithms. The final block presents our results. For our method, we report the mean classification error and number of parameters over five repetitions as well as the best run and the run with the fewest network parameters.

5.2 Search for Networks

In Table 1 we report the mean and standard deviation of our approach across five runs and compare it to other approaches.

The first block contains several architectures proposed by human experts. DenseNet [9] is clearly the best among them, reaching an error of 4.51% with only 800 thousand parameters. Using about 25 million parameters, the error decreases to 3.42%.

The second block contains several architecture search methods based on reinforcement learning. Most of them are able to find very competitive networks but at the cost of very high search times. NASNet [32] finds the best-performing network, which is on par with DenseNet but requires fewer parameters. However, the authors report that they required about 5.5 GPU years to reach this performance. Efficient Architecture Search [3] still achieves an error of 4.23% but reduces the search time drastically to ten days.

The third block contains various automated approaches based on evolutionary methods. Hierarchical Evolution [15] finds the best performing architecture among them in 300 GPU days. Methodologically, our approach also belongs in this category. We want to highlight in particular the search time required by our proposed method. Within only 12 and 24 h, respectively, a network architecture is found which gives better predictions than most competitors and is very close to the best methods. After 12 h of search, we report a mean classification error over five repetitions of \(4.02\pm 0.376\) and \(23.92\pm 2.493\) on CIFAR-10 and CIFAR-100, respectively. Extending the search by another 12 h, the error reduces to \(3.89\pm 0.231\) and \(22.32\pm 0.429\).

Fig. 5. Evolutionary algorithm over time. Each dot represents an individual, connections represent the ancestry. After the initialization, the algorithm quickly focuses on ancestors from only one initial individual. (Color figure online)

In order to give insights into the optimization process, we visualize one run on CIFAR-10 in Figs. 5 and 6. Figure 5 visualizes the fitness of each individual as well as its ancestry by means of a phylogenetic tree [28]. The x-axis represents the time, the y-axis has no meaning. The color indicates the fitness, dots represent individuals and the ancestry is represented by edges. We notice that within the first 10 h the fitness increases quickly. Afterwards, progress is slow but steady. Figure 6 shows, in parallel, which stages the final neuro-cell underwent. Over time the cell develops multiple computation branches and finally adds some skip connections. Notice that branching the \(7\,\times \,7\) convolution, as first shown at hour 19, serves no purpose by itself. However, this might have changed in a longer run if, e.g., another layer had been added in one of these branches.

Fig. 6. Evolutionary process of the best neuro-cell found during one run on CIFAR-10. Some intermediate states are skipped.

Fig. 7. Expanded template for the neuro-cell transferability experiment.

5.3 Neuro-Cell Transferability

An interesting aspect is whether a neuro-cell detected on one data set can be reused in a different architecture and for a different data set. For this reason, we expanded the template from Fig. 4 by duplicating the number of cells, resulting in the template shown in Fig. 7. We took the cells and other hyperparameters detected in our 12 h CIFAR-10 experiment and used the resulting networks for image classification on CIFAR-100. These models achieve an average error of 24.77% with a standard deviation of 1.61%. This result is not as good as the one achieved by searching for the best architecture on CIFAR-100 directly, but in return no new search is required for the new data set.

5.4 Random Search

In this section we will discuss the importance of our evolutionary approach by comparing it to two random network searches.

Comparison to Random Individual Selection. Random individual selection is not an entirely independent baseline because it is a special case of our proposed method with a tournament size of one. For this experiment, we select a random individual from the population instead of selecting the best individual of a random population subset. With this small change, we run our algorithm five times for twelve hours. We report a mean classification error of 4.55% with standard deviation 0.34%. Note that the best of these runs achieved an error of 4.04%, which is still worse than the mean error achieved when using larger tournament sizes. Thus, we can confirm that tournament selection provides better results than random selection.

Comparison to Random Mutations. We conduct another experiment where we apply k mutations to the initial individual. In practice, k depends on the data set and is not known in advance; thus, this method is not really applicable. However, for this experiment, we set k to the number of mutations used for the best cell in our 12 h experiment. In comparison to random individual selection, this method further increases the error to 4.73% on average over five repetitions with a standard deviation of 0.63%.

6 Conclusions

We proposed a novel approach which optimizes the neural network architecture based on an evolutionary algorithm. It requires as input a simple template containing neuro-cells, i.e. replicated architecture patterns, and automatically keeps improving this initial architecture. The mutations of our evolutionary algorithm are based on function-preserving operations which change the network’s architecture without changing its predictions. This enables shorter training times in comparison to a random initialization. In comparison to the state of the art, we report very competitive classification results and outstanding search times. Our approach is up to 50,000 times faster than some of the competitor methods with an error rate at most 0.6% higher than the best competitor on CIFAR-10.