1 Introduction

In recent years, deep learning has established itself as a real game changer in many domains, such as speech recognition [1], natural language processing [2] and computer vision [3]. The real power of deep convolutional neural networks lies in their ability to directly process high-dimensional raw data and produce meaningful high-level representations regardless of the specific task at hand [4]. So far, computer vision has been one of the fields that has benefited the most, and tasks such as object recognition and object detection have seen significant advances in accuracy (see the ImageNet LSVRC [5] and PascalVOC [6] benchmarks). However, most of these recent successes are based on benchmarks providing an enormous quantity of data with fixed training and test sets. This is not always the case in real-world applications, where training data are often only partially available at the beginning and new data keep arriving while the system is already deployed and working, as in recommendation systems and anomaly detection, where learning sudden behavioral changes becomes quite critical.

One possible approach to deal with this incremental scenario is to store all previously seen data and retrain the model from scratch as soon as a new batch of data becomes available (in the following we refer to this as the cumulative approach). However, this solution is often impractical for many real-world systems, where memory and computational resources are subject to strict constraints. As a matter of fact, training convolutional neural networks is expensive and can take days or weeks on large datasets, even on modern GPUs [3]. A different approach is to update the model based only on the newly available batch of data. This is computationally cheaper and does not require storing past data. Of course, nothing comes for free, and this approach may lead to a substantial loss of accuracy with respect to the cumulative strategy. Moreover, the stability-plasticity dilemma arises, and dangerous shifts of the model are always possible because of catastrophic forgetting [7–10]. Forgetting previously learned patterns can be conveniently seen as overfitting the newly available training data. Nevertheless, in the incremental scenario the issue is much more puzzling than overfitting a single, fixed-size training set, since incremental batches are usually smaller, highly variable and class biased. It is also worth noting that this setting is different from, and much more challenging than, transfer learning, where the previously used dataset and task are no longer of interest and forgetting previously learned patterns is not a concern.

In this work we compare and evaluate different incremental learning strategies for CNN-based architectures, targeting real-world applications. The aim is to understand how to minimize the impact of forgetting, thus reducing as much as possible the gap with respect to the cumulative strategy.

In Sect. 2, different incremental tuning strategies for CNNs are introduced. In Sect. 3, we provide details of the two (very different) real-world datasets used for the comparative evaluation, while in Sect. 4 we describe the experiments carried out together with their results. Finally, in Sect. 5, some conclusions are drawn.

2 Incremental Learning Strategies

In this section we present different incremental tuning strategies employing CNNs, which are sufficiently general to be applied to any dataset with a number of possible variations. Then, we illustrate the specific implementations used in our experiments on the two selected datasets.

The different possibilities we explored to deal with an incremental tuning/learning scenario can be conveniently framed in three main strategies:

  1. Training/tuning an ad hoc CNN architecture suitable for the problem.

  2. Using an already trained CNN as a fixed feature extractor in conjunction with an incremental classifier.

  3. Fine-tuning an already trained CNN.

Of course, all the above strategies can also be used outside the context of incremental learning, but very little is known about their applicability and relevance in this setting. In our experiments we tested three instantiations of the aforementioned strategies:

  • LeNet7: consists of the classical “LeNet7” proposed by Yann LeCun in 2004 [11]. Its architecture is based on seven layers (far fewer than current state-of-the-art CNNs designed for large-scale datasets). However, it has been successfully applied to many object recognition datasets (NORB, COIL, CIFAR, etc.) with color or grayscale images of size varying from 32 × 32 to 96 × 96, and it is still competitive on low/medium-scale problems.

  • CaffeNet + SVM: In this strategy we employ a pre-trained CNN provided in the Caffe library [12] “Model Zoo”, the BVLC Reference CaffeNet, which is based on the well-known “AlexNet” architecture proposed in [3] and trained on ImageNet. This model is used off-the-shelf to extract high-level features from the second-to-last hidden layer, following the strategy proposed in [13–15]. Then a linear, incrementally trained SVM is used (instead of the native soft-max output layer) for the final classification (see the sketch after this list).

  • CaffeNet + FT: In this case too, the BVLC Reference CaffeNet is employed; however, instead of using it as a fixed feature extractor, the network is fine-tuned to suit the new task. Although, when fine-tuning, it is generally recommended to use a different learning rate for the last layer (which is re-initialized to match the new number of output neurons) than for the other layers, we found no significant difference during our exploratory analysis and therefore kept the hyper-parametrization as homogeneous as possible (a sketch of this setup is given below, after the description of the VGG_Face variants).
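Below is a minimal sketch of the fixed-feature-extractor strategy. It uses torchvision's AlexNet as a stand-in for the BVLC Reference CaffeNet of our experiments (which were run in Caffe) and scikit-learn's SGDClassifier with a hinge loss as the incremental linear SVM; layer choices and hyper-parameters are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import SGDClassifier

# Pre-trained CNN used as a fixed feature extractor (AlexNet as a stand-in
# for the BVLC Reference CaffeNet); only the classifier on top is updated.
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.eval()
# Keep everything up to the second-to-last hidden layer (fc7-like features).
feature_extractor = torch.nn.Sequential(
    cnn.features, cnn.avgpool, torch.nn.Flatten(),
    *list(cnn.classifier.children())[:-1])

# Linear SVM (hinge loss) trained incrementally via partial_fit.
svm = SGDClassifier(loss="hinge", alpha=1e-4)

def update_on_batch(images, labels, classes):
    """Extract fixed CNN features for one incremental batch and update the SVM."""
    with torch.no_grad():
        feats = feature_extractor(images).numpy()
    # `classes` must list all possible labels on the first call to partial_fit.
    svm.partial_fit(feats, labels, classes=classes)
```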

Furthermore, for the BigBrother dataset (see the following section) we decided to test an additional pair of strategies: VGG_Face + SVM and VGG_Face + FT. They are identical to CaffeNet + SVM and CaffeNet + FT, respectively, except for the pre-trained model used. VGG_Face is a very deep architecture (16 layers) that has been trained directly on a very large dataset of faces (2,622 subjects and 2.6 M images) [17].
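The fine-tuning strategies (CaffeNet + FT and VGG_Face + FT) can be sketched as follows; again, a torchvision model stands in for the Caffe models actually used, and the learning rate value is only a placeholder. The key points are the re-initialization of the output layer to the new number of classes and the single, homogeneous learning rate for all layers.

```python
import torch
import torchvision.models as models

def make_finetune_setup(num_classes, lr=1e-3):
    """Load a pre-trained network, re-initialize its output layer for the new
    task, and prepare an optimizer with one homogeneous learning rate."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    in_features = net.classifier[-1].in_features
    # Last layer re-initialized to match the new number of output neurons.
    net.classifier[-1] = torch.nn.Linear(in_features, num_classes)
    # A single parameter group: the same learning rate for old and new layers.
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    return net, optimizer
```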

3 Datasets

In the literature, there are few labeled image datasets suitable for incremental learning. In particular, we are interested in datasets where the objects of interest have been acquired in a number of successive sessions and the environmental conditions can change between sessions.

We focused on two application fields where incremental learning is very relevant: robotics and biometrics. The two datasets selected, iCubWorld28 [18] and BigBrother [19], are briefly described below.

3.1 iCubWorld28

The iCubWorld28 dataset [18] consists of 28 distinct domestic objects evenly organized into 7 categories (see Fig. 1). Images are 128 × 128 pixels in RGB format. The acquisition session for a single object consists of a video recording of about 20 s in which the object is slowly moved/rotated in front of the camera. Each acquisition session yields about 200 training and 200 test images for each of the 28 objects. Since the dataset was designed to assess the incremental learning performance of the iCub robot's visual recognition subsystem, the same acquisition procedure was repeated on 4 consecutive days, ending up with four subsets (Day 1 to Day 4) of around 8 K images each (39,693 in total).

Fig. 1. Example images of the 28 objects (7 categories) from one of the 4 subsets constituting iCubWorld28.

To better assess the capabilities of our incremental learning strategies, we split each training set of Day 1, 2 and 3 into three parts of equal size (a possible splitting procedure is sketched after Table 1). Day 4, on the contrary, was left unchanged and entirely used as the test set (as in [18]). In Table 1, we report the full details about the size of the training and test sets used in our experiments.

Table 1. iCubWorld28 batch sizes and their membership in the original Days.
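A possible way to split each Day's training set into three equally sized incremental batches is sketched below; the use of a random permutation is an illustrative assumption, as the exact split procedure is not detailed here.

```python
import numpy as np

def split_day_into_batches(images, labels, parts=3, seed=0):
    """Split one Day's training set into `parts` equally sized incremental batches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))   # assumption: random assignment to parts
    chunks = np.array_split(order, parts)
    return [(images[idx], labels[idx]) for idx in chunks]
```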

3.2 BigBrother

The BigBrother dataset [19] was created from 2 DVDs made commercially available at the end of the 2006 edition of the “Big Brother” reality show produced for Italian TV, documenting the 99-day stay of 20 participants in a closed environment. It consists of 23,842 gray-scale face images of size 70 × 70, belonging to 19 subjects (one participant was eliminated right at the beginning of the show). In addition to the typical training and test sets, an additional large set of images (called the “updating set”) is provided for incremental learning/tuning purposes. Details about the composition of each set can be found in [19], together with the number of days each person lived in the house. However, some subjects lived in the house only for a short period, and too few images of them are available for an in-depth evaluation. For this reason, a subset of the whole database, referred to as SETB, was defined by the authors of [19]. It includes the images of the 7 subjects who lived in the house for a longer period (this number of users seems realistic for a home environment application).

In this work, we compare our incremental tuning strategies on the SETB of the BigBrother dataset, which comprises a total of 54 incremental batches. In Fig. 2, an example image for each of the seven subjects of SETB is shown. It is worth noting that the images were automatically extracted from the video frames by the Viola–Jones detector [20] and are often characterized by bad lighting, poor focus, occlusions and non-frontal poses.
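For illustration only, the following sketch shows how face crops of this kind could be obtained with OpenCV's Haar-cascade implementation of the Viola–Jones detector; it is not the pipeline originally used to build the dataset, and the detection parameters are assumptions.

```python
import cv2

# OpenCV's bundled Haar cascade for frontal faces (Viola-Jones detector).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(frame_bgr, size=(70, 70)):
    """Detect faces in a video frame and return 70 x 70 gray-scale crops."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in boxes]
```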

Fig. 2. Example images of the seven subjects contained in the SETB of the BigBrother dataset.

4 Experiments and Results

For all the strategies employed in this work, we trained the models until full convergence on the first batch of data and then tuned them on the successive incremental batches, trying to balance the trade-off between accuracy gain and forgetting. This protocol fits the requirements of many real-world applications, where a reasonable initial accuracy is demanded and the first batch is large enough to reach it.

For the iCubWorld28 dataset, the three strategies were validated over 10 runs in which we randomly shuffled the order of the batches \( TrainB_{2}, \ldots, TrainB_{9} \).
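This evaluation protocol can be sketched as follows; `train_to_convergence`, `evaluate` and `tune_on_batch` (a possible implementation of the latter is given after the next paragraph) are hypothetical helpers, and `model_factory` is assumed to build a fresh model for each run.

```python
import random

def run_incremental_protocol(model_factory, train_batches, test_set, runs=10, seed=0):
    """train_batches = [TrainB_1, ..., TrainB_9]; returns one accuracy curve per run."""
    rng = random.Random(seed)
    curves = []
    for _ in range(runs):
        net = train_to_convergence(model_factory(), train_batches[0])  # first batch
        incremental = train_batches[1:]
        rng.shuffle(incremental)              # random order of TrainB_2 ... TrainB_9
        accuracies = [evaluate(net, test_set)]
        for batch in incremental:
            net = tune_on_batch(net, batch)   # early-stopped incremental tuning
            accuracies.append(evaluate(net, test_set))
        curves.append(accuracies)
    return curves
```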

To control forgetting during the incremental learning phase we relied on early stopping: for each batch, a fixed number of iterations was performed, depending on the specific strategy. For example, for LeNet7, trained with stochastic gradient descent (SGD), we chose a learning rate of 0.01, a mini-batch size of 100 and 50 iterations for each of the eight incremental batches. We found that tuning these hyper-parameters can have a significant impact on forgetting (see an example on the BigBrother dataset in Fig. 4).
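A minimal sketch of such an early-stopped update is given below, using the LeNet7 settings quoted above (learning rate 0.01, mini-batch size 100, 50 iterations); the PyTorch-style training loop is illustrative, since our experiments were run in Caffe.

```python
import torch
from torch.utils.data import DataLoader

def tune_on_batch(net, batch_dataset, lr=0.01, batch_size=100, iterations=50):
    """Update the model on one incremental batch with a fixed, small number of
    SGD iterations; the early stop is what limits forgetting."""
    loader = DataLoader(batch_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    net.train()
    done = 0
    while done < iterations:
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(net(images), labels).backward()
            optimizer.step()
            done += 1
            if done >= iterations:
                break
    return net
```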

To better understand the efficacy of the proposed incremental strategies and to quantify the impact of forgetting, we also compared each model against the corresponding cumulative strategy. In Fig. 3, the average accuracy over the ten runs is reported for each strategy. We note that:

  • CaffeNet + SVM shows a fairly good recognition rate increase along the 9 batches, moving from an accuracy of 41.63% to 66.97%. Its standard deviation is initially higher than that of the other strategies, but it rapidly decreases as new batches of data become available and the SVM model is updated. Furthermore, the small gap with respect to its cumulative counterpart shows that a fixed feature extractor favors stability and reduces forgetting.

  • CaffeNet + FT is the most effective strategy on this dataset. This is probably because the features originally learned on ImageNet are very general, and iCubWorld28 can be regarded as a specific sub-domain where fine-tuning the features helps classification. Moreover, even if splitting the dataset into 9 batches makes the task harder, we achieved an average accuracy of 78.40%, which outperforms the methods previously proposed for the same dataset [18]. Even though in this case the gap with respect to the cumulative approach is slightly larger, a proper adjustment of early stopping and learning rate during the incremental phase makes it possible to effectively control forgetting.

  • LeNet7, on this dataset, is probably unable to learn (given the limited number of training patterns) the complex invariant features needed to deal with the multi-axis rotations, partial occlusions and complex backgrounds that characterize this problem. The gap with respect to the cumulative approach is large here. This is in line with previous studies [7, 10] showing that smaller models without pre-training are much more susceptible to forgetting.

Fig. 3. iCubWorld28 dataset: average accuracy during incremental training (8 batches). The bars indicate the standard deviation over the ten runs performed for each strategy. The dotted lines denote the cumulative strategies.

For the SETB of the BigBrother dataset, in order to obtain reproducible and comparable results, we kept the order of the 54 updating batches fixed (i.e., no shuffling), as in the original dataset [19]. In Fig. 5, accuracy results are reported for each of the 5 tested strategies. It is worth pointing out that in this case the 54 incremental batches used to update the model vary widely in the number of patterns they contain: from a few dozen to many hundreds. This is typical of many real-world systems, where the assumption of collecting uniform and equally sized batches is often unrealistic.

Controlling forgetting was more complex here than for the iCubWorld28 dataset. In fact, due to the aforementioned high variation in the number of patterns across the incremental batches, we found that adapting the tuning strength to the batch size can lead to relevant improvements.

In Fig. 4, an example parameterization for the CaffeNet + FT strategy is reported, comparing the learning trend obtained with (i) a low learning rate, (ii) a high learning rate, and (iii) a learning rate adjusted according to the size of the batch. The results show that the adjustable learning rate performs best; therefore, in the rest of the experiments on the BigBrother dataset, an adjustable learning rate is used.
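The exact adjustment rule is not reproduced here; the function below is only an illustrative sketch of the idea of scaling the learning rate with the size of the incoming batch, and all constants are assumptions.

```python
def adjusted_learning_rate(batch_size, base_lr=1e-3,
                           reference_size=300, lr_min=1e-4, lr_max=1e-2):
    """Illustrative rule: smaller incremental batches get a lower learning rate
    (less forgetting), larger batches a higher one (more plasticity)."""
    lr = base_lr * batch_size / reference_size
    return max(lr_min, min(lr, lr_max))
```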

Fig. 4. Accuracy results of different parameterizations for the CaffeNet + FT strategy: an example of the impact of the learning rate on the BigBrother dataset in our incremental scenario (54 batches).

In Fig. 5, the accuracy on the BigBrother dataset is reported for all the strategies introduced in Sect. 2. For this dataset we note that:

  • Unlike in the previous experiment, here the LeNet7 model performs slightly better than CaffeNet + SVM and CaffeNet + FT. This is probably due to the highly specific features (and invariances) required for face recognition. Hence, learning the features from scratch for this dataset seems more appropriate than adapting general-purpose features by fine-tuning.

  • The previous observation is corroborated by the very good performance of the VGG_Face + SVM and VGG_Face + FT strategies. In fact, since the VGG_Face features have been learned on a face recognition task using a dataset containing millions of faces, they are quite effective for transfer learning within the same domain.

  • Since the features are already optimal for the domain, VGG_Face + SVM seems to be the best choice in terms of both accuracy and stability. It reaches an accuracy of 96.73%, which is 24.1% better than the accuracy reported in [21] for the same dataset (in the supervised learning scenario).

Fig. 5. BigBrother dataset (SETB): accuracy of the different strategies during incremental training (54 batches).

5 Conclusions

Incremental and on-line learning are still scarcely studied in the field of deep learning (especially for CNNs), yet they are essential for many real-world applications. In this work we explored different strategies to train convolutional neural networks incrementally. We recognize that the empirical evaluation carried out in this work is still limited, and further studies are necessary to better understand the advantages and weaknesses of each strategy. However, it seems that the lessons learned in classical transfer learning [13, 14, 22] hold here too:

  • Forgetting can be a very detrimental issue: hence, when possible (i.e., when transferring from the same domain), it is preferable to use the CNN as a fixed feature extractor to feed an incremental classifier. In general, this results in better stability and often in improved efficiency (tuning all CNN layers can be computationally expensive).

  • If the features are not already optimized (i.e., transfer learning from a different domain), tuning also the lower-level layers may be preferable, and the learning strength (i.e., learning rate, number of iterations, etc.) can be used to control forgetting.

  • Training a CNN from scratch can be advantageous if the problem patterns (and feature invariances) are highly specific and a sufficient number of samples is available.

In the near future, we plan to extend this work with a more extensive experimental evaluation, seeking a more principled way to control forgetting and to adapt the tuning parameters to the size (and bias) of each incremental batch. Finally, we are interested in studying real-world applications of semi-supervised incremental learning strategies for CNNs, with approaches similar to [23].