1 Introduction

In recent years, deep learning has established itself as a real game changer in many domains, such as speech recognition [1], natural language processing [2] and computer vision [3]. The real power of deep convolutional neural networks lies in their ability to directly process high-dimensional raw data and produce meaningful high-level representations regardless of the specific task at hand [4]. So far, computer vision has been one of the fields that has benefited the most, and tasks such as object recognition and object detection have seen significant advances in accuracy (see the ImageNet LSVRC [5] and PascalVOC [6] benchmarks). However, most of these recent successes are based on benchmarks providing an enormous quantity of data with fixed training and test sets. This is not always the case in real-world applications, where training data are often only partially available at the beginning and new data keep arriving while the system is already deployed and working, as in recommendation systems and anomaly detection, where learning sudden behavioral changes becomes quite critical.

One possible approach to deal with this incremental scenario is to store all previously seen data and retrain the model from scratch as soon as a new batch of data becomes available (in the following we refer to this as the cumulative approach). However, this solution is often impractical for many real-world systems, where memory and computational resources are subject to strict constraints. As a matter of fact, training convolutional neural networks is expensive and can take days or weeks on large datasets, even on modern GPUs [3]. A different approach is to update the model based only on the newly available batch of data. This is computationally cheaper and does not require storing past data. Of course, nothing comes for free, and this approach may lead to a substantial loss of accuracy with respect to the cumulative strategy. Moreover, the stability-plasticity dilemma arises, and dangerous shifts of the model are always possible because of catastrophic forgetting [7–10]. Forgetting previously learned patterns can be conveniently seen as overfitting the newly available training data. Nevertheless, in the incremental scenario the issue is much more puzzling than overfitting a single, fixed-size training set, since incremental batches are usually smaller, highly variable and class biased. It is also worth noting that this setting is different from, and much more challenging than, transfer learning, where the previously used dataset and task are no longer of interest and forgetting previously learned patterns is not a concern.

In this work we compare and evaluate different incremental learning strategies for CNN-based architectures, targeting real-world applications. The aim is to understand how to minimize the impact of forgetting, thus reducing as much as possible the gap with respect to the cumulative strategy.

In Sect. 2, different incremental tuning strategies for CNNs are introduced. In Sect. 3, we provide details of the two (very different) real-world datasets used for the comparative evaluation, while in Sect. 4 we describe the experiments carried out together with their results. Finally, in Sect. 5, some conclusions are drawn.

2 Incremental Learning Strategies

In this section we present different incremental tuning strategies employing CNNs, which are sufficiently general to be applied to any dataset with a number of possible variations. Then, we illustrate the specific implementations used in our experiments on the two selected datasets.

The different possibilities we explored to deal with an incremental tuning/learning scenario can be conveniently framed in three main strategies:

  1. Training/tuning an ad hoc CNN architecture suitable for the problem.

  2. Using an already trained CNN as a fixed feature extractor in conjunction with an incremental classifier.

  3. Fine-tuning an already trained CNN.

Of course, all the above strategies can also be used outside the context of incremental learning, but very little is known about their applicability and relevance in this setting. In our experiments we tested three instantiations of the aforementioned strategies:

  • LeNet7: consists of the classical “LeNet7” proposed by Yann LeCun in 2004 [11]. Its architecture is based on seven layers (far fewer than current state-of-the-art CNNs designed for large-scale datasets). However, it has been successfully applied to many object recognition datasets (NORB, COIL, CIFAR, etc.) with color or grayscale images of size varying from 32 × 32 to 96 × 96, and it is still competitive on low/medium-scale problems.

  • CaffeNet + SVM: In this strategy we employ a pre-trained CNN provided in the Caffe library [12] “Model Zoo”, the BVLC Reference CaffeNet, which is based on the well-known “AlexNet” architecture proposed in [3] and trained on ImageNet. This model is used off-the-shelf to extract high-level features from the second-to-last hidden layer, following the strategy proposed in [13–15]. Then a linear, incrementally trained SVM is used (instead of the native soft-max output layer) for the final classification (see the sketch after this list).

  • CaffeNet + FT: In this case too, the BVLC Reference CaffeNet is employed; however, instead of using it as a fixed feature extractor, the network is fine-tuned to suit the new task. Although, when fine-tuning, it is generally recommended to use a different learning rate for the last layer (which is re-initialized to match the new number of output neurons) than for the other layers, we found no significant difference during our exploratory analysis and therefore kept the hyper-parametrization as homogeneous as possible (a sketch of this setup is given below, after the description of the VGG_Face variants).
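Below is a minimal sketch of the fixed-feature-extractor strategy. It uses torchvision's AlexNet as a stand-in for the BVLC Reference CaffeNet of our experiments (which were run in Caffe) and scikit-learn's SGDClassifier with a hinge loss as the incremental linear SVM; layer choices and hyper-parameters are illustrative assumptions, not the exact configuration used in this work.

```python
import torch
import torchvision.models as models
from sklearn.linear_model import SGDClassifier

# Pre-trained CNN used as a fixed feature extractor (AlexNet as a stand-in
# for the BVLC Reference CaffeNet); only the classifier on top is updated.
cnn = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
cnn.eval()
# Keep everything up to the second-to-last hidden layer (fc7-like features).
feature_extractor = torch.nn.Sequential(
    cnn.features, cnn.avgpool, torch.nn.Flatten(),
    *list(cnn.classifier.children())[:-1])

# Linear SVM (hinge loss) trained incrementally via partial_fit.
svm = SGDClassifier(loss="hinge", alpha=1e-4)

def update_on_batch(images, labels, classes):
    """Extract fixed CNN features for one incremental batch and update the SVM."""
    with torch.no_grad():
        feats = feature_extractor(images).numpy()
    # `classes` must list all possible labels on the first call to partial_fit.
    svm.partial_fit(feats, labels, classes=classes)
```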

Furthermore, for the BigBrother dataset (see the following section) we decided to test an additional pair of strategies: VGG_Face + SVM and VGG_Face + FT. They are identical to CaffeNet + SVM and CaffeNet + FT, respectively, except for the pre-trained model used. VGG_Face is a very deep architecture (16 layers) that has been trained directly on a very large dataset of faces (2,622 subjects and 2.6 M images) [17].
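The fine-tuning strategies (CaffeNet + FT and VGG_Face + FT) can be sketched as follows; again, a torchvision model stands in for the Caffe models actually used, and the learning rate value is only a placeholder. The key points are the re-initialization of the output layer to the new number of classes and the single, homogeneous learning rate for all layers.

```python
import torch
import torchvision.models as models

def make_finetune_setup(num_classes, lr=1e-3):
    """Load a pre-trained network, re-initialize its output layer for the new
    task, and prepare an optimizer with one homogeneous learning rate."""
    net = models.alexnet(weights=models.AlexNet_Weights.IMAGENET1K_V1)
    in_features = net.classifier[-1].in_features
    # Last layer re-initialized to match the new number of output neurons.
    net.classifier[-1] = torch.nn.Linear(in_features, num_classes)
    # A single parameter group: the same learning rate for old and new layers.
    optimizer = torch.optim.SGD(net.parameters(), lr=lr, momentum=0.9)
    return net, optimizer
```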

3 Datasets

In the literature, there are few labeled image datasets suitable for incremental learning. In particular, we are interested in datasets where the objects of interest have been acquired in a number of successive sessions and the environmental conditions can change between sessions.

We focused on two application fields where incremental learning is very relevant: robotics and biometrics. The two datasets selected, iCubWorld28 [18] and BigBrother [19], are briefly described below.

3.1 iCubWorld28

The iCubWorld28 dataset [18] consists of 28 distinct domestic objects evenly organized into 7 categories (see Fig. 1). Images are 128 × 128 pixels in RGB format. The acquisition session for a single object consists of a video recording of about 20 s in which the object is slowly moved/rotated in front of the camera. Each acquisition session yields about 200 training and 200 test images for each of the 28 objects. Since the dataset was designed to assess the incremental learning performance of the iCub robot's visual recognition subsystem, the same acquisition procedure was repeated on 4 consecutive days, ending up with four subsets (Day 1 to Day 4) of around 8 K images each (39,693 in total).

Fig. 1. Example images of the 28 objects (7 categories) from one of the 4 subsets constituting iCubWorld28.

To better assess the capabilities of our incremental learning strategies, we split each training set of Day 1, 2 and 3 into three parts of equal size (a possible splitting procedure is sketched after Table 1). Day 4, on the contrary, was left unchanged and entirely used as the test set (as in [18]). In Table 1, we report the full details about the size of the training and test sets used in our experiments.

Table 1. iCubWorld28 batch sizes and their membership in the original Days.
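A possible way to split each Day's training set into three equally sized incremental batches is sketched below; the use of a random permutation is an illustrative assumption, as the exact split procedure is not detailed here.

```python
import numpy as np

def split_day_into_batches(images, labels, parts=3, seed=0):
    """Split one Day's training set into `parts` equally sized incremental batches."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(images))   # assumption: random assignment to parts
    chunks = np.array_split(order, parts)
    return [(images[idx], labels[idx]) for idx in chunks]
```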

3.2 BigBrother

The BigBrother dataset [19] was created from 2 DVDs made commercially available at the end of the 2006 edition of the “Big Brother” reality show produced for Italian TV, documenting the 99-day stay of 20 participants in a closed environment. It consists of 23,842 gray-scale face images of size 70 × 70, belonging to 19 subjects (one participant was eliminated right at the beginning of the show). In addition to the typical training and test sets, an additional large set of images (called the “updating set”) is provided for incremental learning/tuning purposes. Details about the composition of each set can be found in [19], together with the number of days each person lived in the house. However, some subjects lived in the house only for a short period, and too few images of them are available for an in-depth evaluation. For this reason, a subset of the whole database, referred to as SETB, was defined by the authors of [19]. It includes the images of the 7 subjects who lived in the house for a longer period (this number of users seems realistic for a home environment application).

In this work, we compare our incremental tuning strategies on the SETB of the BigBrother dataset, which comprises a total of 54 incremental batches. In Fig. 2, an example image for each of the seven subjects of SETB is shown. It is worth noting that the images were automatically extracted from the video frames by the Viola–Jones detector [20] and are often characterized by bad lighting, poor focus, occlusions and non-frontal poses.
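For illustration only, the following sketch shows how face crops of this kind could be obtained with OpenCV's Haar-cascade implementation of the Viola–Jones detector; it is not the pipeline originally used to build the dataset, and the detection parameters are assumptions.

```python
import cv2

# OpenCV's bundled Haar cascade for frontal faces (Viola-Jones detector).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_faces(frame_bgr, size=(70, 70)):
    """Detect faces in a video frame and return 70 x 70 gray-scale crops."""
    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    return [cv2.resize(gray[y:y + h, x:x + w], size) for (x, y, w, h) in boxes]
```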

Fig. 2. Example images of the seven subjects contained in the SETB of the BigBrother dataset.

4 Experiments and Results

For all the strategies employed in this work, we trained the models until full convergence on the first batch of data and then tuned them on the successive incremental batches, trying to balance the trade-off between accuracy gain and forgetting. This protocol fits the requirements of many real-world applications, where a reasonable initial accuracy is demanded and the first batch is large enough to reach it.

For the iCubWorld28 dataset, the three strategies were validated over 10 runs in which we randomly shuffled the order of the batches \( TrainB_{2}, \ldots, TrainB_{9} \).
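This evaluation protocol can be sketched as follows; `train_to_convergence`, `evaluate` and `tune_on_batch` (a possible implementation of the latter is given after the next paragraph) are hypothetical helpers, and `model_factory` is assumed to build a fresh model for each run.

```python
import random

def run_incremental_protocol(model_factory, train_batches, test_set, runs=10, seed=0):
    """train_batches = [TrainB_1, ..., TrainB_9]; returns one accuracy curve per run."""
    rng = random.Random(seed)
    curves = []
    for _ in range(runs):
        net = train_to_convergence(model_factory(), train_batches[0])  # first batch
        incremental = train_batches[1:]
        rng.shuffle(incremental)              # random order of TrainB_2 ... TrainB_9
        accuracies = [evaluate(net, test_set)]
        for batch in incremental:
            net = tune_on_batch(net, batch)   # early-stopped incremental tuning
            accuracies.append(evaluate(net, test_set))
        curves.append(accuracies)
    return curves
```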

To control forgetting during the incremental learning phase we relied on early stopping: for each batch, a fixed number of iterations was performed, depending on the specific strategy. For example, for LeNet7, trained with stochastic gradient descent (SGD), we chose a learning rate of 0.01, a mini-batch size of 100 and 50 iterations for each of the eight incremental batches. We found that tuning these hyper-parameters can have a significant impact on forgetting (see an example on the BigBrother dataset in Fig. 4).
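A minimal sketch of such an early-stopped update is given below, using the LeNet7 settings quoted above (learning rate 0.01, mini-batch size 100, 50 iterations); the PyTorch-style training loop is illustrative, since our experiments were run in Caffe.

```python
import torch
from torch.utils.data import DataLoader

def tune_on_batch(net, batch_dataset, lr=0.01, batch_size=100, iterations=50):
    """Update the model on one incremental batch with a fixed, small number of
    SGD iterations; the early stop is what limits forgetting."""
    loader = DataLoader(batch_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(net.parameters(), lr=lr)
    criterion = torch.nn.CrossEntropyLoss()
    net.train()
    done = 0
    while done < iterations:
        for images, labels in loader:
            optimizer.zero_grad()
            criterion(net(images), labels).backward()
            optimizer.step()
            done += 1
            if done >= iterations:
                break
    return net
```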

To better understand the efficacy of the proposed incremental strategies and to quantify the impact of forgetting, we also compared each model against the corresponding cumulative strategy. In Fig. 3, the average accuracy over the ten runs is reported for each strategy. We note that:

  • CaffeNet + SVM shows a fairly good recognition rate increase along the 9 batches, moving from an accuracy of 41.63% to 66.97%. Its standard deviation is initially higher than that of the other strategies, but it rapidly decreases as new batches of data become available and the SVM model is updated. Furthermore, the small gap with respect to its cumulative counterpart shows that a fixed feature extractor favors stability and reduces forgetting.

  • CaffeNet + FT is the most effective strategy on this dataset. This is probably because the features originally learned on ImageNet are very general, and iCubWorld28 can be regarded as a specific sub-domain where fine-tuning the features helps classification. Moreover, even if splitting the dataset into 9 batches makes the task harder, we achieved an average accuracy of 78.40%, which outperforms the methods previously proposed for the same dataset [18]. Even though in this case the gap with respect to the cumulative approach is slightly larger, a proper adjustment of early stopping and learning rate during the incremental phase makes it possible to effectively control forgetting.

  • LeNet7, on this dataset, is probably unable to learn (given the limited number of training patterns) the complex invariant features needed to deal with the multi-axis rotations, partial occlusions and complex backgrounds that characterize this problem. The gap with respect to the cumulative approach is large here. This is in line with previous studies [7, 10] showing that smaller models without pre-training are much more susceptible to forgetting.

Fig. 3. iCubWorld28 dataset: average accuracy during incremental training (8 batches). The bars indicate the standard deviation over the ten runs performed for each strategy. The dotted lines denote the cumulative strategies.

For the SETB of the BigBrother dataset, in order to obtain reproducible and comparable results, we kept the order of the 54 updating batches fixed (i.e., no shuffling), as in the original dataset [19]. In Fig. 5, accuracy results are reported for each of the 5 tested strategies. It is worth pointing out that in this case the 54 incremental batches used to update the model vary widely in the number of patterns they contain: from a few dozen to many hundreds. This is typical of many real-world systems, where the assumption of collecting uniform and equally sized batches is often unrealistic.

Controlling forgetting was more complex here than for the iCubWorld28 dataset. In fact, due to the aforementioned high variation in the number of patterns across the incremental batches, we found that adapting the tuning strength to the batch size can lead to relevant improvements.

In Fig. 4, an example parameterization for the CaffeNet + FT strategy is reported, comparing the learning trend obtained with (i) a low learning rate, (ii) a high learning rate, and (iii) a learning rate adjusted according to the size of the batch. The results show that the adjustable learning rate performs best; therefore, in the rest of the experiments on the BigBrother dataset, an adjustable learning rate is used.
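The exact adjustment rule is not reproduced here; the function below is only an illustrative sketch of the idea of scaling the learning rate with the size of the incoming batch, and all constants are assumptions.

```python
def adjusted_learning_rate(batch_size, base_lr=1e-3,
                           reference_size=300, lr_min=1e-4, lr_max=1e-2):
    """Illustrative rule: smaller incremental batches get a lower learning rate
    (less forgetting), larger batches a higher one (more plasticity)."""
    lr = base_lr * batch_size / reference_size
    return max(lr_min, min(lr, lr_max))
```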

Fig. 4. Accuracy results of different parameterizations for the CaffeNet + FT strategy: an example of the impact of the learning rate on the BigBrother dataset in our incremental scenario (54 batches).

In Fig. 5, the accuracy on the BigBrother dataset is reported for all the strategies introduced in Sect. 2. For this dataset we note that:

  • Unlike in the previous experiment, here the LeNet7 model performs slightly better than CaffeNet + SVM and CaffeNet + FT. This is probably due to the highly specific features (and invariances) required for face recognition. Hence, learning the features from scratch for this dataset seems more appropriate than adapting general-purpose features by fine-tuning.

  • The previous observation is corroborated by the very good performance of the VGG_Face + SVM and VGG_Face + FT strategies. In fact, since the VGG_Face features have been learned on a face recognition task using a dataset containing millions of faces, they are quite effective for transfer learning within the same domain.

  • Since the features are already optimal for the domain, VGG_Face + SVM seems to be the best choice in terms of both accuracy and stability. It reaches an accuracy of 96.73%, which is 24.1% better than the accuracy reported in [21] for the same dataset (in the supervised learning scenario).

Fig. 5. BigBrother dataset (SETB): accuracy of the different strategies during incremental training (54 batches).

5 Conclusions

Incremental and on-line learning are still scarcely studied in the field of deep learning (especially for CNNs), yet they are essential for many real-world applications. In this work we explored different strategies to train convolutional neural networks incrementally. We recognize that the empirical evaluation carried out in this work is still limited, and further studies are necessary to better understand the advantages and weaknesses of each strategy. However, it seems that the lessons learned in classical transfer learning [13, 14, 22] hold here too:

  • Forgetting can be a very detrimental issue: hence, when possible (i.e., when transferring from the same domain), it is preferable to use the CNN as a fixed feature extractor to feed an incremental classifier. In general, this results in better stability and often in improved efficiency (tuning all CNN layers can be computationally expensive).

  • If the features are not already optimized (i.e., transfer learning from a different domain), tuning also the lower-level layers may be preferable, and the learning strength (i.e., learning rate, number of iterations, etc.) can be used to control forgetting.

  • Training a CNN from scratch can be advantageous if the problem patterns (and feature invariances) are highly specific and a sufficient number of samples is available.

In the near future, we plan to extend this work with a more extensive experimental evaluation, seeking a more principled way to control forgetting and to adapt the tuning parameters to the size (and bias) of each incremental batch. Finally, we are interested in studying real-world applications of semi-supervised incremental learning strategies for CNNs, with approaches similar to [23].