Online Subclass Knowledge Distillation

https://doi.org/10.1016/j.eswa.2021.115132

Highlights

  • A novel distillation method aiming to reveal the subclass similarities is proposed.

  • The OSKD method derives the soft labels from the model itself, in an online manner.

  • The OSKD method is model-agnostic.

  • The experiments validate the effectiveness of the OSKD method.

Abstract

Knowledge Distillation has been established as a highly promising approach for training compact and faster models by transferring knowledge from heavier and more powerful models, so as to satisfy the computation and storage requirements of deploying state-of-the-art deep neural models on embedded systems. However, conventional knowledge distillation requires multiple stages of training, rendering it a computationally and memory-demanding procedure. In this paper, a novel single-stage self-knowledge distillation method is proposed, namely Online Subclass Knowledge Distillation (OSKD), which aims at revealing the similarities inside classes, improving the performance of any deep neural model in an online manner. Hence, as opposed to existing online distillation methods, we are able to acquire further knowledge from the model itself, without building multiple identical models or using multiple models to teach each other, rendering the OSKD approach more effective. The experimental evaluation on five datasets indicates that the proposed method enhances the classification performance, while comparison results against existing online distillation methods validate its superiority.

Introduction

Deep Learning (DL) models (Deng, 2014) have been extensively used in recent years to resolve a wide spectrum of visual analysis tasks, outperforming previous solutions (Guo et al., 2016, Araque et al., 2017, Redmon and Farhadi, 2017, Graves et al., 2013, Nweke et al., 2018, Do et al., 2019). Generally, DL models owe their outstanding performance to their depth and complexity. This significantly hampers the applicability of state-of-the-art models on devices with limited computational resources, such as embedded systems or mobile phones, naturally creating a demand for compact yet effective models that reduce storage requirements and computational cost.

Several solutions have been proposed in recent years to accomplish this goal (Cheng, Wang, Zhou, & Zhang, 2017). For example, considerable research has been performed on developing models that are compact and effective by design, so as to satisfy the memory and computation requirements while retaining high accuracy (Howard et al., 2017, Zhang et al., 2018b, Sandler et al., 2018, Iandola et al., 2016, Han et al., 2016, Huang et al., 2018). Another line of research includes parameter pruning, where the redundancy in the parameters of the model is investigated and the complexity of the model is reduced by removing the redundant parameters (Srinivas and Babu, 2015, Molchanov et al., 2017). Similarly, network quantization reduces the number of bits used for parameter representation in order to compress the model (Wu et al., 2016, Han et al., 2016). Finally, Knowledge Distillation (KD) (also known as Knowledge Transfer) (Hinton et al., 2015, Romero et al., 2014, Buciluă et al., 2006, Ba and Caruana, 2014, Chen et al., 2016, Chan et al., 2015, Tang et al., 2016, Passalis and Tefas, 2018, Passalis and Tefas, 2019, Kim et al., 2018) has emerged as a highly promising approach to this issue, proposing to transfer the knowledge from one, usually larger, model to a more compact model.

KD methods fall into two broad categories: online and offline KD. Offline KD refers to the multistage process of first training a heavyweight and complex model, known as the teacher, which accomplishes high performance, and then transferring its knowledge to a more compact and faster model, known as the student. More specifically, the student model is trained to regress the so-called soft labels, produced by softening the output distribution of the teacher model, that is, by raising the temperature of the softmax activation function on the output layer of the teacher model. The motivation behind this practice is that these soft labels, as opposed to the hard labels, can uncover information about the model’s generalization mechanism, implicitly recovering similarities over the data. Amongst KD methods, a distinct subcategory is the so-called self-distillation, where the knowledge is transferred from teachers to students of identical capacity (Furlanello et al., 2018, Lan et al., 2018).
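As an illustration of the soft-label mechanism described above (a minimal sketch, not the authors' code), the standard offline KD objective can be written in PyTorch as follows; the temperature value and the T² scaling factor are conventional choices assumed here, not prescribed by this paper.

```python
import torch
import torch.nn.functional as F


def soften(logits: torch.Tensor, temperature: float = 4.0) -> torch.Tensor:
    # Raising the softmax temperature (T > 1) flattens the output distribution,
    # exposing the relative probabilities assigned to the non-target classes.
    return F.softmax(logits / temperature, dim=1)


def offline_kd_loss(student_logits: torch.Tensor,
                    teacher_logits: torch.Tensor,
                    temperature: float = 4.0) -> torch.Tensor:
    # The student regresses the teacher's soft labels via KL divergence;
    # the T^2 factor keeps gradient magnitudes comparable across temperatures.
    soft_targets = soften(teacher_logits, temperature)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=1)
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2
```

In offline KD this term is typically combined with the ordinary cross-entropy loss on the hard labels, after the teacher has already been trained in a separate stage.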

Conventional offline KD is a research topic that has flourished in recent years, with a broad spectrum of applications ranging from classification (Mirzadeh et al., 2019, Passalis and Tefas, 2018) and semantic segmentation (Liu et al., 2019), to visual question answering (Mun, Lee, Shin, & Han, 2018) and top-N recommendation (Pan et al., 2019, Pan et al., 2020a, Pan et al., 2020b, Zhang and He, 2020).

However, offline KD is inherently accompanied by some major limitations. That is, offline distillation requires a two-step sequential training process which cannot be parallelized. As a result, offline distillation often doubles the training time, which could discourage the use of such methods in practice. Thus, another line of research attempts to mitigate these flaws by developing distillation methods that simplify the training pipeline to a single stage. The so-called online KD describes the procedure where the teacher and the student networks are trained simultaneously, that is, by omitting the stage of pretraining the teacher network. For instance, a recent online KD work proposes to train multiple models mutually, each learning from the others (Zhang, Xiang, Hospedales, & Lu, 2018a), while another approach proposes to create ensembles of multiple identical branches of a target network in order to build a strong teacher and distill the knowledge from the teacher to the target network (Lan, Zhu, & Gong, 2018).

It is noteworthy that, as explained in Lan et al. (2018), online distillation is able to readily scale up and parallelize the training process with virtually no extra effort or communication overhead, often matching the theoretical speedup (2×). Apart from this, online distillation often allows for training more accurate models compared to offline distillation, since at any given time the gap between the student and teacher models is smaller than in offline distillation (Mirzadeh et al., 2019).

Furthermore, in Furlanello et al. (2018) it is demonstrated that useful information about the similarities of the samples with the classes can be obtained even by transferring the knowledge through the class probability distribution from a teacher network of identical capacity to the student. In addition, in Ba and Caruana (2014) it is demonstrated that small networks usually have the same representation capacity as large networks; however, they are harder to train. Taking these observations into consideration, a question that arises is how one can efficiently train small yet effective networks, deriving additional information beyond the hard labels from the model itself and in an online manner.

Additionally, motivated by the basic KD intuition that it is useful to maintain the similarities of the data with the other classes instead of simply training with the hard labels, and also by the inherent inefficiency of conventional KD when the number of classes, and hence the information to be transferred, is limited, we advocate that inside the classes there are also subclasses that share semantic similarities, and that it is also useful to maintain the similarities of the data with these subclasses. That is, the question that arises is how one can efficiently train small networks with an additional supervision signal that conveys extra knowledge about the similarities of the data samples with the subclasses, derived from the model itself and in an online manner.

Thus, in this paper, we propose a novel online self-distillation approach, namely Online Subclass Knowledge Distillation (OSKD). The intuition behind the proposed distillation method, considering a probabilistic view of KD, is as follows. During the learning process, the probability distribution of the data is transformed layer by layer in a DL model, learning progressively more complex layer representations. Thus, considering a multiclass classification task, the data representations at the output layer of the model are forced by a regular supervised loss to become one-hot representations. However, the process of converting the complex data representations to one-hot representations usually leads to over-fitting and also requires deeper and more complicated models. Thus, while the conventional KD methodology suggests that it is useful for each sample to maintain the similarities with the other classes, we argue that it is advantageous to also maintain the similarities with the subclasses in order to further improve the generalization ability of the model.

More specifically, the proposed online distillation method considers that inside each class there is also a set of subclasses that share semantic similarities (e.g. blue cars, inflatable boats, etc.). Since the subclasses inside each class are unknown, it is not feasible to pursue an approach similar to traditional KD and soften their distribution; instead, our goal is to discover them during the training procedure. To achieve this goal we propose to estimate them using the neighborhood of each sample. That is, we make the assumption that the nearest neighbors, in terms of a similarity metric, of each sample inside a class share the same semantic similarities, and thus belong to the same subclass.

Therefore, apart from the regular classification objective, an additional distillation objective is introduced, which encourages each data representation to come closer to the nearest representations of the same class. In this way, the subclasses are revealed throughout the training procedure. At the same time, the data representations are forced to move further away from the nearest representations of the other classes, ensuring that the distillation objective prevents representation entanglement. As is also validated through the conducted experiments, the proposed method is able to derive useful information and progressively uncover more meaningful subclasses throughout the training procedure, since the representations are driven by the supervised loss. It is, finally, noteworthy that subclass information has been successfully used to improve the accuracy of various learning problems (Nikitidis et al., 2012, Nikitidis et al., 2014, Maronidis et al., 2015), underlining the importance of harnessing subclass information during the training process of powerful, yet prone to over-fitting, DL models.
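To make the attract/repel intuition concrete, a minimal, self-contained PyTorch sketch is given below. It assumes cosine similarity as the similarity metric, a batch-wise neighbour search, and a simple additive combination of the attraction and repulsion terms; these choices, as well as the function names, are illustrative assumptions rather than the authors' exact OSKD formulation.

```python
import torch
import torch.nn.functional as F


def subclass_distillation_loss(reps: torch.Tensor,
                               labels: torch.Tensor,
                               k: int = 5) -> torch.Tensor:
    """Illustrative sketch (not the exact OSKD objective): pull each sample's
    representation towards its k nearest same-class neighbours (revealing
    subclasses) and push it away from its k nearest other-class neighbours.

    reps:   (N, D) batch of representations, e.g. penultimate-layer features
    labels: (N,)   hard class labels
    Assumes every sample has at least k same-class and k other-class
    neighbours within the batch.
    """
    normed = F.normalize(reps, dim=1)
    sim = normed @ normed.t()                           # pairwise cosine similarities
    sim.fill_diagonal_(float("-inf"))                   # a sample is not its own neighbour
    same = labels.unsqueeze(0) == labels.unsqueeze(1)   # mask of same-class pairs

    same_sim = sim.masked_fill(~same, float("-inf"))    # similarities within the class
    diff_sim = sim.masked_fill(same, float("-inf"))     # similarities to other classes

    attract = -same_sim.topk(k, dim=1).values.mean()    # maximise same-class similarity
    repel = diff_sim.topk(k, dim=1).values.mean()       # minimise other-class similarity
    return attract + repel


# Single-stage usage sketch: the term is simply added to the supervised loss,
# so no pretrained teacher and no extra model copies are required.
# logits, reps = model(x)   # hypothetical model returning logits and features
# loss = F.cross_entropy(logits, labels) + alpha * subclass_distillation_loss(reps, labels)
```

Because the extra term operates directly on the representations produced during the ordinary forward pass, it adds no second model and no separate distillation stage, matching the single-stage, self-distillation character of the proposed method.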

The main contributions and advantages of the proposed online distillation method can be summarized as follows:

  • The proposed OSKD method is the first KD method which aims at deriving additional knowledge by discovering the subclass structure of the data in an online manner and also from the model itself.

  • As opposed to the conventional KD methodology, which comes with increased training cost both in terms of time and pipeline complexity, the OSKD method is faster and simpler, since it is a single-stage online KD method, and thus it is also more commercially attractive (Anil et al., 2018). The absence of a first stage for training a strong, heavyweight teacher also brings significant gains in terms of computation and memory cost.

  • The proposed method is capable of deriving additional knowledge beyond the hard labels from the model itself. Surveying the relevant literature, we can observe that competitive online distillation methods require multiple copies of the target network to build a strong teacher, or utilize multiple models to train each other in order to derive additional knowledge, leading to training procedures that are multiple times more computationally expensive. In contrast, the proposed method derives the additional knowledge from the model itself in an online manner, without the need to utilize multiple models, and hence it has negligible additional computational cost.

  • The proposed method, as validated through the conducted experiments, is able to derive useful information about the similarities of the data, which becomes progressively more reliable throughout the training procedure, since it is driven by the supervised loss. On the contrary, competitive approaches, which for example include the mutual training of multiple students from different initial conditions, may only provide restricted additional information.

  • The proposed online method does not require fine-tuning any additional hyper-parameters, such as the temperature of the softmax activation function, which is in general crucial for obtaining notable improvements when applying the conventional distillation approach.

  • The OSKD method is model-agnostic; that is, it can be applied to any DL model to improve its performance. In the performed experiments, several architectures have been utilized, varying from simple and lightweight models to deeper ones (e.g. ResNet (He, Zhang, Ren, & Sun, 2016)), considerably improving the classification performance in every considered case.

  • Another critical issue for the effectiveness of conventional KD is the compatibility between the student and the teacher models. That is, the distillation process is not always effective; for example, it has been demonstrated that when the gap between the teacher and the student is large, the latter’s performance degrades (Mirzadeh et al., 2019). The self-nature of the proposed method therefore inherently guarantees the extraction of useful knowledge compatible with the fast student model.

  • The OSKD method removes the dependency on separate teacher models, which also reduces the required hardware resources by half, going beyond the state-of-the-art. As a result, the proposed method can provide the benefits of distillation without requiring a separate teacher model or increasing the resources/time needed during training.

  • The proposed distillation method can be combined with any other method for developing effective and faster models, e.g. (Zhang et al., 2018b, Sandler et al., 2018).

The rest of the manuscript is structured as follows. Section 2 discusses previous distillation works. The proposed method is presented in Section 3. Subsequently, the experiments performed to evaluate the proposed method are presented in Section 4, and finally, the conclusions are drawn in Section 5.

Section snippets

Previous work

In this section, recent works in the general area of Knowledge Transfer (KT), as well as on online KD, which is more relevant to our work, are presented.

Knowledge Transfer has been extensively studied in recent years with a wide range of applications (Pan et al., 2018, Liu et al., 2019, Mun et al., 2018, Wang et al., 2018). Firstly in Buciluă et al. (2006) and then in Hinton et al. (2015) the idea of distilling the knowledge from a powerful teacher to a weaker student by encouraging the latter to regress

Proposed method

In this paper, we propose a novel online subclass distillation method which allows for developing efficient and fast-to-execute models for various applications with computational and memory restrictions (e.g. generic robotics applications). Consider for example the problem of crowd detection for autonomous unmanned aerial vehicles (Tzelepi & Tefas, 2019). In such a problem, lightweight models, which should be able to operate on-board (that is on low power GPU) at sufficient speed, are required.

Experiments

First, a toy example, utilizing the MNIST dataset (LeCun, Bottou, Bengio, & Haffner, 1998), is constructed so as to illustrate the effect of the proposed distillation method. Subsequently, three datasets were used to evaluate the performance of the proposed method. The descriptions of the datasets as well as the model architecture are presented in the following subsections. We performed four sets of experiments utilizing four different numbers of nearest neighbors, as well as for two different

Conclusions

In this paper a novel single-stage knowledge distillation method is proposed, namely Online Subclass Knowledge Distillation, that aims to reveal the similarities inside classes, improving the performance of any deep neural model in an online manner. As opposed to existing online distillation methods, the proposed method is capable of obtaining further knowledge from the model itself, without building multiple identical models or using multiple models to teach each other, rendering the OSKD

CRediT authorship contribution statement

Maria Tzelepi: Writing - original draft, Writing - review & editing, Methodology, Software. Nikolaos Passalis: Writing - review & editing, Methodology. Anastasios Tefas: Writing - review & editing, Methodology, Supervision.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This research is co-financed by Greece and the European Union (European Social Fund - ESF) through the Operational Programme “Human Resources Development, Education and Lifelong Learning 2014–2020” in the context of the project “Lightweight Deep Learning Models for Signal and Information Analysis” (MIS 5047925).

References (70)

  • Ba, J., & Caruana, R. (2014). Do deep nets really need to be deep? In Advances in neural information processing systems...
  • C. Buciluă et al.

    Model compression

  • W. Chan et al.

    Transferring knowledge from a rnn to a dnn

  • G. Chen et al.

    Learning efficient object detection models with knowledge distillation

  • Cheng, Y., Wang, D., Zhou, P., & Zhang, T. (2017). A survey of model compression and acceleration for deep neural...
  • T. Chen et al.

    Net2net: Accelerating learning via knowledge transfer

  • L. Deng

    A tutorial survey of architectures, algorithms, and applications for deep learning

    APSIPA Transactions on Signal and Information Processing

    (2014)
  • Ding, Q., Wu, S., Sun, H., Guo, J., & Xia, S.-T. (2019). Adaptive regularization of labels. arXiv preprint...
  • T. Furlanello et al.

    Born again neural networks

  • A. Graves et al.

    Speech recognition with deep recurrent neural networks

  • S. Han et al.

    Deep compression: Compressing deep neural network with pruning, trained quantization and huffman coding

  • Heo, B., Lee, M., Yun, S., & Choi, J.Y. (2019). Knowledge transfer via distillation of activation boundaries formed by...
  • K. He et al.

    Deep residual learning for image recognition

  • Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the knowledge in a neural network. arXiv preprint...
  • Howard, A. G., Zhu, M., Chen, B., Kalenichenko, D., Wang, W., Weyand, T., Andreetto, M., & Adam, H. (2017). Mobilenets:...
  • G. Huang et al.

    Condensenet: An efficient densenet using learned group convolutions

  • Iandola, F. N., Han, S., Moskewicz, M. W., Ashraf, K., Dally, W. J., & Keutzer, K. (2016). Squeezenet: Alexnet-level...
  • X. Jin et al.

    Knowledge distillation via route constrained optimization

  • Kim, J., Hyun, M., Chung, I., & Kwak, N. (2019). Feature fusion for online mutual knowledge distillation. arXiv...
  • S. Kim et al.

    Transferring knowledge to smaller network with class-distance loss

  • J. Kim et al.

    Paraphrasing complex network: Network compression via factor transfer

  • A. Krizhevsky et al.

    Learning multiple layers of features from tiny images

    Technical Report Citeseer

    (2009)
  • Lan, X., Zhu, X., & Gong, S. (2018). Self-referenced deep learning. In Asian conference on computer vision (pp....
  • Lan, X., Zhu, X., & Gong, S. (2018). Knowledge distillation by on-the-fly native ensemble. In Advances in neural...
  • Y. LeCun et al.

    Gradient-based learning applied to document recognition

    Proceedings of the IEEE

    (1998)