
Neurocomputing

Volume 531, 28 April 2023, Pages 87-99

Federated learning by employing knowledge distillation on edge devices with limited hardware resources

https://doi.org/10.1016/j.neucom.2023.02.011

Abstract

This paper presents a federated learning approach that utilizes the computational resources of IoT edge devices for training deep neural networks. In this approach, the edge devices and the cloud server collaborate in the training phase while preserving the privacy of the edge device data. Owing to the limited computational power and resources available to the edge devices, instead of the original neural network (NN), we suggest using a smaller NN generated by a proposed heuristic method. In the proposed approach, the smaller model, which is trained on the edge device, is derived from the main NN model. By exploiting the Knowledge Distillation (KD) approach, the knowledge learned by the server and the edge devices can be exchanged, lowering the required computation on the server while preserving the data privacy of the edge devices. Also, to reduce the knowledge transfer overhead on the communication links between the server and the edge devices, a method for selecting the most valuable data for transferring the knowledge is introduced. The effectiveness of the method is assessed by comparing it to state-of-the-art methods. The results show that, on the CIFAR-10 dataset, the proposed method lowers the communication traffic by up to 250× and increases the learning accuracy in the cloud by an average of 8.9% compared to prior KD-based distributed training approaches.

Introduction

Neural networks (NNs), especially deep NNs (DNNs), have become very popular in recent years owing to their high output accuracy in many applications. This high accuracy has been achieved through a considerable increase in the complexity and parameter counts of DNNs. Obviously, the increase in NN model size demands greater computational and memory resources for the training and inference phases. To fulfill these requirements, GPUs and special-purpose hardware accelerators are currently the platforms used for training DNNs. At the same time, the size of the training dataset impacts the quality of the trained DNN. This, in turn, makes storing and accessing large datasets for training another challenge in the design of DNNs.

One solution to tackle these challenges is a distributed learning paradigm in which the required compute and/or memory resources are divided among multiple processing nodes. While distributed learning can expedite the training phase, it faces some challenges. The main one is integrating the training performed at the different processing nodes. State-of-the-art solutions to this challenge can be found in transfer learning methods [1]. In transfer learning, the knowledge available in a first group of NNs is utilized to improve the accuracy of a second group of NNs without performing a complete training operation with the datasets of the first group. In addition, in distributed learning, each processing unit often needs to transfer its data to a central server or to other processing units. The frequency of these transfers over the network, the volume of the data, and the latency of the transfers are critical parameters in distributed learning methods. Furthermore, the timing protocol used for data transfer categorizes distributed learning approaches into synchronous and asynchronous types [2]. In the former, a processing unit updates its model only when the other processing units also update theirs, while in the latter this timing constraint is not imposed and each processing unit can update its model independently.
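As a rough illustration of the difference between the two timing protocols, the following Python sketch contrasts a synchronous update, where the server waits for every worker's gradient and applies one averaged step, with an asynchronous update, where each worker's contribution is applied as soon as it arrives. The model, gradients, and learning rate are hypothetical placeholders, not values from the paper.

    import numpy as np

    rng = np.random.default_rng(0)
    model = np.zeros(4)          # global parameter vector held by the server
    lr = 0.1
    worker_grads = [rng.normal(size=4) for _ in range(3)]  # hypothetical per-worker gradients

    # Synchronous: wait for all workers, then apply a single averaged update.
    sync_model = model - lr * np.mean(worker_grads, axis=0)

    # Asynchronous: apply each worker's update as soon as it arrives,
    # without waiting for the others (arrival order stands in for time).
    async_model = model.copy()
    for g in worker_grads:
        async_model -= lr * g

    print(sync_model, async_model)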

One of the application domains that makes use of NNs is the Internet of Things (IoT). IoT also plays a major role in improving the quality of current datasets. IoT applications include healthcare, smart homes, smart vehicles, and even industrial applications in factories (industrial IoT) [3]. These applications are growing rapidly, especially with the expansion of 5G coverage. An IoT network typically comprises several low-end, low-power devices (the edge devices) alongside one or more powerful central servers to which all edge devices are connected (the cloud). Conventionally, edge devices are responsible for gathering raw data, performing simple preprocessing operations on it, and sending it to the cloud for the more intensive processing required in the inference phase. After analyzing the data, the cloud servers send the results back to the edge devices for subsequent operations. Many approaches (on software and hardware platforms) have been introduced to move the processing of the inference phase to the edge devices in order to reduce the computational load on the cloud, lower the required data transfer bandwidth, and preserve the privacy of the edge user (see, e.g., [3], [4], [5]).

In addition to inference, some recent efforts have used the computational resources of edge devices in the training phase through distributed learning approaches (see, e.g., [4], [6]). More specifically, federated learning approaches are the state of the art for distributed learning on the edge [7], [8]. In these approaches, a server is set up to aggregate the parameters obtained through training at the individual nodes (edge devices). In federated learning, special attention is paid to data privacy by never transferring the raw data of a node to other nodes or to the server(s). After aggregating the parameters and updating the server weights, the server returns the new parameters to the nodes. The main challenge in this approach is the limited computational and memory resources of edge devices, which make it impractical for training large DNNs. One solution to overcome this problem is the knowledge distillation approach, where smaller NNs derived from the original one are trained on the edge devices and their knowledge is transferred to the NN on the server [9].
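For context, a minimal sketch of the FedAvg-style weighted parameter aggregation that such federated learning approaches perform on the server is shown below. The layer name, client weights, and dataset sizes are hypothetical, and this baseline scheme is what the paper builds on rather than its proposed method.

    import numpy as np

    def aggregate(client_weights, client_sizes):
        """Weighted average of the clients' parameter dictionaries."""
        total = float(sum(client_sizes))
        return {name: sum(w[name] * (n / total)
                          for w, n in zip(client_weights, client_sizes))
                for name in client_weights[0]}

    # Three hypothetical clients sharing the same single-layer model.
    clients = [{"fc.weight": np.full((2, 2), float(i))} for i in (1, 2, 3)]
    sizes = [100, 200, 300]                     # hypothetical local dataset sizes
    global_weights = aggregate(clients, sizes)  # server-side aggregation step
    print(global_weights["fc.weight"])          # weighted toward larger clients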

In this paper, we introduce an asynchronous distributed learning method that employs knowledge distillation to involve low-end edge devices in the training phase of DNNs used in IoT networks. This approach, which is an extension of the federated learning method, uses the computational resources of the edge devices for training. It thus lowers the workload on the cloud server while preserving the user’s privacy. In the proposed approach, the computational resources of the cloud are also used for training the DNN; the cloud has access only to its own local data. The use of transfer learning performed via the knowledge distillation approach enables us to use considerably smaller NNs on the edge devices for training, resulting in a smaller bandwidth for transferring knowledge between the server and the edge devices. In the proposed training method, edge devices have access to their local data as well as a global dataset, which helps improve the training quality. Additionally, it is possible to train the network on the edge devices without accessing the labels of the dataset samples stored there. This feature makes the proposed asynchronous distributed learning method more practical since the edge device user does not need to be involved in the labeling process.
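To make the knowledge exchange concrete, a minimal sketch of the standard soft-target distillation loss (Hinton-style) is given below; the tensor shapes and temperature are hypothetical, and the paper's exact loss formulation may differ. Note that only teacher logits are needed, which is what allows training on the edge without ground-truth labels.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, T=4.0):
        """KL divergence between the softened teacher and student distributions."""
        soft_teacher = F.softmax(teacher_logits / T, dim=1)
        log_student = F.log_softmax(student_logits / T, dim=1)
        # Scaling by T^2 keeps gradient magnitudes comparable across temperatures.
        return F.kl_div(log_student, soft_teacher, reduction="batchmean") * T * T

    student_logits = torch.randn(8, 10, requires_grad=True)  # hypothetical edge-side outputs
    teacher_logits = torch.randn(8, 10)                       # logits received from the server
    loss = distillation_loss(student_logits, teacher_logits)
    loss.backward()        # no ground-truth labels are involved in this step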

The contributions of this work may be summarized as follows:

  1. Proposing an approach that combines the computational resources of the server and the edge devices to train DNNs in a distributed environment. In this structure, weak edge devices with limited computational power can also participate in the training process through the KD approach.
  2. Reducing the communication load between the edge devices and the server during the learning process by reducing the number of parameters exchanged. Notably, in the proposed approach, the communication traffic between the cloud server and the edge devices does not depend on the DNN structure, thanks to the transmission of logits between nodes; a back-of-the-envelope illustration is sketched after this list. The dependence of the communication load on the DNN structure/size is a challenge in most decentralized and federated learning approaches (see, e.g., [9], [11]).
  3. Preserving the data privacy of the edge devices and the server.
  4. Training on the edge devices without requiring labels for the samples, which makes the approach more practical.
  5. Proposing an efficient heuristic method to generate small student DNNs, structurally similar to the large teacher DNN, for deployment on the edge devices.
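To make the second contribution more concrete, the following back-of-the-envelope sketch compares the per-round payload of exchanging full model weights with that of exchanging logits for a set of selected samples. The parameter count, sample count, and class count are hypothetical values chosen purely for illustration.

    # Rough payload comparison: sending model weights vs. sending logits.
    BYTES_PER_FLOAT = 4

    num_parameters = 11_000_000          # hypothetical DNN size (ResNet-scale)
    weights_payload = num_parameters * BYTES_PER_FLOAT

    num_selected_samples = 1_000         # only the most valuable samples are sent
    num_classes = 10                     # CIFAR-10-like setting
    logits_payload = num_selected_samples * num_classes * BYTES_PER_FLOAT

    print(f"weights: {weights_payload / 1e6:.1f} MB per round")
    print(f"logits : {logits_payload / 1e6:.2f} MB per round")
    # The logit payload depends only on the number of samples and classes,
    # not on the DNN's parameter count.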

The rest of the paper is organized as follows. Training based on knowledge distillation as well as prior distributed learning methods are discussed in Section II, and the details of the proposed method are provided in Section III. In Section IV, the effectiveness of the proposed method is assessed. Finally, the paper is concluded in Section V.


Background and related works

In this section, first, a brief background on the employed knowledge distillation technique is provided, and then prior works on distributed learning on edge devices are reviewed.

The proposed method

As mentioned before, the goal of the proposed distributed training approach is to make use of the weak and medium processing resources that may be available at the edge during the training phase. Hence, as the first step in the proposed approach, we must generate a small NN (student) from the structure of the main NN (teacher). The student is generated on the server and trained on a small portion of the local dataset that is available on the server. Next, this network is sent to edge devices to
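As a purely illustrative stand-in for such a student-generation heuristic, the sketch below uniformly scales the hidden-layer widths of a hypothetical teacher until the student fits a given parameter budget. The shrink rule, layer description, and budget are assumptions and not the paper's actual procedure.

    def count_params(widths):
        """Parameter count of a fully connected stack with the given layer widths."""
        return sum(a * b + b for a, b in zip(widths[:-1], widths[1:]))

    def shrink_to_budget(teacher_widths, budget, step=0.9):
        """Uniformly shrink hidden-layer widths until the model fits the budget."""
        scale = 1.0
        widths = list(teacher_widths)
        while count_params(widths) > budget and scale > 0.05:
            scale *= step
            # Keep the input and output sizes fixed; scale only the hidden layers.
            widths = ([teacher_widths[0]]
                      + [max(1, int(w * scale)) for w in teacher_widths[1:-1]]
                      + [teacher_widths[-1]])
        return widths

    teacher = [3072, 1024, 512, 10]               # hypothetical teacher layer widths
    student = shrink_to_budget(teacher, budget=400_000)
    print(student, count_params(student))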

Simulation setup

In this section, we study the proposed structure using an IoT network consisting of 5 edge devices and one server. Table 2 shows the number of parameters of each network considered in the comparative study. It includes the number of convolutional and fully connected layers, as well as the amount of memory required for their training. The datasets used were CIFAR-10, CINIC-10, and SVHN. CIFAR-10 consists of 50,000 images for training and 10,000 images for testing. The CINIC-10 dataset

Conclusion

In this work, a distributed learning approach for the training of DNNs on IoT edge devices was proposed. It was a stepwise method for extracting a proper smaller (student) network structurally similar to the original (teacher) DNN while considering the IoT device memory limitations. The teacher (student) network was trained on the cloud server (edge devices). By employing the KD-based transfer learning technique, the knowledge of the teacher (student) network(s) was transferred to the edges

CRediT authorship contribution statement

Ehsan Tanghatari: Conceptualization, Methodology, Software, Writing – original draft. Mehdi Kamal: Conceptualization, Methodology, Writing – review & editing. Ali Afzali-Kusha: Conceptualization, Writing – review & editing. Massoud Pedram: Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.


References (19)

  • E. Tanghatari et al., "Distributing DNN training over IoT edge devices based on transfer learning," Neurocomputing, 2022.
  • M. Kaboli, “A review of transfer learning algorithms,”...
  • J. Chen, R. Monga, S. Bengio and R. Jozefowicz, “Revisiting Distributed Synchronous SGD,” in International Conference...
  • J. Chen and X. Ran, “Deep Learning With Edge Computing: A Review,” Proceedings of the IEEE, vol. 107, pp. 1655-1674,...
  • D. Benditkis, A. Keren, L. Mor-Yosef, T. Avidor, N. Shoham and N. Tal-Israel, “Distributed deep neural network training...
  • Z. Zhao et al., "DeepThings: distributed adaptive deep learning inference on resource-constrained IoT edge clusters," IEEE Trans. CAD of Integr. Circ. Syst., 2018.
  • Y. Kang, J. Hauswald, C. Gao, A. Rovinski, T. N. Mudge, J. Mars and L. Tang, “Neurosurgeon: Collaborative Intelligence...
  • J. Konecný, B. McMahan and D. Ramage, “Federated Optimization: Distributed Optimization Beyond the Datacenter,” CoRR,...
  • A. Krizhevsky, I. Sutskever and G.E. Hinton, "ImageNet Classification with Deep Convolutional Neural Networks," in Advances...
There are more references available in the full text version of this article.


Ehsan Tanghatari received his bachelor’s degree in Electrical Engineering from Shahid Beheshti University, Tehran, Iran, in 2014. In 2016, he received his M.Sc. degree in Digital Electronics Engineering from Sharif University of Technology, and he is currently working toward the Ph.D. degree in Electrical Engineering at the University of Tehran. His current research interests include edge computing, neuromorphic computing, and low-power design.

Mehdi Kamal received the B.Sc. degree from the Iran University of Science and Technology, Tehran, Iran, in 2005, the M.Sc. degree from the Sharif University of Technology, Tehran, in 2007, and the Ph.D. degree from the University of Tehran, Tehran, in 2013, all in computer engineering. He was an associate professor with the School of Electrical and Computer Engineering at the University of Tehran, Iran. He is currently a research scientist in the Institute for Future of Computing at the University of Southern California, Los Angeles, CA, USA. His current research interests include reliability in nanoscale design, approximate computing, neuromorphic computing, embedded systems design, and low-power design.

Ali Afzali-Kusha received the B.Sc. degree from the Sharif University of Technology, Tehran, Iran, in 1988, the M.Sc. degree from the University of Pittsburgh, Pittsburgh, PA, USA, in 1991, and the Ph.D. degree from the University of Michigan, Ann Arbor, MI, USA, in 1994, all in electrical engineering. Dr. Afzali-Kusha has been with the University of Tehran since 1995, where he is currently a Professor in the School of Electrical and Computer Engineering and the Director of the Low-Power High-Performance Nanosystems and Silicon Intelligence Laboratories. His current research interests include low-power high-performance design methodologies from the physical design level to the system level for the nanoelectronics era, efficient implementation of neural networks in different application domains, in-memory computing, and mixed-signal computations.

Massoud Pedram received the Ph.D. degree in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, CA, USA, in 1991. He is currently the Stephen and Etta Varra Professor with the Ming Hsieh Department of Electrical Engineering, University of Southern California, Los Angeles, CA, USA. He holds 10 U.S. patents and has authored four books, 12 book chapters, and over 140 archival and 350 conference papers. His current research interests range from low-power electronics, energy-efficient processing, and cloud computing to photovoltaic cell power generation, energy storage, and power conversion, and from RT-level optimization of VLSI circuits to the synthesis and physical design of quantum circuits.
