Abstract
This paper introduces a novel approach for training generative adversarial networks (GANs) using federated learning. GANs have attracted considerable attention in the research community, particularly for their ability to produce high-quality synthetic data for a variety of use cases. Yet, when combined with federated learning, these models suffer from degradation in both training time and quality of results. To address this challenge, this paper introduces a hierarchical learning technique, HFL-GAN, that enables the efficient training of federated GAN models. The proposed approach introduces an innovative mechanism that dynamically clusters participating clients to edge servers, as well as a novel multi-generator GAN architecture that utilises non-identical model aggregation stages. The approach has been evaluated on a number of benchmark datasets to measure its performance with higher numbers of participating clients. The results show that HFL-GAN outperforms other comparative state-of-the-art approaches in training GAN models in complex non-IID federated learning settings.
1 Introduction
Deep Learning (DL) has opened many doors to advanced data analysis, classification, and prediction in unprecedented ways. However, one consistent feature of DL algorithms is the enormous data and computational overhead required to train them. This data must be collected from appropriate sources, whether by the central organisation or group developing the algorithms, or from external users and client devices. The major source of data collection is devices such as mobile phones, distributed computational devices, or other interconnected IoT devices, like sensors [1]. This opens the door to considerable privacy concerns for users and their data [2], concerns that carry significant ethical and legal implications. To address them, decentralised and privacy-preserving machine learning paradigms, such as Federated Learning (FL) [3], have become of great interest to researchers, corporations, and individuals alike. This is coupled with the increasing worldwide interest in generative AI and its growing applications and challenges [4, 5].
Generative AI is an umbrella term used to describe a form of AI capable of generating unique and novel content and data [6]. Generative AI has been used in a wide array of applications and continues to attract increasing interest in numerous domains. Recent models such as ChatGPT [7] and DALL-E [8], along with other cutting-edge models, are advanced generative AI systems capable of generating diverse content such as natural language and images that do not directly replicate training data. One prominent framework of generative AI is the Generative Adversarial Network (GAN) [9]. GANs are generative models that produce synthetic data which closely matches, but does not directly repeat, the original data. They have been employed in a range of use cases, often focused on image tasks, and have shown value and potential in data augmentation [10] and image translation [11]. As data collection is a challenge for many DL algorithms, GANs' potential to augment datasets to increase data quantity or to balance non-IID data distributions can be of high value; non-IID data can result in slow convergence and difficulty in achieving an optimal global model [12, 13]. However, GAN models also require potentially large datasets to train accurately, and as a result may suffer from the same privacy concerns as traditional DL models [14, 15].
FL is a technique that enables decentralised collaborative machine learning without requiring direct access to sensitive raw data. In FL, each client device trains its own local model using its own private data. Only the model parameters - not the raw data - are shared with the central server to be aggregated into a global model. This allows the central model to learn patterns from data distributed across many devices without that data ever being directly revealed. This paradigm has the potential to alleviate the privacy concerns of both traditional DL models and generative AI. Yet FL has its own challenges that need to be addressed in order to train these models effectively. Because it is decentralised, data is fragmented across a number of clients, which creates challenges such as non-IID data and limited per-client data quantity, since data cannot be shared. In addition, data can be heterogeneous both locally and across clients, adding complexity to both local training and global aggregation.
To address those challenges, this work proposes a novel hierarchical FL model that utilises hierarchical learning and a multi-layered approach to train GAN models in highly fragmented, heterogeneous, and non-IID FL settings. Our model, dubbed HFL-GAN (Hierarchical Federated Learning - Generative Adversarial Network), introduces two distinct hierarchical steps. (1) A multi-generator GAN model per client: this ensures more robust training and, by training each generator differently, allows the model(s) to benefit from different methods of hierarchical FL and different stages of aggregation. (2) Hierarchical clustering per edge server: this allows clients to train alongside clients with a similar data distribution through statistical comparison of model weights, and likewise opens up the potential for a high level of communication efficiency.
The remainder of this paper is organised as follows. Section 2 presents preliminary details on the technologies and concepts underpinning this research, outlining FL, generative AI, and the proven benefits of hierarchical learning applied to these technologies. Section 3 introduces the details of the proposed approach, HFL-GAN, and discusses the algorithms and formulas central to this research's contributions. Section 4 presents the experimental setup and results, with clear findings and a discussion of their relevance and meaning. Section 5 explores related works in the areas of decentralised GAN and hierarchical learning, outlining their contributions and findings as well as any oversights and room for further research. Section 6 summarises and discusses the contents and findings of this research.
2 Background
This section lays the technical foundations for the proposed approach. We start by introducing FL, followed by generative AI, GANs, and hierarchical learning.
(a) Federated Learning FL has been an active and promising research area that aims to handle distributed data in a privacy-preserving manner. In FL, a fraction of clients are selected to participate in each communication round, and those clients train locally on their private datasets. A large body of work exists in FL, focusing on communication efficiency [16, 17], client selection [18,19,20], and improving training on non-IID data [21]. For example, FL with model averaging [3] has emerged as a promising approach to privacy-preserving machine learning that allows individual local clients to train a global model collaboratively without sharing their local data with the central server. Client edge devices train a local model on their private dataset before sending the model updates to the server, where they are aggregated through a process called Federated Averaging (FedAvg) [3]. FL faces many challenges, such as data heterogeneity [12]: devices on the network may hold data which is not balanced (in count or in feature space) relative to other collaborators, or certain features may be more prevalent across the entire dataset, causing the trained model to skew toward those features. The research in this paper aims to tackle these issues.
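For concreteness, the FedAvg aggregation step described above can be sketched as follows. This is a minimal illustration assuming model parameters are held as lists of NumPy arrays; the function and variable names are ours, not from [3].

```python
def fedavg(client_params, client_sizes):
    """Weighted average of client model parameters (FedAvg sketch).

    client_params : list (one entry per client) of lists of np.ndarray.
    client_sizes  : number of local training samples per client, used as weights.
    """
    total = float(sum(client_sizes))
    weights = [n / total for n in client_sizes]
    # Average each parameter tensor position-wise across clients.
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(len(client_params[0]))
    ]

# Example: two clients, one 2x2 weight matrix each; the larger client dominates.
import numpy as np
global_w = fedavg([[np.ones((2, 2))], [np.zeros((2, 2))]], client_sizes=[300, 100])
# -> one 2x2 matrix filled with 0.75
```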
FL is especially relevant and of increased importance due in part to the recent global interest in large generative AI models which need to store and learn from an increasingly large quantity of data. As such, applying FL to the domain of generative AI is a very important area of research to ensure that data privacy is maintained while also finding solutions to reducing the negative impact of fragmenting the training data.
(b) Generative AI Generative AI models, as a general rule, require large quantities of data [13] to train. Plenty of work has been done to alleviate that requirement, alongside training time, in order to allow their use in data augmentation tasks [1, 2]. As such, solving the issue of data privacy in the collection phase has become of high value to the field. In recent years, growth in computational power and innovative model designs has led to new generative models capable of effectively capturing very complex data, including high-dimensional probability distributions of images, language, and other forms of data. The different forms of generative AI suit different applications; this research focuses on GANs, which have shown effectiveness in generating synthetic data along a defined data distribution.
(c) Generative Adversarial Networks GANs, proposed by Goodfellow et al. [9], are generative models that consist of a pair of networks, a Generator G and a Discriminator D. These models compete to improve the generation of images so that they better represent the original dataset. The Discriminator estimates the probability of a given sample being synthetically generated against the probability of it being real (i.e. from the base dataset), while the Generator generates the synthetic images in an attempt to fool the Discriminator. GANs are trained using the minimax objective (1) [9]:
\[ \min_{G}\max_{D} V(D,G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_{z}(z)}[\log(1 - D(G(z)))] \quad (1) \]
Traditionally, GANs require large quantities of data to be collected and stored on a centralised server for model training; collecting this data exposes data holders to privacy risks and, in many cases, is not feasible due to legal and ethical concerns. In addition, mobile devices and other forms of edge devices have become a primary method of computing for the majority of users around the world [8]. There has been interest in the data generated by these devices for a considerable amount of time, and ethically and legally collecting this data with minimal privacy concerns is an ongoing problem that researchers continue to tackle [2, 5].
(d) Hierarchical learning Hierarchical Federated Learning typically involves assigning client edge devices to a cluster or edge server based on given criteria and/or similarity to collaborators, the aim being to train a model within these groups to produce stronger models with faster convergence and higher performance. Applying hierarchical methods to GAN architectures in centralised settings often focuses on utilising multiple Generators and/or Discriminators depending on the application, which has been shown in some cases to provide a more robust and better-performing model [22]. Different forms of multi-generator and multi-discriminator models have been explored for GANs in both centralised and decentralised settings [22, 23].
Utilising hierarchical learning in FL enables grouping of collaborators in ways that can further balance data distributions towards IID and potentially enable broader participation of clients while improving communication efficiency. Combining this with a multi-generator or multi-discriminator GAN approach may yield more robust GAN models across distributed clients and at the global server through data balancing, and can further stabilise training by preventing common GAN issues such as mode collapse [24]. In the next section, we introduce a novel approach to train GANs in heterogeneous and non-IID FL settings.
3 Proposed approach
This section proposes HFL-GAN, a multi-layer hierarchical approach to training GAN models in federated heterogeneous settings with low per-client data quantity and a high client count, an area that has been scarcely explored in the related literature. This data scenario is more realistic of real-world applications, where data may be fragmented and individual clients may not hold large quantities of well-balanced data. To this end, the approach applies hierarchical FL methods, with their potential improvements to training and communication efficiency, as well as multiple Generators per client to aid robust training. The innovation in this approach is multifaceted. Firstly, it introduces a novel multi-generator GAN architecture to aid FL training and aggregation weighting. Secondly, it applies hierarchical clustering techniques to enhance the efficiency of FL in dynamic and heterogeneous environments. Thirdly, it employs a novel training strategy that adapts aggregation to the multi-generator model with different aggregation stages, achieving robust performance in federated heterogeneous settings. To the best of our knowledge, this is the first approach to address these issues in such a way.
The proposed HFL-GAN approach works as follows. In the initial stage, clients are trained in a collaborative FL manner, communicating directly with the server. In the second stage, clients are clustered and assigned to edge servers using a similarity metric and selected cluster heads. Finally, clients are trained dynamically, communicating only with their allocated edge server, with the edge server handling all global server communication at a reduced frequency. The number of edge servers/clusters to include in the network depends on the topology and data distributions; as such, it is defined as a hyperparameter to be tuned through experimentation. The general structure of HFL-GAN is shown in Fig. 1: a number of clients each hold a GAN network with two generators and a discriminator, and each client is placed under an edge server, clustered by a given similarity metric. The stages of the proposed approach are detailed below.
(a) Multi-Generator GAN In the first stage, clients perform collaborative FL training to achieve a base level of performance, as follows. Each client \(k \in \{1,2,3,\ldots,K\}\) is assigned a GAN model containing two Generators G and one Discriminator D, as shown in Fig. 2. Each client's \(G_{0}\), \(G_{1}\), and D are trained in parallel for an initial number of communication rounds \(r=10\). After each round of training, each participating client computes the average of the parameters of \(G_{0}\) and \(G_{1}\), creating a new set of model parameters to be transmitted to the global model. The global model then performs an averaging step, i.e. FedAvg, over all participating client parameters before distributing the updated global parameters back to the participating clients in the next communication round. This allows clients to train independently while benefiting from the global data distribution, reaching a base parameter set for clustering in the next stage.
At each communication round, the global Generator for the round \(G_{r}\) is assigned to \(G(0)_{k}\), as demonstrated in (2).
\[ G(0)_{k} \leftarrow G_{r} \quad (2) \]
The average of the local generator parameters after round training is assigned to \(G(1)_{k}\), as demonstrated in (3).
\[ G(1)_{k} \leftarrow \frac{1}{G_{k}} \sum_{i=0}^{G_{k}-1} G(i)_{k} \quad (3) \]
Here \(G(0)_{k}\) and \(G(1)_{k}\) represent the first and second generator held by a client, respectively, and \(G_{k}\) represents the number of generators in the local model (two in this work); this value would change with the inclusion of additional generators.
Each Generator contains an element of the local G model parameters and a different proportion of the global \(G_{r}\) model parameters, allowing for more robust training in the early stages without the need for additional model-weighting algorithms, which would add complexity to the model and often require additional information from the client to be transmitted to the global model, increasing potential privacy concerns. With one generator holding the most recent global update and the second holding a lower weighting of the global parameters, local clients in principle retain more of their cluster-trained model parameters, leading to more robust local training, since clustered clients hold more similar local data distributions.
Once the initial rounds r are complete, the average of \(G_{0}\) and \(G_{1}\) for each client is taken again, as in (3), to produce the model parameters used for comparison-based edge assignment, adding a level of dynamic hierarchy to the federated network. This process is demonstrated by Algorithm 1.
As shown in Algorithm 1, clients start by training in a traditional FL manner for an initial number of communication rounds (Lines 1-8). During those rounds, each client trains their Generator(s) and Discriminator, sending updated averaged Generator parameters to the global server for FedAvg aggregation (Lines 9-14).
The communication and data sharing pattern of the proposed multi-generator GAN are shown in Fig. 3.
The two generators, G(0) and G(1), held by each client are trained using the same discriminator D but are handled differently within the larger federated environment. G(0) is the only generator that receives the aggregate model parameters at the start of each participating communication round, ensuring it is the most up to date with respect to global knowledge. Both Generators are updated during local training: G(0) updates the latest global model with the client's local data, while G(1) updates the client's averaged model from the end of the prior round. At the end of the round, the average of G(0) and G(1) is taken and assigned to G(1) for the next round's training. This ensures that a higher proportion of the local knowledge is maintained within G(1) while still benefiting from some of the shared global knowledge, preserving a degree of personalisation in the model and achieving more robust training across communication rounds.
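The following minimal sketch illustrates this per-round flow for a single client. The local adversarial training itself is abstracted behind a `local_train` callable, and all names are illustrative rather than drawn verbatim from our implementation.

```python
def client_round(g0, g1, d, global_g, local_train):
    """One HFL-GAN communication round for a single client (sketch).

    g0, g1, d   : lists of np.ndarray parameters for the two generators
                  and the discriminator.
    global_g    : aggregated generator parameters from the edge/global server.
    local_train : callable performing local adversarial training of both
                  generators against D, returning updated (g0, g1, d).
    """
    # G(0) is overwritten with the latest aggregate; G(1) keeps local knowledge.
    g0 = [p.copy() for p in global_g]

    # Both generators are trained locally against the same discriminator D.
    g0, g1, d = local_train(g0, g1, d)

    # Average the two generators; the result updates G(1) for the next round
    # and is also the parameter set transmitted upstream for aggregation.
    g_avg = [(a + b) / 2.0 for a, b in zip(g0, g1)]
    return g0, g_avg, d, g_avg   # new (G(0), G(1), D) plus the uploaded update
```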
(b) Hierarchical Clustering In this stage, clients are assigned to an edge server, as shown in Fig. 5, based on a similarity metric applied to the clients' generator averages. Selecting the correct number of edge servers/clusters depends largely on the client quantity, network topology, and data distribution, and is defined as a hyperparameter to be tuned in the proposed solution. In the results section, we experiment with different numbers of edge servers to showcase the resulting performance changes.
Firstly, a random initial client x is chosen as the base for further edge assignment, and the Cosine similarity is taken between this chosen client and all other clients k, as demonstrated in (4).
\[ \mathrm{sim}(G_{x}, G_{k}) = \frac{G_{x} \cdot G_{k}}{\lVert G_{x} \rVert \, \lVert G_{k} \rVert} \quad (4) \]
This similarity is computed over the average of each client's two Generator model weights. Initial clients are then assigned to the remaining edge servers by taking the least similar client and clients evenly spaced between the base and its least similar counterpart, as demonstrated by (5), where k denotes a client and S an edge server.
Through this method of client sorting and selection, a good foundation for further cluster head selection is created, due to the evenly spaced nature of the model parameters. This is depicted in Fig. 4.
As shown in Fig. 4, once the initial client is randomly selected, similarities are calculated using the chosen similarity metric, in this case Cosine similarity. These clients are sorted by similarity to enable evenly spaced selection across the client similarity vector.
Remaining clients are then assigned to the edge server whose initial assignment (cluster head) has the most similar model parameters, again using Cosine similarity as the similarity metric, as per (6).
\[ S_{k} = \underset{s}{\arg\max}\; \mathrm{sim}(G_{k}, G_{s}) \quad (6) \]
After all clients have been assigned to edge servers based on similarity, and in order to benefit from the new distribution and train on comparable data, clients are trained within their edge servers for five communication rounds only, disregarding the global model until those five rounds have completed. The edge server allocation process is demonstrated in Algorithm 2.
As shown in Algorithm 2, a client x is initially selected at random from all clients K. This client is the base for clustering, and the heads of each other cluster (edge server) are selected relative to this client's parameters. In the next step, each client's model parameters (taken as the average of both Generator parameters) are compared to client x's (Lines 3 - 5) in order to allocate the remaining edge servers (Lines 6 - 8). Finally, the remaining clients are placed into clusters/edge servers based on similarity to the initially allocated clients' parameters using the chosen similarity metric (Lines 9-11) (Fig. 5).
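The clustering stage can be sketched as follows, assuming each client is represented by the flattened average of its two generators' parameters. The evenly spaced head selection shown here is an illustrative approximation of (5), and all names are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def assign_edge_servers(client_vecs, num_servers, seed=0):
    """Cluster clients to edge servers by generator-parameter similarity (sketch).

    client_vecs : (K, d) array, one flattened averaged-generator vector per client.
    num_servers : number of edge servers / clusters (tunable hyperparameter).
    """
    rng = np.random.default_rng(seed)
    K = len(client_vecs)
    base = int(rng.integers(K))                       # random initial client x
    sims = np.array([cosine_sim(client_vecs[base], v) for v in client_vecs])

    # Sort clients by similarity to the base and pick evenly spaced cluster heads,
    # so the heads span the range from most to least similar.
    order = np.argsort(-sims)
    head_idx = np.linspace(0, K - 1, num_servers).astype(int)
    heads = [int(order[i]) for i in head_idx]

    # Every client joins the edge server whose head is most similar to it.
    assignment = {
        k: max(range(num_servers),
               key=lambda s: cosine_sim(client_vecs[k], client_vecs[heads[s]]))
        for k in range(K)
    }
    return heads, assignment
```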
In the event of dynamically joining clients, where new clients join the pool of potential contributors, the new clients first train on their own data for a selected number of epochs. Once a base level of training and performance is reached, as with existing clients, the newly joining clients are compared with the cluster heads defined in the initial clustering stage using the chosen similarity metric, in this case Cosine similarity, and assigned to the most similar edge server. They may then continue to train as part of the FL network as normal.
(c) Dynamic Training Strategy This stage introduces a training strategy that allows clients to train with different local and global aggregation stages. In order to closely represent a realistic scenario, where not all clients will be able to participate in every round, only 80% of clients per edge server are selected to participate in each round of training.
In parallel, each client from each server trains its Generators \(G(0,1)_{k}\) and Discriminator \(D_{k}\). As per (3), \(G_{k}\) is calculated for each client at the end of the communication round by taking the average of \(G(0)_{k}\) and \(G(1)_{k}\) and then replacing the parameters of \(G(1)_{k}\) with \(G_{k}\). Rather than transmitting the parameters of \(G_{k}\) and \(D_{k}\) to the global server, each participating client transmits them to its respective edge server S for model aggregation amongst clients with a theoretically similar data distribution. For all communication rounds r except those where \(r \bmod i = 0\), where i is the number of rounds between global aggregation stages, the aggregated \(G_{S}\) and \(D_{S}\) are transmitted back to the participating clients' \(G(0)_{k}\) at the beginning of the following round.
In the case of \(r \bmod i = 0\), after the edge server aggregation step is complete, each server transmits \(G_{S}\) and \(D_{S}\) to the global server for further aggregation using the standard FedAvg algorithm. The new global \(G_{r}\) and \(D_{r}\) for the following round are transmitted back to each edge server to be sent to the participating clients' \(G(0)_{k}\) at the beginning of the following round. This structure allows HFL-GAN to benefit from numerous features: (1) harnessing the robust training benefits of multi-generator GAN models, (2) preserving privacy through hierarchical data clustering, particularly useful for training on non-IID data, (3) gaining global training benefits by learning from the full breadth of data across all clients, and (4) enhancing communication efficiency by reducing communication with the global server and adding edge servers as an intermediary for client communication. Algorithm 3 outlines the dynamic training strategy.
As shown in Algorithm 3, clients are trained within their edge servers for a number of communication rounds (Lines 1 - 11), denoted by i, before global aggregation is done. This allows the GAN models to train on similar data for the remaining communication rounds. On each communication round that is divisible by i, after edge server aggregation, the edge servers send their aggregated models for further aggregation at the global server where FedAvg is performed with equal weighting (Lines 12 - 18).
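The overall schedule can be summarised by the sketch below, in which edge aggregation happens every round and global aggregation only every i rounds. The client objects, their round method, and the equal-weight average helper are illustrative placeholders rather than the exact implementation.

```python
import random

def average(param_sets):
    """Element-wise mean over equally weighted parameter sets."""
    return [sum(ps) / len(param_sets) for ps in zip(*param_sets)]

def hfl_gan_training(clients, servers, rounds, i, participation=0.8, seed=0):
    """Hierarchical HFL-GAN training schedule (sketch).

    clients : dict mapping server id -> list of client objects exposing
              client.round(edge_g, edge_d) -> (g_update, d_update).
    servers : list of edge-server ids.
    i       : number of rounds between global aggregation steps.
    """
    rng = random.Random(seed)
    edge_g = {s: None for s in servers}   # latest edge-level generator params
    edge_d = {s: None for s in servers}   # latest edge-level discriminator params

    for r in range(1, rounds + 1):
        for s in servers:
            pool = clients[s]
            selected = rng.sample(pool, max(1, int(participation * len(pool))))
            updates = [c.round(edge_g[s], edge_d[s]) for c in selected]
            # Edge aggregation every round, among this server's clients only.
            edge_g[s] = average([u[0] for u in updates])
            edge_d[s] = average([u[1] for u in updates])

        if r % i == 0:
            # Global aggregation across edge servers, then broadcast back down.
            global_g = average([edge_g[s] for s in servers])
            global_d = average([edge_d[s] for s in servers])
            for s in servers:
                edge_g[s], edge_d[s] = global_g, global_d
    return edge_g, edge_d
```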
4 Experiments and results
This section presents the experiments conducted to evaluate the proposed HFL-GAN approach. In order to thoroughly evaluate its performance, we utilise two popular benchmark datasets for GAN performance analysis, MNIST [25] and SVHN cropped [26]. As part of the evaluation, the proposed approach is compared with state-of-the-art federated GAN algorithms: a typical FL application of GAN (FL-GAN) and a single-generator algorithm that focuses on weighted aggregation using MMD as a statistical measure (IFL-GAN). This showcases the benefits of the proposed solution's hierarchical multi-generator structure over other solutions.
Dataset description: MNIST is a dataset consisting of 60,000 training samples depicting hand-written digits 0-9; each image is grayscale and 28x28 pixels in resolution. This dataset is a popular choice for both classification evaluation and image generation analysis.
SVHN cropped, similar to MNIST, consists of images of single digits 0-9; however, these are cropped from street-view house numbers and are therefore more indicative of real-world images. The dataset holds a large quantity (600,000) of digit images cropped to 32x32 pixels in full colour with three RGB channels.
Non-IID data distribution: To simulate highly imbalanced/heterogeneous environments where clients are unlikely to have the same amount and variety of data, we employ a non-IID data distribution. In this setup, each digit has an even probability of being held by each client, with each client holding somewhere between 2 and 8 digit labels. For instance Client 1 may hold digits 0 and 4 with a total data count of 100 while Client 2 holds digit labels 1, 2, 4, 6, 7, 8, and 9 with a data count of 600. This uniqueness in data distribution is a challenge that many other algorithms struggle to handle effectively, due to the disparity in both quantity and label spread.
There are a number of hyperparameters which can be tuned as part of our HFL-GAN algorithm; these parameters were shown to produce different results depending on the dataset and level of non-IID distribution. The suggested hyperparameters are given in Table 1.
The remainder of this section outlines the experiments and results using the experimental non-IID data settings on both datasets. MNIST and SVHN scores are used to quantitatively assess the performance of the HFL-GAN approach compared with similar non-hierarchical models. We also present a visual display of the generated images for subjective analysis, owing to the imperfect, but still valuable, nature of the MNIST and SVHN scores as performance metrics. Experiments are performed on a number of different client counts to demonstrate and analyse the scalability of our algorithm.
(a) MNIST Generated Image Analysis In order to evaluate the performance of the proposed approach, HFL-GAN, we apply it to the MNIST dataset following our proposed multi-stage hierarchical model and experimental setup, which utilises hierarchical methods of both FL and GAN systems. We take the MNIST data, draw a random number (2 - 8) of elements from the set of classes {0,1,2,3,4,5,6,7,8,9} per client, and distribute the data across all clients for \(K \in \{25, 50, 100\}\). Given training sample size N for each class, each of the first half of the K clients receives N / K images from each held class, while the latter half receive half of that per class. This ensures heavy data imbalance in both the classes held and the quantity across clients; for example, \(K_{6}\) may hold MNIST classes {0,4,5} with N = 60 per class while \(K_{60}\) may hold {1,2,3,4,7,8} with N = 30 per class.
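A sketch of this label- and quantity-skewed partition is given below, assuming roughly balanced classes in the source dataset; the function name, seed, and exact bookkeeping are illustrative rather than the exact script used in the experiments.

```python
import numpy as np

def partition_non_iid(labels, num_clients, seed=0):
    """Assign MNIST-style sample indices to clients in a non-IID fashion (sketch).

    labels      : 1-D array of class labels for the training set.
    num_clients : total number of clients K.
    Returns a list of index arrays, one per client.
    """
    rng = np.random.default_rng(seed)
    by_class = {c: rng.permutation(np.where(labels == c)[0]) for c in range(10)}
    cursors = {c: 0 for c in range(10)}
    client_indices = []

    for k in range(num_clients):
        # Each client holds between 2 and 8 randomly chosen digit classes.
        classes = rng.choice(10, size=int(rng.integers(2, 9)), replace=False)
        # Roughly N/K images per held class (balanced-class assumption);
        # the second half of the clients receive half as many per class.
        per_class = len(labels) // (10 * num_clients)
        if k >= num_clients // 2:
            per_class //= 2
        idx = []
        for c in classes:
            take = by_class[c][cursors[c]:cursors[c] + per_class]
            cursors[c] += per_class
            idx.extend(take)
        client_indices.append(np.array(idx))
    return client_indices
```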
Visual analysis of the generated MNIST images clearly indicates that our HFL-GAN greatly outperforms FL-GAN and IFL-GAN at higher client counts. Figure 6 shows 25 generations from each of the three models at 100 clients. Only our HFL-GAN captures the majority of digits (0, 1, 2, 3, 4, 5, 7, 8, 9), with only digit 6 not being clearly represented in the generated samples. Meanwhile, FL-GAN and IFL-GAN produce very few recognisable digits in these complex, high-dimensional non-IID client settings.
The MNIST score [23] is a metric used to quantitatively measure the performance of MNIST-generating GAN models; a higher score indicates a better-performing GAN, as calculated from a large quantity of generated synthetic images. Our HFL-GAN shows a significant improvement after 300 rounds of training, with a 29.32% improvement over the IFL-GAN results at K=100. Table 2 shows results given highly non-IID data across the tested client counts, with each client holding a very low quantity of data. Three algorithms are compared under these settings.
The results in Table 2 show that at the lower client count of 25, HFL-GAN is outperformed by FL-GAN under these settings. However, as the client count increases, HFL-GAN not only maintains a high level of performance and widens the performance gap over both FL-GAN and IFL-GAN, but its performance actually improves as the client count increases and the data becomes more fragmented. This improvement is likely due to the finer clustering opportunities afforded by a higher degree of data fragmentation: the clustering algorithm is able to better group similar clients together.
(b) MNIST Image Consistency In this experiment, the MNIST score was calculated in batches, with the standard deviation computed across the results to quantitatively measure the consistency of the generated images. This establishes not only that the model is capable of producing high-quality data, but that it produces that quality consistently. Table 3 presents the standard deviation of a range of calculated MNIST scores.
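The batching used for this consistency measurement can be sketched as follows, assuming a score_fn that maps a batch of generated images to a scalar MNIST/SVHN score and a generate function for the trained generator; both are assumed, and the latent dimension of 100 is illustrative.

```python
import numpy as np

def batched_score_std(generate, score_fn, num_batches=10, batch_size=1000, seed=0):
    """Mean and standard deviation of a GAN quality score across
    independently generated batches (sketch)."""
    rng = np.random.default_rng(seed)
    scores = []
    for _ in range(num_batches):
        noise = rng.standard_normal((batch_size, 100)).astype(np.float32)
        images = generate(noise)          # assumed generator forward pass
        scores.append(score_fn(images))   # assumed MNIST/SVHN score function
    return float(np.mean(scores)), float(np.std(scores))
```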
As shown in Table 3, at higher client counts the proposed HFL-GAN produces the most consistently high-quality image generations compared to FL-GAN and IFL-GAN; only IFL-GAN shows better consistency at the tested 25-client setting. Just as generator performance improves as the client count increases, the generator also becomes more consistent in the quality of generated images. The proposed hierarchical solution achieves this in two distinct ways. Firstly, through the hierarchical clustering, models are trained primarily with participating clients that have more similar theoretical data distributions, resulting in more stable training across communication rounds before benefiting from the wider federated network at the given intervals. Secondly, the multi-generator approach enables participating clients to retain a higher proportion of their local training, rather than completely overwriting their model parameters with the global aggregate. This maintains a level of local adaptation and consistency in training by providing local clients with a more personalised model that still benefits from the global knowledge.
These results, coupled with the quality improvements over the comparison models, showcase a clear performance benefit from a hierarchical approach to FL combined with a multi-generator GAN model architecture.
(c) MNIST Generator Training In this experiment, we once again train our HFL-GAN on MNIST, taking the MNIST score every 10 communication rounds and comparing against both FL-GAN and IFL-GAN to show the training improvements, not only in the final result but also in training efficiency and consistency. Figure 7 shows the calculated MNIST scores across the duration of training (taken every 10 communication rounds).
As per the results shown in Fig. 7, our HFL-GAN outperforms FL-GAN and IFL-GAN at both of the presented higher client counts and shows considerably more stability and scalability as more clients participate in the FL model. In the 100-client test, HFL-GAN showed a consistent and stable increase in results across communication rounds and achieved a much higher final level of performance than the comparative non-hierarchical models. While initially performing worse than its competitors, once the clients have had time to train within their clusters there is a spike in performance, especially at higher client counts. The results further demonstrate that our HFL-GAN consistently outperforms the more straightforward FL-GAN and the statistical-property-based IFL-GAN at higher client counts, and at the very least closely matches their performance at lower counts.
There is no perfect quantitative measure for GAN models, so some visual comparison is required, as in Fig. 6. Visually comparing the results of all three algorithms shows clearly that our HFL-GAN outperforms both FL-GAN and the MMD-based IFL-GAN at higher client counts with low per-client data quantity, producing recognisable synthetic images covering the majority of MNIST classes present in the training data with good diversity.
(d) MNIST Cluster Setting Experimentation For this experiment, HFL-GAN is run with only the edge server count hyperparameter changed, ensuring all other variables remain constant; this demonstrates the fine-tuning of this hyperparameter and its impact on training.
Prior results are produced assuming 5 edge servers for the clients to be distributed across. Depending on client count and data settings, the edge server / cluster count can be a key hyperparameter to tune; as such, results of experimentation on K = 50 and K = 100 can be seen in Table 4, with edge server counts of [2,3,5,10].
The results in Table 4 show that, for 100 clients, 5 edge servers is the best setting, with performance following a simple curve that peaks at 5 edge servers before declining again at higher counts. These results are likely to change for different client quantities, data types, and data distributions, as represented by K = 50.
The K = 50 results in Table 4 show the best-performing architecture in these experiments to consist of 10 edge servers for 50 clients, again following a pattern of consistently climbing performance until peaking at the higher server count.
(e) SVHN Generated Image Analysis As a further experiment, we demonstrate the ability of our HFL-GAN on a second, more complex colour image dataset, i.e. SVHN, in order to establish and demonstrate the accuracy and capabilities of our proposed approach. Figure 8 depicts visual results of generated SVHN_Cropped data utilising FL-GAN, IFL-GAN, and HFL-GAN, respectively.
Visual analysis of the results on the SVHN dataset shows a similar trend to that of MNIST. Our HFL-GAN appears to produce the most varied and recognisable generated digits, with 1, 2, 3, 5, 6, 7, and 8 all being represented clearly and 0, 4, and 9 being either absent or unclear. FL-GAN displays the worst performance of the three models, as expected given its simpler structure. SVHN scores across a range of client counts are presented in Table 5 as a quantitative measure of performance.
With the non-IID SVHN dataset across a high number of clients, different hyperparameters were selected. The same similarity metric, Cosine similarity, is used; however, less frequent global communication allowed for further cluster training, which gave better results on the more complex colour images (in the case of K = 50, 10 rounds of cluster training before global aggregation).
The results in Table 5 show our HFL-GAN consistently outperforming both FL-GAN and IFL-GAN although with a lower quantitative margin compared to our MNIST score results. Our visual analysis results provide a better indicator in this case due to the imperfect nature of SVHN score and MNIST score as a measure and the difficulty in quantitatively assessing GAN quality.
(f) SVHN Image Consistency Similar to the prior experiments on MNIST, in this experiment, the SVHN score was calculated in batches with a calculated standard deviation to quantitatively measure the consistency of the generated images.
The results of these experiments in Table 6 showcase the considerable improvements to the generated synthetic image consistency in the proposed HFL-GAN approach. These improvements can be seen across the board, with the most notable being a \(31.03\%\) reduction in standard deviation between IFL-GAN and the proposed solution on 50 clients. This improvement is due to the collaborative training benefiting from the enhanced consistency of the two-stage multi-generator training approach, as well as the performance improvements from the hierarchical clustering, where clients are trained in clusters of clients who hold the most similar model parameters.
(g) SVHN Generator Training In this experiment, as with the MNIST dataset, SVHN score results are taken every 10 communication rounds for a total of 300 communication rounds. These results are compared against both FL-GAN and IFL-GAN to showcase the training behaviour, not only the final result. Figure 9 shows the results across all 300 communication rounds.
The results in Fig. 9 showcase the training benefits of the proposed HFL-GAN, especially at higher client counts. In both \(K = 50\) and \(K = 100\), the proposed approach achieves a higher resulting SVHN score. At higher client counts, training is more consistent due to the nature of the clustering approach and the availability of finer-grained clustering. On this dataset, the comparison approach IFL-GAN peaked higher but dropped to the lowest quality of the three approaches in the final communication round, showing unstable training; HFL-GAN shows a more stable climb, leading to better results overall.
(h) SVHN Cluster Setting Experimentation In this experiment, identical to the cluster experimentation on the MNIST dataset, the proposed HFL-GAN cluster count hyperparameter is adjusted to demonstrate the impact on training.
The results in Table 7 clearly showcase the considerable difference that tuning the cluster count hyperparameter can make. At client count \(K = 50\), with no changes other than the number of clusters, a cluster count of 5 gives the best result; as such, this is the count used in all prior experiments. For \(K = 100\), a cluster count of 10 results in the highest SVHN score. These numbers are the reverse of the MNIST experimentation, demonstrating the value of tuning this hyperparameter for the given dataset and distribution.
(i) Ablation Study In this section, we conduct an ablation study to evaluate the contributions of each component in the proposed approach, specifically the multi-generator architecture and the clustering mechanism. The goal of this study is to systematically strip different aspects of the architecture to showcase their individual and combined benefits. We compare three variants: (1) No Multigen & No Clustering: this variant represents a baseline without the proposed multi-generator framework or clustering, reflecting a traditional FL setup. (2) No Multigen: This variant removes the multi-generator aspect, retaining the clustering mechanism. (3) No Clustering: Here, clustering is removed while keeping the multi-generator architecture intact.
The experiments were conducted using the MNIST dataset on client counts of \(K = 25\), \(K = 50\), and \(K = 100\), and the MNIST score (MS) is used as a performance metric. The standard deviation of MNIST scores is also calculated to assess model consistency.
Across almost all of the experiments, both the multi-generator architecture and the clustering mechanism outperform the traditional FL architecture. Typically, clustering shows the largest improvement in average MNIST score, which is to be expected as clients theoretically train on more similar data within their clusters. The multi-generator architecture typically shows the highest improvement in model consistency, as reflected in the MNIST score standard deviations; this too is in line with expectations, since the multi-generator architecture allows clients to hold more personalised models and aids training consistency through reduced fluctuation of model parameters between communication rounds and more tailored model weights.
Following the MNIST Score ablation results, to further prove the benefits and contributions of the different components of the proposed approach, the same experiments are run on the SVHN dataset. These results can be seen in their entirety in Table 9.
The results in Table 8 show a similar trend on the colour SVHN dataset to that observed on MNIST. The clustering mechanism without the multi-generator architecture typically achieves a higher SVHN score than the other ablated variants, while the multi-generator architecture alone produces more consistent synthetic image generations, with a lower SVHN score standard deviation than the clustering mechanism. Combining the two, as with MNIST, provides the benefits of both: higher image quality with enhanced consistency (Table 9).
For further depiction of the results, the MNIST Score and standard deviation over the course of 300 rounds of training are shown in Fig. 10.
Earlier experiments in this section demonstrate that the combination of these two approaches provides more consistent training, resulting in more stable training and more performant models overall. While in some cases an individual component, such as removing the multi-generator approach at 25 clients, results in a better MNIST score, the combination still yields an improvement in model consistency. In many cases, this combination performs better overall than the sum of its parts.
Figure 10 visualises both the MNIST score (a) and the standard deviation (b) to showcase the value of each element of the proposed solution. In the MNIST score graph, the experiments with a single generator plus clustering show a more pronounced early increase in performance and peak higher than the multi-generator architecture, but the multi-generator results show greater training consistency, with fewer fluctuations across rounds.
5 Related works and discussion
Federated Learning is an active field of research, and while it is infrequently applied to GAN training, there have been some fundamental and important steps made in the field. This section outlines important and comparative works in the areas of Federated Learning and GAN.
(a) FL The original Federated Learning paper by McMahan et al. [3] proposed the core concept of FL with model averaging, with clients training in a decentralised fashion to achieve a trained global model. The authors state that their algorithm is robust to unbalanced and non-IID data; however, subsequent literature [12, 27, 28] has shown this to remain an ongoing challenge in FL.
In an experimental study by Li et al. [27], a number of FL algorithms focused on tackling the problem of non-IID data were tested on non-IID data silos; the authors found that these imbalanced data scenarios remain an ongoing challenge and that none of the proposed algorithms addressed them in all cases. These algorithms proposed a number of different solutions to the non-IID problem; one commonly proposed solution is to establish hierarchy in FL networks.
(b) Hierarchical learning in FL Hierarchical learning refers to adding a level of hierarchy to the FL model. Typically, hierarchical FL comes in the form of an added layer of edge servers between edge devices and the global server; however, other forms of hierarchy, such as clustering, also exist within the field.
One way accuracy and/or speed of convergence has been shown to improve is through the employment of hierarchical systems, as explored by Briggs et al. [28]. In their work, they proposed FL with hierarchical clustering of model updates, performed by vectorising the local model updates into samples, taking the distance between all samples to judge similarity, and merging the most similar into a cluster. They found that, for classification tasks, training these clusters independently resulted in faster convergence and fewer necessary training rounds on their experimental data settings.
Abdellatif et al. [29] considered a hierarchical FL structure consisting of a number of clients and edge servers, as well as the global server. In their implementation, client edge devices are assigned to edge nodes based on both network topology, and statistical properties of the local datasets. This ensured both higher statistical stability in the face of non-IID data, as well as a more robust system of communication across the entire federated network of edge devices. The edge servers are responsible for synchronising user models, while the main cloud server synchronises the edge server models.
Hierarchical FL has typically been applied to traditional classification and prediction models to assess the performance of these techniques. However, interest in generative AI has been growing, and as data privacy concerns become a larger issue, finding ways to apply FL to create decentralised GAN models is of high value.
(c) Decentralised GAN Similar to traditional deep learning models, GANs are typically centralised on a server or other host and require all of the data to be held by that same host. Decentralised GAN consists of training the GAN across many devices, which is where paradigms such as FL come in.
Federated GAN models include FL-GAN, proposed by Hardy et al. [30], a straightforward adaptation of the original FL approach of McMahan et al. [3] to GANs. A number of client edge devices, each containing their own GAN model, are trained on their local datasets; each client trains its tightly coupled generator and discriminator, and the standard FedAvg algorithm is then leveraged to aggregate a global model at a parameter server by averaging the parameters at the server level. This is very effective provided that the data held across clients is IID; however, it is not realistic to assume an IID data distribution in real-world settings, and as such FL-GAN fails in many scenarios where IID data is not present or the data quantity is low.
MD-GAN, or Multi-Discriminator Generative Adversarial Networks [30], was proposed alongside FL-GAN as an alternative decentralised training model that adapts the federated learning paradigm. The aim of MD-GAN is to reduce computation on client edge devices by operating a single generator G at the server, while discriminators D are held by the participating clients. As GAN training is by nature an iterative and competitive process, in MD-GAN the generator is pitted against each client's discriminator. Each discriminator is trained on its client's local dataset to learn to predict whether data is real or generated by the global generator. Once all discriminator predictions are complete across the participating clients, the server computes the gradient of its parameters using all of the client discriminator feedback and updates the generator parameters using the chosen optimiser algorithm.
Improving upon FL-GAN, Li et al. [8] proposed IFL-GAN (Improved Federated Learning GAN), which applies a weighted averaging mechanism to the model aggregation step of FL-GAN, utilising each client's maximum mean discrepancy (MMD) score [9] to derive the scaling factor. This ensures that clients who generate the highest-quality images are given priority at the aggregation step. IFL-GAN works very well for lower client counts and was only tested with low client numbers in its experimentation; however, as per the experiments given, it also assumes that feature labels are split fairly evenly across the clients with only consistently disparate quantity differences, and it struggles to produce quality, varied image generations in a reasonable number of communication rounds given a larger number of clients with an extreme non-IID data distribution, in quantity as well as labels. This form of statistical-property-focused FL aggregation may similarly prove challenging and non-scalable in larger FL networks.
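For contrast with the equal-weight FedAvg used in HFL-GAN's aggregation steps, the following sketch illustrates the kind of MMD-weighted aggregation IFL-GAN describes; the inverse-MMD normalisation shown here is illustrative, and the exact weighting scheme of [8] may differ.

```python
import numpy as np

def mmd_weighted_average(client_params, mmd_scores):
    """Aggregate client generator parameters with weights derived from
    per-client MMD (lower MMD = generations closer to real data). Sketch only.

    client_params : list of parameter lists (one per client, np.ndarray entries).
    mmd_scores    : per-client MMD between generated and real samples.
    """
    # Lower MMD should receive a higher weight; invert and normalise.
    inv = 1.0 / (np.asarray(mmd_scores, dtype=float) + 1e-8)
    weights = inv / inv.sum()
    return [
        sum(w * params[i] for w, params in zip(weights, client_params))
        for i in range(len(client_params[0]))
    ]
```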
These models proved effective in testing; however, all similarly showed a lack of scalability and a lack of tests with high client counts. It is likely that they suffer from training difficulties at the high end and require further GAN-specific development and exploration, such as the employment of additional generators as explored by our HFL-GAN.
(d) Multi-generator GAN Multi-generator and multi-discriminator GANs are adaptations of the traditional GAN network to expand with additional generator and/or discriminator models.
Hoang, et al. [31] proposed adapting GAN by adding additional generators to the network as a means to overcome the mode collapse problem and stabilise training. They found this method to be scalable and to demonstrate improved performance over other state-of-the art GAN models on various 2D image datasets.
Likewise, Ghosh et al. [23] explored multi-generator GANs with similar results, showcasing good performance on diverse and challenging datasets. They also proposed another algorithm that encouraged different generators to generate diverse samples based on a similarity metric, which showed effectiveness in image-to-image translation tasks.
Our adapted version of a multi-generator GAN relates directly to the distributed nature of federated learning and the need for model aggregation. In our tests, it adapts well and provides a level of robust training to federated networks. This was empirically explored throughout the discussion and results sections; future work is set to include a detailed theoretical analysis of all components of this research.
6 Conclusion
This paper introduced HFL-GAN, a novel approach to training GAN models on highly fragmented non-IID training data across large numbers of participating clients in a way that is both efficient and highly scalable. Experiments have demonstrated clear improvements on non-IID data and large client counts compared to similar algorithms such as FL-GAN and IFL-GAN. In the same number of communication rounds, our HFL-GAN produced more recognisable and diverse synthetic data than comparative federated GAN models, and the approach is robust and scalable to large federated networks. The experiments also show the impact of hyperparameters such as cluster/server count at different client counts under our experimental settings. Future work is set to explore further methods of clustering that take network topology into account in order to improve communication efficiency as well as training performance. In addition, further work could be directed towards developing a dynamic clustering algorithm allowing the federated network to adapt across successive communication rounds. Further experimentation could explore the benefits of similar hierarchical multi-generator solutions on different data distributions and quantities.
Data Availability
The benchmark datasets MNIST and SVHN-Cropped are publicly available and can be downloaded at http://yann.lecun.com/exdb/mnist/ and http://ufldl.stanford.edu/housenumbers/, respectively.
References
Jiang L, Dai B, Wu W, Loy CC (2021) Deceive D: Adaptive pseudo augmentation for GAN training with limited data. arXiv. https://doi.org/10.48550/arXiv.2111.06849
Liu B, Ding M, Shaham S, Rahayu W, Farokhi F, Lin Z (2021) When machine learning meets privacy: A survey and outlook. ACM Comput Surv 54(2):31–13136. https://doi.org/10.1145/3436755
McMahan HB, Moore E, Ramage D, Hampson S, Arcas BAy (2023) Communication-efficient learning of deep networks from decentralized data. arXiv. https://doi.org/10.48550/arXiv.1602.05629
Saxena D, Cao J (2021) Generative adversarial networks (GANs): Challenges, solutions, and future directions. ACM Comput Surv 54(3):63–16342. https://doi.org/10.1145/3446374
Fui-Hoon Nah F, Zheng R, Cai J, Siau K, Chen L (2023) Generative AI and ChatGPT: Applications, challenges, and AI-human collaboration. J Inf Technol Case Appl Res 25(3):277–304. https://doi.org/10.1080/15228053.2023.2233814
Wang K, Gou C, Duan Y, Lin Y, Zheng X, Wang F-Y (2017) Generative adversarial networks: introduction and outlook. IEEE/CAA J Autom Sinica 4(4):588–598. https://doi.org/10.1109/JAS.2017.7510583
OpenAI (2022) Introducing ChatGPT
Li W, Chen J, Wang Z, Shen Z, Ma C, Cui X (2022) IFL-GAN: Improved federated learning generative adversarial network with maximum mean discrepancy model aggregation. IEEE Trans Neural Netw Learn Syst, pp 1–14. https://doi.org/10.1109/TNNLS.2022.3167482
Goodfellow IJ, Pouget-Abadie J, Mirza M, Xu B, Warde-Farley D, Ozair S, Courville A, Bengio Y (2014) Generative adversarial networks. arXiv. https://doi.org/10.48550/arXiv.1406.2661
Biswas A, Nasim MDAA, Imran A, Sejuty AT, Fairooz F, Puppala S, Talukder S (2023) Generative adversarial networks for data augmentation
Krizhevsky A, Sutskever I, Hinton GE (2012) ImageNet classification with deep convolutional neural networks. In: Advances in neural information processing systems, vol 25. Curran Associates, Inc
Zhu H, Xu J, Liu S, Jin Y (2021) Federated Learning on Non-IID Data: A Survey. arXiv. https://doi.org/10.48550/arXiv.2106.06843
Nuha FU, Afiahayati (2018) Training dataset reduction on generative adversarial network. Procedia Comput Sci 144:133–139. https://doi.org/10.1016/j.procs.2018.10.513
Budach L, Feuerpfeil M, Ihde N, Nathansen A, Noack N, Patzlaff H, Naumann F, Harmouch H (2022) The effects of data quality on machine learning performance. arXiv. https://doi.org/10.48550/arXiv.2207.14529
Karras T, Aittala M, Hellsten J, Laine S, Lehtinen J, Aila T (2020) Training generative adversarial networks with limited data. In: Advances in neural information processing systems, vol 33, pp 12104–12114. Curran Associates, Inc
Shahid O, Pouriyeh S, Parizi RM, Sheng QZ, Srivastava G, Zhao L (2021) Communication efficiency in federated learning: achievements and challenges. arXiv
Wu C, Wu F, Lyu L, Huang Y, Xie X (2022) Communication-efficient federated learning via knowledge distillation. Nat Commun 13(1):2032. https://doi.org/10.1038/s41467-022-29763-x
Nishio T, Yonetani R (2019) Client selection for federated learning with heterogeneous resources in mobile edge. In: ICC 2019 - 2019 IEEE International Conference on Communications (ICC), pp 1–7. ISSN: 1938-1883
Cho YJ, Wang J, Joshi G (2020) Client selection in federated learning: convergence analysis and power-of-choice selection strategies. arXiv
Tang M, Ning X, Wang Y, Sun J, Wang Y, Li H, Chen Y (2022) FedCor: Correlation-based active client selection strategy for heterogeneous federated learning. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10092–10101. https://doi.org/10.1109/CVPR52688.2022.00986. ISSN: 2575-7075
Gao L, Fu H, Li L, Chen Y, Xu M, Xu C-Z (2022) FedDC: Federated learning with non-IID data via local drift decoupling and correction. In: 2022 IEEE/CVF conference on computer vision and pattern recognition (CVPR), pp 10102–10111. https://doi.org/10.1109/CVPR52688.2022.00987. ISSN: 2575-7075
Thanh-Tung H, Tran T (2020) On catastrophic forgetting and mode collapse in generative adversarial networks. arXiv. https://doi.org/10.48550/arXiv.1807.04015
Ghosh A, Kulharia V, Namboodiri V, Torr PHS, Dokania PK (2018) Multi-agent diverse generative adversarial networks. arXiv. https://doi.org/10.48550/arXiv.1704.02906
Al-Rubaie M, Chang JM (2019) Privacy-preserving machine learning: Threats and solutions. IEEE Secur Priv 17(2):49–58. https://doi.org/10.1109/MSEC.2018.2888775
Ramesh A, Dhariwal P, Nichol A, Chu C, Chen M (2022) Hierarchical Text-Conditional Image Generation with CLIP Latents. arXiv. https://doi.org/10.48550/arXiv.2204.06125
Meskó B, Topol EJ (2023) The imperative for regulatory oversight of large language models (or generative AI) in healthcare. npj Digit Med 6(1):1–6. https://doi.org/10.1038/s41746-023-00873-0
Li Q, Diao Y, Chen Q, He B (2021) Federated Learning on Non-IID Data Silos: An Experimental Study. arXiv. https://doi.org/10.48550/arXiv.2102.02079
Briggs C, Fan Z, Andras P (2020) Federated learning with hierarchical clustering of local updates to improve training on non-IID data. arXiv. https://doi.org/10.48550/arXiv.2004.11791
Abdellatif AA, Mhaisen N, Mohamed A, Erbad A, Guizani M, Dawy Z, Nasreddine W (2022) Communication-efficient hierarchical federated learning for IoT heterogeneous systems with imbalanced data. Future Gener Comput Syst 128:406–419. https://doi.org/10.1016/j.future.2021.10.016
Hardy C, Le Merrer E, Sericola B (2019) MD-GAN: Multi-discriminator generative adversarial networks for distributed datasets. In: 2019 IEEE international parallel and distributed processing symposium (IPDPS), pp 866–877. https://doi.org/10.1109/IPDPS.2019.00095. ISSN: 1530-2075
Hoang Q, Nguyen TD, Le T, Phung D (2017) Multi-generator generative adversarial nets. arXiv. https://doi.org/10.48550/arXiv.1708.02556
Author information
Contributions
Lewis Petch contributed to the formulation of the proposed approach and experimentation performed as well as initial drafting and further editing of the manuscript. Ahmed Moustafa contributed substantially to the revision of the manuscript and provided ongoing assistance through approach formulation. Xinhui Ma contributed to the revision of the manuscript. Mohammad Yasser contributed to the revision of the manuscript.
Ethics declarations
Conflicts of interest
The authors confirm that they have no financial or non-financial conflicts of interest.
Data Informed Consent
All data used is publicly available with no ethical concerns attached.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.