
Heterogeneous Energy-aware Load Balancing for Industry 4.0 and IoT Environments

Published: 10 August 2022


Abstract

With the improvement of global infrastructure, Cyber-Physical Systems (CPS) have become an important component of Industry 4.0. In such systems, applications and machines work together to manage interdependent tasks. Machine learning methods in CPS require the monitoring of computational algorithms, including adopting optimizations, fine-tuning cyber systems, improving resource utilization, and reducing vulnerability and computation time. By leveraging the tremendous parallelism provided by General-Purpose Graphics Processing Units (GPGPUs) and OpenCL, the execution time of data-parallel programs can be reduced dramatically. However, when an application with a small amount of data runs on a GPU, GPU resources are wasted because the program may not fully utilize the GPU cores: kernels have no mechanism to share a GPU, since there is no OS support for GPUs. Optimal device selection is therefore required to curb the GPU's high power consumption. In this paper, we propose an energy reduction method for heterogeneous clusters. The study focuses on load balancing: resource-aware processor selection is performed with machine learning based on code features. The proposed method identifies energy-efficient kernel candidates from the job pool and then selects, from all possibilities, a pair of kernel candidates that reduces both energy consumption and execution time. Experimental results show that the proposed kernel approach reduces execution time by a factor of 2.23 compared to a baseline scheduling system, and that it runs 1.2 times faster than state-of-the-art approaches.


1 INTRODUCTION

Advances in national manufacturing strategies can help nations design and build high-quality machines at scale. The Industrial Internet and manufacturing systems based on Cyber-Physical Systems (CPS) are driving the development of smart manufacturing and the smart factory [32], reducing both production time and cost. However, cyber-physical systems are incomplete without intelligent decision-making [29]. This has created demand for energy-efficient algorithms and the tools necessary for efficient production and use. Load balancing, combined with energy-efficient methods, can help smart factories increase production and reduce operating costs.

Programmers typically map OpenCL programs to GPUs to exploit massive parallelism through high-speed execution. The architectural differences between a CPU and a GPU affect the suitability of tasks: only some OpenCL workloads are suitable for GPU-based execution, while others are better suited to the CPU. Given the increasing number of ported OpenCL programs, an efficient scheduler is required to balance kernels of different OpenCL applications (i.e., a job pool of submitted applications) on heterogeneous CPU-GPU systems. The task scheduler makes kernel mapping decisions based on application requirements and device suitability to optimize execution throughput.

Parallel architectures can be strengthened through specialized processing devices such as Graphics Processing Units (GPUs). GPUs were originally developed for graphical tasks but are now used for general-purpose programming as well. They offer a high degree of parallelism, and their enormous computing power makes them ideal for data-parallel, throughput-oriented applications. A GPU typically consists of many cores that execute instructions in single instruction, multiple data (SIMD) fashion; a processor known for its graphics rendering capabilities can therefore also perform mathematical computations over large amounts of data. The CPU, in contrast, consists of a limited number of highly clocked cores that deliver fast response times, incorporating microarchitectural innovations such as out-of-order execution, branch prediction, and superscalar execution, to name a few.

Most modern systems contain both multi-core CPUs and GPUs, and this is driving heterogeneous computing, a new paradigm for computing operations. A heterogeneous system architecture (HSA) can be defined as a system that uses multiple processors, either CPUs or GPUs, to process tasks. Such multi-core systems improve performance not only by adding more cores but also by introducing specialized processing of complicated tasks while remaining energy efficient. Modern workloads are not only processor-intensive but also constantly evolving, bringing new application areas with increasingly diverse requirements. If only CPUs are used to execute these tasks, the system cannot keep up with this growing diversity as information technologies expand rapidly. Heterogeneous computing enables more efficient use of processors (CPU/GPU) to handle such new workloads.

Making effective use of a wide range of processors can improve processing capacity while maintaining high throughput and low turnaround time, and it allows users to find solutions to even the most complicated tasks. No single processor can perform all tasks well: some processors excel in certain areas while performing poorly in others. Once the capacities of different processors are understood, it is possible to choose the optimal combination of processors for specific tasks and jobs. Such diversification allows heterogeneous computing to thrive in today's technologically advanced world. With continued research and development in heterogeneous computing, processors that communicate with each other could enable new experiences for consumers in the coming decades.

Details of the OpenCL architecture workflow are shown in Figure 2. Stone et al. [33] developed OpenCL, a standard platform that allows both administrators and users to create an environment in which the same program can run on different types of processing units. Although OpenCL provides this portability, performance varies across the parts of a heterogeneous framework: one application may run significantly faster on a GPU than on a CPU, while another may degrade significantly when assigned to the GPU and run faster on the CPU. Programmers typically assign tasks to either the CPU or the GPU, leaving the other processing unit unused.
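As an illustration of this portability (our sketch, not code from the paper; PyOpenCL and a trivial vec_add kernel are assumed), the same kernel source can be built unchanged for either device type:

```python
# Minimal PyOpenCL sketch: one OpenCL kernel source, built for a CPU or a GPU.
import pyopencl as cl

KERNEL_SRC = """
__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float *c) {
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
"""

def build_for(device_type):
    # Walk the installed platforms and build the program on the first
    # device of the requested type (CPU or GPU).
    for platform in cl.get_platforms():
        try:
            devices = platform.get_devices(device_type=device_type)
        except cl.RuntimeError:  # platform exposes no device of this type
            continue
        if devices:
            ctx = cl.Context(devices[:1])
            return ctx, cl.Program(ctx, KERNEL_SRC).build()
    raise RuntimeError("no device of the requested type found")

cpu_ctx, cpu_prog = build_for(cl.device_type.CPU)  # same source, CPU target
gpu_ctx, gpu_prog = build_for(cl.device_type.GPU)  # same source, GPU target
```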


2 MOTIVATION

Implementing advanced technologies to improve platform operations is critical in the age of Industry 4.0. Given the importance of platforms and the widespread use of advanced technologies in Industry 4.0, data analytics requires processing large and diverse data sets with appropriate computational techniques, and this demands intelligent mapping to minimize overload problems. The scenario and motivation are shown in Figure 1. Programmers typically map OpenCL programs to GPUs to take advantage of massive parallelism through high-speed execution [22]. The architectural differences between a CPU and a GPU affect task compatibility: only some OpenCL workloads are suitable for GPU-based execution, while others are suited to CPU-based execution [3]. Given the increasing number of ported OpenCL programs, there is a need for an efficient scheduler that balances the kernels of numerous OpenCL applications (i.e., a job pool of submitted applications) on heterogeneous CPU-GPU systems.


Fig. 1. Overloading issue in the heterogeneous computing environment.


Fig. 2. A workflow architecture of OpenCL.

For data-parallel applications, the GPU is generally preferred, although the CPU is preferred for certain scientific applications (e.g., the dot product or breadth-first search). The same set of applications may benefit differently from a multi-core GPU device when run with different input sizes [3]. Depending on the processing unit (the CPU favors sequential tasks, the GPU data-parallel tasks), the size of the input data, and the type of operations, a program may gain little benefit. Suppose that one resource (the GPU) in the heterogeneous system's cluster is powerful in terms of execution time; that processor then becomes overloaded by the job pool, and execution takes a long time. The problem can be addressed by selecting the application type and the computational processor based on the application workload. Selecting an energy-efficient device allocates the workload to the appropriate processor in the cluster according to the application requirements. Our goal is to accelerate application execution by leveraging the combined resources of a heterogeneous device cluster, achieving the acceleration through careful task allocation while limiting energy consumption.

2.1 Energy Consumption of Jobs on Devices

Consider the watt consumption per millisecond of the programs in Figure 4. We ran each program with the same data size on both the CPU and the GPU. The Y-axis shows the power consumption of the program per millisecond (msec). The graph shows that the Polybench kernels (3mm, 2mm, and gemm) and AMD's BO (Binomial Option, Multi-GPU) have significantly longer execution times on the CPU, while Polybench's covariance and matrix-vector multiplication and AMD's matrix multiplication usually have significantly longer execution times on the GPU. Based on the experimental analysis in Figure 4, some programs clearly execute faster and consume less power on the CPU, while other programs perform much better on the GPU. This shows that a generic job scheduler cannot randomly or blindly assign active programs to CPUs/GPUs: such a methodology hampers the system and leads to long execution times and higher power consumption. There is therefore an urgent need for a reliable scheduling heuristic that assigns each program to the device on which it executes faster than on the alternatives.

2.2 Device Suitability

To illustrate the effects of device choice on load balancing and suitability, Figure 3 shows an example of actual execution using four scenarios of pool-based job scheduling for thirty-four OpenCL applications. Of these 34 jobs, 20 are suitable for GPUs, while the remaining 14 are suitable for CPUs. With device-based scheduling, a given job is first assigned to the device on which it executes faster. The GPU-only scheduler represents jobs that exclusively use the GPU. In Figure 3, the fourth scheduler shows the assignment of jobs to suitable devices; in addition, for each assignment with load balancing, jobs with lower execution times are shifted from overloaded devices to less loaded devices. Figure 3 shows that a scheduling system that performs load balancing can achieve 2 to 3.5 times higher performance in direct comparison with GPU-only scheduling or device-suitability schemes.


Fig. 3. Scheduling methods and the energy usage.


Fig. 4. Watt consumption of the GPU and CPU for different data sizes.

2.3 Contribution

The prevailing trend in heterogeneous computing is to burden the programmer with scheduling tasks. Programmers usually use a standard scheduling strategy [1] in which the parallel component of the program (the kernel) is assigned to the GPU, while the CPU executes the serial part (kernel management). The CPU then sits idle while the graphics processor performs all calculations, even though OpenCL makes it possible to run a program on both the CPU and the GPU. This wastes valuable CPU resources, consuming power and energy without performing any useful task [4].

This study addresses clustering-based heterogeneous systems, each node containing a multi-core CPU and a multi-core GPU. Our goal is to accelerate a set of applications by aggregating the resources available in a heterogeneous cluster of devices. Acceleration is achieved by distributing the workload to reduce the energy consumption of the task pool while improving resource utilization. Based on the obtained results, we developed a new scheduling strategy for a heterogeneous system cluster that predicts a prioritized list of the most suitable resources for a given application and integrates it into the system. Our main contributions are as follows:

  • To build a resource classification model, we first developed a feature extraction tool that extracts code features for the proposed model.

  • To determine the suitability for a particular processor, we created an energy-based resource classification.

  • We proposed a heuristic-based load balancer that uses energy as an optimization parameter to allocate workloads to the appropriate devices.


3 LITERATURE REVIEW

Luk et al. [28] developed a method that can be used in conjunction with application kernel scheduling. Their application-specific scheduling mechanism assigns each kernel to a specific processor: a schedule function classifies kernel-based programs across processors, and the chosen kernel is then allocated to both the CPU and the GPU. Execution times are recorded in the Qilin index, and the recorded executions are used to project and schedule new programs; when the hardware configuration changes, a new training session is required for the scheduling algorithm. The drawback of index-based Qilin scheduling is the overhead of code instrumentation, which our proposed approach avoids.

Huchant et al. [19] provide a scheduler for heterogeneous OpenCL devices. Their approach addresses iterative computation with heterogeneous communication requirements and load balancing. The method consists of two parts: it first partitions a kernel so that it can be mapped to two separate processors, and the execution time of the partitioned kernel is then logged. During scheduling, the kernel device queue is adjusted to achieve high throughput. Huchant et al.'s [19] approach differs from our strategy in that it maps a single OpenCL kernel, while our proposed method schedules an application pool based on each processor's energy and execution-time compatibility.

Albayrak et al. [8] developed a paradigm for scheduling heterogeneous devices. Their multi-application scheduling approach is based on the varying requirements of applications: some applications demand many CPU cycles, while others consume far fewer, so an effective scheduling approach is needed to reduce wasted resources. Iterative scheduling based on data dependencies and execution-time profiles is used, and kernels are then mapped to devices with a greedy technique. Our proposed method, in contrast, does not require profiling because the learning process generalizes and predicts the best set of devices from the configuration.

Augonnet et al. [9] proposed a novel method for scheduling a device based on CPU workload and application parameters. The StarPU [9] model provides the execution environment for the runtime kernel. The approach includes priority-based, non-priority-based, ws-policy, w-rand-policy, and heft-tm-policy strategies; among these, the priority-based strategy was used to select the device with the best performance. The priority-based system prioritizes specific tasks, while the non-priority-based approach does not use a priority list. The ws-policy uses work stealing, assigning a job to a processor as soon as that processor becomes available. The w-rand policy uses probability-based job allocation to achieve high throughput. The heft-tm policy uses profiling to assign the processing unit to the available resource. The designed model was evaluated with mathematical applications; in contrast, our proposed model can schedule a wide range of jobs because the hardware supports all task types in multi-CPU-GPU systems.

Becchi et al. [10] indicate that optimized results can be obtained by exploiting hardware features. Their scheduler orders tasks on a given configuration using both execution time and data-transfer-based methods. To optimize tasks, the approach requires code instrumentation for the given configuration. The technique delivers high performance because computational data is allocated close to the processing unit. However, it [10] incurs offline profiling costs, while the proposed scheduler allocates data-parallel applications without profiling.

Belviranli et al. [11] address the underutilization of heterogeneous resources in "A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures". To achieve short execution times and high throughput, the proposed HDSS approach partitions each job between the CPU and the GPU. HDSS has two stages, profiling and adaptive: in the profiling stage, benchmark operations evaluate each processor's computing performance, while adaptation is achieved through looping operations. Both stages contribute to load balancing across heterogeneous devices. Our approach, however, requires neither job splitting nor kernel code translation.

Workload distribution is crucial in a diversified environment, and Binotto et al. [12] discussed the cost implications of a work scheduling system. The dynamic properties of the kernel code may also affect the scheduling approach. They therefore developed a data-parallel approach to work allocation that uses code instrumentation for some tasks; performance and execution time are then indexed by profiling, and a new application is mapped to the profiled code and scheduled based on its performance on the device.

Gregg et al. [17] profile processing performance and address performance-related concerns when creating code partitions. They developed a dynamic workload partitioning solution that does not require dedicated offline profiling runs or training: the kernel is partitioned based on the capabilities of the CPU, with the partition size determined by profiling historical runs. Comprehensive instrumentation of the kernel space has the potential to improve performance. Load balancing is achieved by allocating small portions to the slower computational processor and large portions to the faster processing unit.

A heterogeneous computing system with parallel computing for distributed systems uses both CPU and GPU computing capabilities. Choi et al. [13] address crucial difficulties in processor selection: they predicted the execution time of each kernel and used it to schedule the program on a CPU or a GPU device. A labelled dataset is used to train the model, whose predictions are then used to map programs to parallel computing machines.

A heterogeneous system combines CPU power with the computational capacity of a homogeneous system. According to Grewe et al. [18], higher performance can only be achieved if the program in question is executed on a suitable processing unit. They therefore advocate exploiting a heterogeneous system by splitting OpenCL programs. Application operations are extracted as features during compilation, and a support vector machine is trained to predict the appropriate device for an application. Their GPU model achieves 91% accuracy, while the CPU model achieves 95% accuracy. In our study, both execution time and energy consumption are used as explicit device selection criteria, so as to reduce both.

Ghose et al. [16] proposed a predictive model, based on previous work [18], for determining performance. They performed a detailed analysis of control-flow divergence and its impact on scheduling, using branch divergence analysis to train the classification model. The model combines a decision tree with a radial basis function network, and each application is characterized as CPU-GPU inclusive, partitioned CPU-GPU, or mixed. Performance is validated using cross-validation: the CPU-GPU inclusive model reached an accuracy of 89%, while the partitioned CPU-GPU and mixed models reached 80.84% and 81.23%, respectively. In this study, our model uses an energy-efficient load balancing strategy to organize a large distributed workload stack with multiple applications.

According to Kofler et al. [25], low throughput may be caused by processor performance, memory availability, and communication latencies between computing resources. The kernel code is divided into dynamic parts for the different processing types, and a neural network is developed to predict device suitability. Their computational framework uses a source code compiler to convert the kernel code into multi-device kernel code, and the neural network is trained on static features plus the data transmission size, achieving a prediction accuracy of 0.87. Our proposed approach balances tasks using both execution time and energy consumption as performance counters on a cluster of heterogeneous devices, and it requires neither code instrumentation nor kernel partitioning.

Wen et al. [37] examined the challenges of using heterogeneous system resources. Application-specific processor selection based on their model helps to increase system throughput and reduce turnaround time. Code characteristics such as the number of instructions, load/store operations, and the input, output, and global work sizes help predict the speedup over the CPU. If the GPU speedup exceeds 4, the program is classified as high speedup; otherwise it is classified as low speedup. High-speedup applications are assigned to the GPU, low-speedup programs to the CPU. OpenCL kernels were trained and assigned using a support vector machine with a radial basis kernel, with the assignment determined by the application's requirements and the device's computational capabilities.

Wen et al. [36] discussed a kernel combination model and presented a predictive approach for scheduling applications by merging two kernels, which helps reduce transfer costs. The kernel features and the competing features are used to train a decision-tree-based classification model, which selects eligible devices (CPU or GPU) based on predicted device affinity. If the model cannot find a partner kernel, it runs the kernel separately to maintain high throughput and then searches for the next kernel combination. Our proposed scheduler, in contrast, assigns devices for kernel-based load balancing based on optimization parameters such as energy, execution time, and throughput.

3.1 Critical Analysis

Scheduling of computational kernels has been explored in the literature, and numerous strategies have been offered [8, 9, 16, 18, 19, 28]. Table 1 summarizes these models by scheduling type; some require both code instrumentation and application profiling [6, 7, 23, 27, 31]. Across the many heterogeneous computers, load imbalance [35] persists. Kernels are assigned to a computing device (CPU or GPU) by the scheduler, with scheduling done according to the workload per device [10, 11, 12]. The low overhead of such scheduling methods helps reduce execution time, but load balancing is not achieved for a heterogeneous system cluster. Supervised learning models have been proposed [16, 18], with kernel computation used for predictive modelling: the OpenCL kernel is assigned to a specific processor via a prediction model, cast as a classification problem in which the learning model discovers a mapping function whose error on unseen data is evaluated by a loss function. No existing method predicts the energy suitability of a kernel for a device without code partitioning, profiling, or indexing an application's execution time. Moreover, no existing technique schedules a cluster of CPUs and GPUs with different hardware characteristics while including load balancing as an evaluation criterion. This study provides a method for allocating applications based on the amount of processing required and the energy efficiency of the processor. To balance the load, the proposed load balancer uses a prioritized list of resources together with energy consumption.

Table 1.
Reference | Scheduling Type | Scheduling Method | Implementation | Architecture Support
Proposed | Static | Job Pool | Runtime System | Any
[23] | Static | Job Pool | Runtime System | Any
[31] | Hybrid | Single job | Library | Any
[19] | Hybrid | Single job | Runtime System | Any
[6] | Hybrid | Single job | Runtime System | Any
[7] | Hybrid | Single job | Runtime System | Any
[27] | Dynamic | Single job | Library | Any
[35] | Hybrid | Job Pool | Runtime System | Any
[21] | Hybrid | Job Pool | Runtime System | Any
The original table additionally compares each technique on load balancing, code instrumentation, resource awareness, application awareness, data-size consideration, multiprocessor support, energy-reduction consideration, data labeling, provenance data usage, code feature extraction, machine learning, and energy suitability.

Table 1. A Summary of Literature Review Techniques


4 METHODOLOGY

This section provides an overview of the development of the model. The workflow, shown in Figure 5, is divided into three parts. In the first stage, the dataset is obtained using a static feature extractor for CPU- and GPU-compatible application features. The second stage is based on a computation filter covering hardware, code, and runtime, as shown in the CPU & GPU Execution and Execution Time blocks of Figure 5. We used the two machines described in Tables 2 and 3, running 137 kernels on two nodes, each with its own CPU and GPU, over different data sizes. After feature extraction, each sample is labelled with the device that gives the lowest execution time and energy consumption, as shown in the Dataset block of Figure 5. As described in the TPOT block of Figure 5, the tree-based pipeline optimization approach is used both for feature reduction and for classification model selection. The machine learning model is then trained using the parameters found by TPOT and the features obtained in the previous phase, and the trained model is used to build a device suitability model in online mode. Each experiment is described in detail below, together with its evaluation results.
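The TPOT stage can be sketched as follows (our reconstruction, not the authors' script; the synthetic dataset is a placeholder for the labelled code-feature matrix described in this section):

```python
# Sketch of the TPOT stage in Figure 5: genetic programming searches over
# preprocessing + classifier pipelines and tunes their hyperparameters.
from tpot import TPOTClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Placeholder for the labelled feature matrix (code features + data size,
# labelled with the energy/time-optimal device).
X, y = make_classification(n_samples=930, n_features=24, n_informative=10,
                           n_classes=6, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

tpot = TPOTClassifier(generations=10, population_size=50, cv=5,
                      scoring="f1_weighted", random_state=0, verbosity=2)
tpot.fit(X_train, y_train)
print("hold-out score:", tpot.score(X_test, y_test))
tpot.export("device_suitability_pipeline.py")  # winning pipeline as sklearn code
```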


Fig. 5. Flowchart of the designed model.


Table 2. GTX 760 Machine Details

Table 3.
Device | CPU | GPU
Architecture | Skylake i7-6700 | GeForce (GT 740)
Total Cores | 4 cores × 8 threads | 384 (CUDA cores)
Memory | 4 GB | 2 GB
Memory Speed | 2.133 GHz | 1.8 Gbps

Table 3. GT 740 Machine Details

4.1 Dataset & Feature Extraction

We used two benchmark suites for the experimental setup, AMD and Polybench [18, 23, 25, 36]. As indicated in Table 5, a total of 155 data-parallel kernel codes are used, and each program is executed with various input sizes, also shown in Table 5. Two CPUs (Haswell 3.2 GHz and Skylake i7-6700 3.4 GHz) and two GPUs (Nvidia GeForce 760 and 740) were utilized. We used the device with minimal execution time and energy as the output label, as specified in the labelling block of Figure 5. We wrote our own LLVM pass for the feature extraction in Table 4. The static code analyzer's objective is to gather information about the kernel code; these groupings of attribute values determine the application's behaviour. The static analyzer is made up of two parts: a clang LLVM parser [26] and a Python script. The OpenCL kernel is first compiled just-in-time (JIT) with clang (the front-end compiler) to guarantee that it is error-free. The intermediate LLVM representation (IR) is then used by clang's LLVM parser to identify features. Using regular expressions, the Python script finds characteristics that are not available or cannot be detected in the LLVM IR. We used 653 (70%) instances for training and 277 (30%) instances for testing.
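As a sketch of the Python half of this analyzer (ours; the regex patterns and feature names are illustrative, not the paper's exact 24-feature set), instruction kinds can be counted directly in the textual LLVM IR:

```python
# Simplified stand-in for the regex-based feature counter over LLVM IR text.
import re

IR_PATTERNS = {
    "loads":     re.compile(r"=\s*load\b"),
    "stores":    re.compile(r"^\s*store\b", re.MULTILINE),
    "float_ops": re.compile(r"=\s*f(add|sub|mul|div)\b"),
    "int_ops":   re.compile(r"=\s*(add|sub|mul|sdiv|udiv)\b"),
    "branches":  re.compile(r"^\s*br\b", re.MULTILINE),
    "barriers":  re.compile(r"call\s+.*barrier"),  # OpenCL work-group barriers
}

def extract_features(ir_text: str) -> dict:
    """Map one kernel's LLVM IR dump to a flat count-based feature vector."""
    return {name: len(pat.findall(ir_text)) for name, pat in IR_PATTERNS.items()}

with open("kernel.ll") as f:   # IR emitted by the clang front end
    print(extract_features(f.read()))
```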


Table 4. Full Features Set


Table 5. Benchmarks Details

4.2 Feature Selection

The selection of key attributes is crucial, whether the features are extracted by specialists or non-specialists. Figure 6 shows the correlation matrix of the code features used, together with their relative importance. We reduced the feature set from 24 to the following 10 features: 1, 3, 5, 6, 9, 12, 16, 20, 21, and 22. The selection criteria, shown in the feature selection block of Figure 5, are high importance and a negative correlation with the other features: strongly correlated features cause the model to over-fit and lose predictive value. As shown in Figure 6, features 23, 24, 12, 16, 8, 22, and 20 have negative correlations. The feature importances support this result by ranking similar features at the top, as shown in Table 6.
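The selection rule can be sketched as follows (our reconstruction: rank by random-forest importance, then greedily skip features strongly correlated with already-kept ones; the 0.9 threshold is an assumption):

```python
# Correlation- and importance-based feature selection sketch.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

def select_features(X: pd.DataFrame, y, corr_threshold=0.9, keep=10):
    # Rank features by random-forest importance.
    importance = (RandomForestClassifier(n_estimators=200, random_state=0)
                  .fit(X, y).feature_importances_)
    ranked = sorted(zip(X.columns, importance), key=lambda t: -t[1])
    corr = X.corr().abs()
    selected = []
    for name, _ in ranked:
        # Skip any feature strongly correlated with one already kept,
        # since redundant features encourage over-fitting.
        if all(corr.loc[name, s] < corr_threshold for s in selected):
            selected.append(name)
        if len(selected) == keep:
            break
    return selected
```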


Fig. 6. Correlation matrix.

Table 6.
Feature | Feature importance | Feature | Feature importance
1 | 0.5456 | 6 | 0.0128
21 | 0.081 | 13 | 0.01
4 | 0.0712 | 9 | 0.00775
18 | 0.0505 | 15 | 0.00396
19 | 0.0393 | 7 | 0.00301
2 | 0.037 | 24 | 0.00267
16 | 0.0285 | 17 | 0.00216
20 | 0.0279 | 22 | 0.00201
8 | 0.0227 | 12 | 0.00189
14 | 0.0188 | 23 | 0.00119
5 | 0.0154 | 3 | 0.000634
11 | 0.0149 | 10 | 1.09E-05

Table 6. The Importance of the Features Mentioned in Table 4


4.3 Data Labeling

Concurrent training runs of the programs extract both code and runtime features. The feature vectors and execution-time information are used to create the dataset; likewise, the suitability classifier is created by combining each feature vector with the appropriate class label. To label the data, all programs are executed on all \( CPU_i \) and \( GPU_k \) devices, and the device with the shortest execution time for an application is selected as its label. The important code features are listed in Table 6, where feature 1, the data size, is critical for both CPU and GPU selection.
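A minimal sketch of this labeling rule (our reading, with energy first and execution time as tie-breaker per Section 4; the dictionary layout and device names are hypothetical):

```python
# Run every kernel on every device, keep the best device as the class label.
def label_kernel(runs: dict) -> str:
    """runs maps a device name, e.g. 'CPU_i' or 'GPU_k', to (energy_J, time_ms).
    min over the tuples picks the lowest energy, breaking ties on time."""
    return min(runs, key=lambda dev: runs[dev])

runs = {"CPU_i": (11.2, 140.0), "GPU_k": (6.4, 35.5)}
print(label_kernel(runs))  # -> 'GPU_k'
```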

4.4 ML-classifier Phase

The last phase requires an ML-based classifier, which is used to determine the most appropriate model for application-device compatibility. The benchmark data is collected with three CPUs and three GPUs; consequently, the multi-class problem has six output classes. Four classification models are used in the designed framework: Random Forest, Gradient Boosting, Tree-based Pipeline Optimization (TPOT), and KNN.

Random Forest: This consists of decision-tree-based models [2, 34], which act as the classifiers in the classification approach. Each tree is assigned a random subset of features, and the class of an unknown instance is predicted by majority voting.

KNN: One of the simplest classification methods is \( \mathbf {K} \)-nearest neighbour (KNN). KNN requires no explicit training phase and is therefore also known as a "lazy learning" classifier [20].

TPOT: TPOT [1, 34] uses genetic programming to generate features, select a machine learning model, and optimize the chosen model's parameters. The hyper-parameter-tuned model is shown in the TPOT-Hyper Parameter Tuning block of Figure 5.

Gradient boosting: Gradient boosting addresses the underlying problem by incorporating gradients of the loss function [14]. Weak learners are added one at a time to minimize the loss, so the ensemble is built by what amounts to gradient descent over the loss function.
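Under one common protocol, the three off-the-shelf models above could be compared as in the following sketch (ours; TPOT is covered separately since it searches pipelines itself, and the synthetic data again stands in for the labelled feature matrix):

```python
# Cross-validated comparison of the off-the-shelf classifiers named above.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=930, n_features=10, n_informative=8,
                           n_classes=6, random_state=0)  # placeholder dataset
models = {
    "random_forest": RandomForestClassifier(n_estimators=300, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
    "knn": KNeighborsClassifier(n_neighbors=5),
}
for name, model in models.items():
    scores = cross_val_score(model, X, y, cv=10, scoring="f1_weighted")
    print(f"{name}: mean weighted F1 = {scores.mean():.3f}")
```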

4.5 Model Training with Testing

After selecting the best classification model and hyperparameters, we train, test, and evaluate the designed system on each extracted feature-subset repository. TPOT is used in the developed model to train and evaluate the 137 benchmark applications. In the experiments, 80% of the data is used to train the model and the remaining 20% is held out as the test set.


5 LOAD BALANCER

This section explains how the load balancer works and presents a mathematical model of the scheduling strategy that illustrates the benefits of the algorithm. Consider \( T = [t_1, t_2, \dots , t_k] \) a set of k submitted applications to be scheduled on a cluster of processors \( P = [P_1, P_2, \dots , P_n] \), each of which is a \( CPU_{i} \) or a \( GPU_{k} \). \( P_{i} \) is the allocation of applications to processor i as predicted by the previously mentioned learned classifier \( Prediction\_model(J) \). The average resource utilization ratio \( AR \) is used to track energy consumption and represents the usage of \( P \), with \( AR_{old} \leftarrow \frac{mean(ready_{time} - M_{i})}{makespan} \), where \( makespan = \max _{\forall i \in \lbrace 1, 2, \dots , n\rbrace } P_{i} \) is the maximum completion time of any processor \( M \). \( \mathbb {S}_{Load} = \frac{\sum _{i=0}^{m} (ET\_P)_{i}}{Total_{ET}} \) denotes the processor load, calculated by dividing the processor's completion time by the total execution time. The \( sel_{processor} \), the processor from which load must migrate, is the one with the maximum load, \( sel_{processor} = \max (\mathbb {S}_{Load}) \). The migration job that must be moved to reduce the load on the \( sel_{processor} \) is \( mig_{job} = J_{min}(sel_{processor}) \). The processor onto which the \( mig_{job} \) is mapped is \( mig_{processor} = {mig_{job}}_{min}(\mathbb {S}_{non\_sel\_processor}) \iff non\_sel\_processor_{load} \lt Id\_load \), where \( Id\_load \) is the total load divided by the number of processors. In other words, \( mig_{processor} \) is chosen from the set \( \mathbb {S}_{non\_sel\_processor} \) as the processor with the minimum completion time for the migrating job, and it is selected if and only if its load is less than \( Id\_load \). The method keeps transferring load to the \( non\_sel\_processor \)s until the load on the \( sel_{processor} \) drops below \( Id\_load \). Each step is described in detail in Algorithms 1 and 2.

The algorithm takes as input the machine learning classification results and the energy consumption (line 1, Algorithm 1). The termination criterion is a convergence factor, defined in Table 7 (line 2, Algorithm 1): the point at which progress towards energy reduction reaches an optimum. The factor is initialized to zero (line 2, Algorithm 1), and if a load-transfer pass over the processors yields no energy reduction, the convergence counter is incremented by one. In our empirical analysis, the optimal point is reached when the convergence value equals half the number of applications in the stack.

Table 7.
Symbol | Description
\( P_{i} \) | Jobs assigned to the \( i^{th} \) processor.
\( ET-job_{P_{i}} \) | Execution time of the jobs on the \( i^{th} \) processor.
\( Total_{EN} \) | Total execution time of the processor.
\( Total-Load \) | Total load of the cluster.
\( Task-map \) | Set of machine-job mappings.
\( Con \) | Convergence criterion: the convergence counter runs up to \( \frac{number\ of\ jobs}{2} \).
\( AR_{Old} \) | Average resource utilization ratio before convergence.
\( AR_{New} \) | Average resource utilization ratio after convergence.
\( mig-Processor \) | A processor from which migration is needed.
\( new-machine \) | A processor that is overloaded.

Table 7. The Acronyms for the Symbols used in Algorithms 1 and 2

The convergence factor helps optimize the performance of the algorithm. The software first obtains the allocation of every processor (line 1, Algorithm 1) based on the minimum power requirement of each task, and we choose the processor with the highest utilization (lines 5-12, Algorithm 1). Algorithm 2 then moves a program from that processor to the second-best processor. The expected energy, the available applications, and the total load are the inputs to Algorithm 2. The migration job is selected as the task requiring the least energy (line 2, Algorithm 2). The job is then transferred to a processor whose load is less than (total load / number of processors) and which needs the least energy to complete the application (second-best energy consumption). If the balancer cannot find such a processor, the last processor with the lowest energy consumption is selected, and the cycle repeats until all application processes have been moved (lines 4-10, Algorithm 2). After the transfer, the selected processor is notified that the corresponding data request has been withdrawn. The loop continues with the remaining applications until the processor's utilization equals the total execution time divided by the number of processors; it then selects another processor and begins migrating applications to it. As a result, some processor data may be omitted from the output of the load allocation. The algorithm returns the new allocation of the \( P_{i} \) to Algorithm 1. When resource utilization improves, the new utilization value replaces the previous one, the processor-data allocation is kept, and the convergence counter is reset to zero (lines 9-20, Algorithm 1). The resource allocation yields the load-balanced application mapping, and this procedure produces the new network mapping output by Algorithm 1.

This procedure continues until the load has been distributed across all machines, after which the resource utilization ratio is computed as \( \frac{mean(ready_{time} - processor)}{makespan} \) and compared against its previous value. If the ratio improves, the convergence counter is reset to zero and the mapping is kept; if there is no improvement, the convergence counter is incremented. In the empirical study we set the convergence limit to half the number of source-application data requests; reaching it indicates that resource usage has been maximized.
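The migration loop of Algorithms 1 and 2 can be sketched as follows. This is our schematic Python rendering of the description above, with hypothetical names; it uses the predicted per-job energy as the cost for both the job choice and the target choice:

```python
# Schematic rendering of the load-balancing heuristic (Algorithms 1-2).
def balance(assign, energy, n_proc):
    """assign: processor -> list of jobs; energy: (job, processor) -> joules."""
    def load(p):
        return sum(energy[(j, p)] for j in assign[p])

    total = sum(load(p) for p in assign)
    ideal = total / n_proc                          # Id_load in the paper
    convergence = 0
    limit = sum(len(jobs) for jobs in assign.values()) // 2

    while convergence < limit:
        src = max(assign, key=load)                 # most loaded (selected) processor
        if load(src) <= ideal or not assign[src]:
            break                                    # already balanced
        job = min(assign[src], key=lambda j: energy[(j, src)])  # cheapest job
        # Candidate targets: processors below the ideal load.
        targets = [p for p in assign if p != src and load(p) < ideal]
        if not targets:
            convergence += 1                         # no progress this pass
            continue
        dst = min(targets, key=lambda p: energy[(job, p)])      # cheapest target
        assign[src].remove(job)
        assign[dst].append(job)
        convergence = 0                              # progress resets the counter
    return assign
```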


6 EXPERIMENTAL RESULTS

The proposed scheduling method was evaluated on a heterogeneous system that included an Intel Core i7-5460 CPU and an NVIDIA GeForce GTX 860 GPU. Table 2 describes the experimental setup used during our analysis. Ubuntu 18.04 was used, and all systems under test were compiled with GCC 5.4.0. We performed an energy analysis to evaluate the effectiveness of the developed method, testing the proposed model with 10-fold cross-validation. For k-fold cross-validation, the original dataset is randomly divided into k equal parts; the approach is validated on one subset while the remaining (k-1) subsets serve as the training set, and the procedure is repeated k times (the fold value), so that each of the k subsets is used exactly once as the validation set [5]. The advantage of this method is that every observation is used for both training and evaluation, and each observation is validated exactly once. The best accuracy over the k folds is reported.
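Spelled out as a sketch (scikit-learn assumed; stratified splitting is our choice, the paper only states k = 10), the protocol guarantees each observation lands in the validation fold exactly once:

```python
# Manual k-fold loop: each fold serves once as the validation set.
from sklearn.base import clone
from sklearn.model_selection import StratifiedKFold

def cross_validate(model, X, y, k=10):
    """X, y: numpy arrays. Returns the mean accuracy over the k folds."""
    scores = []
    splitter = StratifiedKFold(n_splits=k, shuffle=True, random_state=0)
    for train_idx, val_idx in splitter.split(X, y):
        fold_model = clone(model).fit(X[train_idx], y[train_idx])
        scores.append(fold_model.score(X[val_idx], y[val_idx]))
    return sum(scores) / k
```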

6.1 Performance Metrics

Receiver Operating Characteristic (ROC) curves are the most commonly used metric for evaluating ML-based models [4, 5]. As performance measures we used the ROC curve, the true positive rate \( TPR = TP/(TP+FN) \), and the false positive rate \( FPR = FP/(FP+TN) \). The recall is the rate of correct positive classifications, while the FPR is the rate of negative examples incorrectly classified as positive. The precision-recall curve shows the tradeoff between precision and recall, while the ROC curve shows the tradeoff between recall and FPR. Precision refers to the frequency with which relevant results are obtained. To demonstrate the performance of TPOT on the reduced feature set, we use both ROC curves and precision-recall curves.

(1) \( Precision = \frac{True\ Positive}{True\ Positive + False\ Positive} \)

(2) \( Recall = \frac{True\ Positive}{True\ Positive + False\ Negative} \)

(3) \( F\hbox{-}measure = \frac{2 \times Precision \times Recall}{Precision + Recall} \)

The performance of the baseline model is shown in Figure 7. The test accuracy was 0.55, while the training accuracy was 0.66, and the ROC for training was 0.84. A well-performing model shows a curve toward the upper left corner with a low false-positive rate across thresholds; here, the model performed poorly. Consequently, KNN is not the best algorithm for this data.


Fig. 7. The results of KNN model.

The gradient boosting model is shown in Figure 8. Its training and development losses are close to ideal, and the model achieves a Receiver Operating Characteristic (ROC) of 0.84. Both the accuracy and the detection curves evolve over training, indicating that the model learns successfully.


Fig. 8. The results of Gradient boosting model.

In Figure 9(a), the y-axis indicates accuracy and the x-axis the training instances; in Figure 9(b), the y-axis indicates R-squared against training examples; in Figure 9(c), the y-axis indicates precision against recall; and in Figure 9(d), the y-axis indicates true positives against false positives. As Figure 9 shows, the random forest model achieves a high degree of accuracy. This is because the model aggregates weak learners: it consists of individual trees trained on random samples of the training data, and this combination outperforms a single decision tree. The randomness of the training samples and the tuning procedure help counter overfitting. The error is lowest on the training and development sets, and the precision-recall curve in the upper right corner shows good recall and precision with low rates of false positives and false negatives.


Fig. 9. The results of random forest model.

In Figure 10, we use the genetic-algorithm-based TPOT approach to tune the parameters of the classifier. The model uses high-quality parameters that are both interpretable and well matched to the classifier, and its error on the training and development sets is small. The ROC value obtained by the trained model is 0.87, and the true positive rate is relatively high. On the other hand, the model has a poor precision-recall curve and a low R-squared value.


Fig. 10. The results of TPOT model.

Due to data parallelism, the use of GPUs in computer systems has taken GPU applications in a new direction. However, concurrent execution of kernels (GPU sharing between kernels) is not permitted due to architectural constraints, so valuable resources are wasted. To solve this problem, this research proposes a kernel merge strategy to maximize GPU utilization and reduce wasted GPU energy. The machine learning (ML) based kernel merge finds a kernel pair from a batch of submitted tasks and improves resource utilization by combining them. The features were found to contribute to a high F-measure for device selection of 0.87. The drawback of this approach is that the size of the provisioned data must fit within the GPU's global memory.

Table 8 summarizes all results; the gradient boosting model achieved the highest classification score. Precision indicates how accurate the classifier's prediction of energy suitability is. In general, the higher the precision score (here 87%, as indicated in Table 8), the better the suitability classifier can predict whether an application should be assigned to a processor. If the precision is poor, the fusion suitability classifier is inaccurate and cannot determine whether an application is suitable for a device.


Table 8. The Cross Validation Evaluation with Respect to Precision-Recall, ROC Curve, and F1 Score

In ML terms, recall indicates the completeness of the prediction results: the recall value reflects how many of the tasks belonging to a fusion class are correctly identified. A higher recall value (here 87%, as indicated in Tables 8 and 9) means that the suitability classifier correctly predicts applications belonging to the appropriate class, while a low recall indicates that the classifier is not thorough enough and may fail to distinguish applications belonging to the same class.


Table 9. Hold-on-based Evaluation with Respect to Precision-Recall and ROC Curve

Precision quantifies the proportion of useful outcomes (predictive accuracy) rather than irrelevant ones, while recall quantifies sensitivity to the most important outcomes (completeness of prediction). According to Géron [15], it is straightforward to combine precision and recall into a single statistic, the F1 score: the harmonic mean of precision and recall. Compared to the ordinary mean, the harmonic mean gives more weight to lower scores, so a suitability classifier only receives a high F1 score if both its precision and its recall are high. The F1 scale can be read as follows: 1.0 indicates a perfect prediction; 0.9 an excellent one; 0.8 an acceptable one; 0.7 a mediocre one; 0.6 a poor one; and 0.5 means the system is effectively predicting at random. In general, a result below 0.5 indicates a very poor predictor [30]. If the precision and recall of the suitability classifier are low, applications are not optimally matched to processors, which would lengthen job-pool execution times and lower system throughput. In our scenario, recall is 0.88, precision is 0.87, and the F1 score is 0.87.
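Plugging the reported precision and recall into Equation (3) reproduces this score: \( F1 = \frac{2 \times 0.87 \times 0.88}{0.87 + 0.88} = \frac{1.5312}{1.75} \approx 0.87 \).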

6.2 Load Balancer Comparison

We compared the proposed load balancer against the following three scheduling techniques:

(1) CPU-Only: Jobs are sent directly to a given CPU device for execution. This naïve methodology serves as a baseline [24, 35].

(2) GPU-Only: Jobs are transferred directly to the GPU [24, 35]. Programmers tend to map computationally intensive applications straight to GPUs, which leads to underuse of the other available devices.

(3) Device suitability: Jobs are assigned according to the application-aware machine learning classification model.

Using the load balancer together with the classifier (as in Figure 11) for execution time and energy consumption, the model achieves good results with a minimum-completion-time algorithm; resource utilization and execution time verify the performance. Figure 11 shows the experimental results based on the energy consumed by the jobs in the job pool. As shown in Figure 11, the proposed scheduling system consumes 11.36 joules less than the GPU-only or CPU-only scheduling heuristics.


Fig. 11. Comparison of load balancer.


7 CONCLUSION

With the increasing integration of GPUs into computer systems, they have become a viable alternative to CPUs for running data-parallel applications. However, GPUs do not currently allow concurrent execution of applications (GPU sharing between kernels), which can waste GPU resources when an application with a small data size runs on a GPU. OpenCL programs can run on a CPU, GPU, or other supported accelerator, allowing true heterogeneity in the execution of OpenCL applications. An OpenCL application consists of two components, host code and kernel code; the host code always runs on the CPU and manages the overall execution of an OpenCL program.

In contrast, the device code is the data-parallel part of the application and can run on any compatible processor. Programmers often map their OpenCL application kernels to the GPU to take advantage of its higher performance. As more programs migrate to OpenCL to speed up execution, a scheduler is needed to balance the kernels of multiple OpenCL applications (i.e., a job pool of applications) on a heterogeneous CPU-GPU system. Such a scheduler relieves the programmer of choosing the kernel-to-device mapping and determines a device's suitability for running each OpenCL application, resulting in better device utilization and higher job-pool throughput. Due to the architectural differences between a CPU and a GPU, some OpenCL operations are better suited for GPU execution than others; an OpenCL task-pool scheduler should therefore assign each job to a CPU or a GPU according to the device's suitability for the work.

In a heterogeneous system, programmers divide an application according to its needs. Since the input data is difficult to anticipate, this choice is not ideal for a multi-node environment in which each node is responsible for a large number of jobs. Job allocation should be balanced to achieve optimal throughput with minimal energy consumption. This study categorizes OpenCL applications based on their ideal energy consumption. We used a gradient boosting model hyper-tuned with the TPOT approach, selecting features based on correlation analysis and their proportional importance. The classification model predicts energy-efficient processors. We proposed an LLVM-based code feature extraction approach that extracts operational code features at the machine level, with the size of the delivered data as an additional dynamic feature. The model is trained offline and makes online predictions on OpenCL benchmarks. Our model achieved 0.85 ROC, while the load balancer consumed 2.1 times less energy. The proposed approach is also applicable to other complex systems, and model performance may be further improved by evolutionary computation.


8 FUTURE WORK

There are several directions in which this research can progress. Intelligent load balancing, where deep learning and other learning approaches help allocate processes to the "best" devices, is a promising area for future work. We propose to develop a machine-learning-based approach that predicts the percentage of GPU resources (i.e., computational kernels) to allocate to each merged kernel pair in order to utilize GPU resources efficiently, and the effects of merging more than two kernels could also be studied. In addition, as research moves beyond CPU-GPU systems, further strategies for seamless job scheduling will emerge that improve heterogeneous computing beyond the methods explicitly addressed in this study.

REFERENCES

  [1] Ahmed Usman, Aleem Muhammad, Khalid Yasir Noman, Islam Muhammad Arshad, and Iqbal Muhammad Azhar. [n.d.]. RALB-HC: A resource-aware load balancer for heterogeneous cluster. Concurrency and Computation: Practice and Experience ([n.d.]), e5606.
  [2] Ahmed Usman, Liaquat Humera, Ahmed Luqman, and Hussain Syed Jawad. 2019. Suggestion miner at SemEval-2019 task 9: Suggestion detection in online forum using word graph. In The International Workshop on Semantic Evaluation. 1242–1246.
  [3] Ahmed Usman, Lin Jerry Chun-Wei, and Srivastava Gautam. 2022. A ML-based resource utilization OpenCL GPU-kernel fusion model. Sustainable Computing: Informatics and Systems 35 (2022), 100683.
  [4] Ahmed Usman, Lin Jerry Chun-Wei, Srivastava Gautam, and Aleem Muhammad. 2020. A load balance multi-scheduling model for OpenCL kernel tasks in an integrated cluster. Soft Computing (2020), 1–14.
  [5] Ahmed Usman, Lin Jerry Chun-Wei, and Srivastava Gautam. 2021. Network-aware SDN load balancer with deep active learning based intrusion detection model. In 2021 International Joint Conference on Neural Networks (IJCNN'21). IEEE, 1–6.
  [6] Aji Ashwin Mandayam, Peña Antonio J., Balaji Pavan, and Feng Wu-chun. 2015. Automatic command queue scheduling for task-parallel workloads in OpenCL. In IEEE International Conference on Cluster Computing.
  [7] Aji Ashwin M., Peña Antonio J., Balaji Pavan, and Feng Wu-chun. 2016. MultiCL: Enabling automatic scheduling for task-parallel workloads in OpenCL. Parallel Comput. 58 (2016), 37–55.
  [8] Albayrak Omer Erdil, Akturk Ismail, and Ozturk Ozcan. 2012. Effective kernel mapping for OpenCL applications in heterogeneous platforms. In The International Conference on Parallel Processing Workshops. 81–88.
  [9] Augonnet Cédric, Thibault Samuel, Namyst Raymond, and Wacrenier Pierre-André. 2011. StarPU: A unified platform for task scheduling on heterogeneous multicore architectures. Concurrency and Computation: Practice and Experience 23, 2 (2011), 187–198.
  [10] Becchi Michela, Byna Surendra, Cadambi Srihari, and Chakradhar Srimat. 2010. Data-aware scheduling of legacy kernels on heterogeneous platforms with distributed memory. In ACM Symposium on Parallelism in Algorithms and Architectures. 82–91.
  [11] Belviranli Mehmet E., Bhuyan Laxmi N., and Gupta Rajiv. 2013. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Transactions on Architecture and Code Optimization 9, 4 (2013), 57.
  [12] Binotto Alecio P. D., Pereira Carlos E., Kuijper Arjan, Stork Andre, and Fellner Dieter W. 2011. An effective dynamic scheduling runtime and tuning system for heterogeneous multi and many-core desktop platforms. In IEEE International Conference on High Performance Computing and Communications. 78–85.
  [13] Choi Hong Jun, Son Dong Oh, Kang Seung Gu, Kim Jong Myon, Lee Hsien-Hsin, and Kim Cheol Hong. 2013. An efficient scheduling scheme using estimated execution time for heterogeneous computing systems. The Journal of Supercomputing 65, 2 (2013), 886–902.
  [14] Friedman Jerome H. 2002. Stochastic gradient boosting. Computational Statistics & Data Analysis 38, 4 (2002), 367–378.
  [15] Géron Aurélien. 2019. Hands-on Machine Learning with Scikit-Learn, Keras, and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems.
  [16] Ghose Anirban, Dey Soumyajit, Mitra Pabitra, and Chaudhuri Mainak. 2016. Divergence aware automated partitioning of OpenCL workloads. In The India Software Engineering Conference. 131–135.
  [17] Gregg Chris, Boyer Michael, Hazelwood Kim, and Skadron Kevin. 2011. Dynamic heterogeneous scheduling decisions using historical runtime data. In The Workshop on Applications for Multi- and Many-Core Processors.
  [18] Grewe Dominik and O'Boyle Michael F. P. 2011. A static task partitioning approach for heterogeneous systems using OpenCL. In International Conference on Compiler Construction. 286–305.
  [19] Huchant Pierre, Counilh Marie-Christine, and Barthou Denis. 2016. Automatic OpenCL task adaptation for heterogeneous architectures. In European Conference on Parallel Processing. 684–696.
  [20] Ishtiaq Asra, Islam Muhammad Arshad, Iqbal Muhammad Azhar, Aleem Muhammad, and Ahmed Usman. 2019. Graph centrality based spam SMS detection. In The International Bhurban Conference on Applied Sciences and Technology. 629–633.
  21. [21] Khalid Yasir Noman, Aleem Muhammad, Ahmed Usman, Islam Muhammad Arshad, and Iqbal Muhammad Azhar. 2019. Troodon: A machine-learning based load-balancing application scheduler for CPU–GPU system. J. Parallel and Distrib. Comput. (2019).Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. [22] Khalid Yasir Noman, Aleem Muhammad, Ahmed Usman, Prodan Radu, Islam Muhammad Arshad, and Iqbal Muhammad Azhar. 2021. FusionCL: A machine-learning based approach for OpenCL kernel fusion to increase system performance. Computing 103, 10 (2021), 21712202.Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. [23] Khalid Yasir Noman, Aleem Muhammad, Prodan Radu, Iqbal Muhammad Azhar, and Islam Muhammad Arshad. 2018. E-OSched: A load balancing scheduler for heterogeneous multicores. The Journal of Supercomputing 74, 10 (2018), 53995431.Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. [24] Khan Naajil Aamir, Latif Muhammad Bilal, Pervaiz Nida, Baig Mubashir, Khatoon Hasina, Baig Mirza Zaeem, and Burney Atika. 2019. Smart scheduler for CUDA programming in heterogeneous CPU/GPU environment. In The International Conference on Computer Modeling and Simulation. 250253.Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. [25] Kofler Klaus, Grasso Ivan, Cosenza Biagio, and Fahringer Thomas. 2013. An automatic input-sensitive approach for heterogeneous task partitioning. In ACM International Conference on Supercomputing. 149160.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. [26] Lattner Chris. 2008. LLVM and clang: Next generation compiler technology. In The BSD Conference, Vol. 5.Google ScholarGoogle Scholar
  27. [27] Lee Janghaeng, Samadi Mehrzad, and Mahlke Scott. 2015. Orchestrating multiple data-parallel kernels on multiple devices. In The International Conference on Parallel Architecture and Compilation.Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. [28] Luk Chi-Keung, Hong Sunpyo, and Kim Hyesoon. 2009. Qilin: Exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In The Annual IEEE/ACM International Symposium on Microarchitecture. 4555.Google ScholarGoogle Scholar
  29. [29] Lv Zhihan, Lou Ranran, Feng Hailin, Chen Dongliang, and Lv Haibin. 2021. Novel machine learning for big data analytics in intelligent support information management systems. ACM Transactions on Management Information System (TMIS) 13, 1 (2021), 121.Google ScholarGoogle Scholar
  30. [30] Narudin Fairuz Amalina, Feizollah Ali, Anuar Nor Badrul, and Gani Abdullah. 2016. Evaluation of machine learning classifiers for mobile malware detection. Soft Computing 20, 1 (2016), 343357.Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. [31] Pérez Borja, Bosque José Luis, and Beivide Ramón. 2016. Simplifying programming and load balancing of data parallel applications on heterogeneous systems. In Annual Workshop on General Purpose Processing using Graphics Processing Unit. 4251.Google ScholarGoogle Scholar
  32. [32] Sharma Neha V., Yadav Narendra Singh, and Sharma Saurabh. 2022. Machine learning and security in cyber physical systems. In Cyber-Physical Systems. Elsevier, 171187.Google ScholarGoogle ScholarCross RefCross Ref
  33. [33] Stone John E., Gohara David, and Shi Guochun. 2010. OpenCL: A parallel programming standard for heterogeneous computing systems. Computing in Science & Engineering 12, 3 (2010), 66.Google ScholarGoogle ScholarCross RefCross Ref
  34. [34] Tchernykh Andrei, Lozano Luz, Schwiegelshohn Uwe, Bouvry Pascal, Pecero Johnatan E., Nesmachnow Sergio, and Drozdov Alexander Yu. 2016. Online bi-objective scheduling for IaaS clouds ensuring quality of service. Journal of Grid Computing 14, 1 (2016), 522.Google ScholarGoogle ScholarDigital LibraryDigital Library
  35. [35] Wen Yuan and O'Boyle Michael F. P.. 2017. Merge or separate?. In The General Purpose GPUs.Google ScholarGoogle Scholar
  36. [36] Wen Yuan and O’Boyle Michael F. P.. 2017. Merge or separate?: Multi-job scheduling for OpenCL kernels on CPU/GPU platforms. In The General Purpose GPUs. 2231.Google ScholarGoogle Scholar
  37. [37] Wen Yuan, Wang Zheng, and O’Boyle Michael F. P.. 2014. Smart multi-task scheduling for OpenCL programs on CPU/GPU heterogeneous platforms. In The International Conference on High Performance Computing. 110.Google ScholarGoogle ScholarCross RefCross Ref
