1 Introduction

In recent years, the rapid advancement of machine learning has revolutionized various industries and research domains, from healthcare and finance to autonomous systems and beyond. As these applications become increasingly complex, the demand for faster, more efficient training processes has grown. Traditional model training methods often struggle with long processing times, especially when working with large datasets or intricate models. For instance, the training of GPT-3, which has 175 billion parameters, necessitates substantial computational power and time, often requiring weeks to months of training on advanced hardware setups, such as distributed GPU clusters [1]. These delays can slow down iteration cycles, impeding the continuous refinement of machine learning algorithms and ultimately stifling innovation.

Parallelization offers a compelling solution to the challenge of lengthy training times by allowing multiple tasks to be processed at the same time, which speeds up the overall training process [2]. However, using parallel architectures, even in traditional development environments, comes with significant challenges. These include the complexities of coordinating tasks and managing data across different computing resources. In addition, the variety of modern hardware, such as CPUs, GPUs, and specialized accelerators, makes efficient programming and resource management even more difficult.

In machine learning, parallelism is especially important due to the increasing size of datasets and the complexity of models being trained [3]. Although parallel execution can greatly improve performance, managing machine learning workloads across distributed systems remains a difficult task. Even with advances in hardware and algorithms, allocating resources efficiently across different systems is still a major challenge.

Efficient resource allocation is essential for meeting the demands of modern machine learning tasks, and containerization technologies, particularly when combined with Kubernetes, have emerged as powerful tools for addressing this challenge. Kubernetes simplifies the distribution of workloads across multiple nodes, enabling more efficient resource utilization and potentially reducing training times by automating the scaling and scheduling of containers. At the same time, the widespread adoption of machine learning has introduced professionals from diverse disciplines to its use, many with limited computational expertise [4, 5]. This broad accessibility is made possible by high-level tools and platforms that abstract much of the underlying complexity, allowing users to focus on their specific applications. These tools lower the barriers to entry, enabling users to harness powerful machine learning models without extensive programming knowledge or expertise in distributed computing.

This paper presents KubePipe, a high-level tool that addresses the parallelization of machine learning pipelines in Kubernetes environments. Its design allows for the concurrent execution of multiple workflows, reducing processing times and simplifying common tasks. KubePipe is built to scale with modern machine learning requirements and is particularly beneficial for hyperparameter tuning and training multiple models in parallel. By supporting the simultaneous execution of entire pipelines, it helps optimize resource usage and minimize overall execution times. In addition, KubePipe offers fine-grained control over CPU and GPU allocations, enabling efficient distribution of workloads to meet specific task requirements.

The primary contributions of this work are the design, evaluation, and demonstration of KubePipe as a flexible, scalable solution for orchestrating and parallelizing machine learning pipelines across distributed systems. Leveraging Kubernetes, KubePipe abstracts container management, dynamic scaling, and resource allocation, allowing workloads to run seamlessly on diverse hardware architectures including CPUs and GPUs. Notable features include custom scheduling strategies, dynamic resource management, and integration with popular frameworks like scikit-learn, TensorFlow, and PyTorch. By enabling concurrent pipeline execution, it reduces training times and improves resource efficiency, particularly for tasks such as hyperparameter tuning. Moreover, encapsulating applications and dependencies in containers fosters reproducibility and portability, ensuring consistent execution across different platforms and simplifying experiment replication.

In contrast to traditional pipeline execution, KubePipe provides a high-level Python API that automates parallelization with minimal setup. It manages containerized workloads in a way that remains accessible even to users without deep infrastructure expertise. As a direct alternative to conventional methods, KubePipe can be integrated into existing workflows with minimal effort, promoting efficient resource utilization.

An additional feature of KubePipe is its adaptability to heterogeneous computing environments. By abstracting the infrastructure layer, KubePipe facilitates transitions between resource-limited clusters (e.g., Raspberry Pi clusters) and high-performance computing systems, without requiring manual reconfiguration. It manages dependency handling and container deployment automatically, and its modular design allows for the addition of new features or execution strategies as needed. The Python API simplifies usage for less experienced users while still offering advanced control over system resources for those with technical expertise. These characteristics make KubePipe suitable for managing and scaling machine learning workflows across diverse computational platforms.

The structure of this work is divided as follows. In Sect. 2, we provide a brief overview of the current state of parallelization in machine learning pipelines and the use of containerization with Kubernetes. In Sect. 3, we detail the design and implementation of KubePipe, including the key technologies and methodologies employed. In Sect. 4, we explore the core functionality and practical usage of KubePipe, emphasizing how it simplifies the orchestration and management of machine learning pipelines in distributed environments, while highlighting its modularity and ease of extension.

Next, in Sect. 5, we present the experimental results obtained from evaluating KubePipe’s performance and energy efficiency along with the overhead introduced by using this tool. Finally, in Sect. 6, we present the conclusions drawn and discuss potential avenues for future work to further enhance the efficiency and scalability of machine learning pipelines using KubePipe.

2 Related work

Containerization has emerged as a transformative technology in high-performance computing (HPC), offering significant advancements in the parallelization of workloads across multiple containers. This approach enhances the efficiency and flexibility of resource utilization, addressing the growing demands for computational power and data processing in scientific applications [6,7,8].

One of the significant performance benefits of containerization in HPC comes from fine-grained scheduling and resource allocation. Multi-container deployments have been shown to significantly improve the performance of HPC applications by partitioning processes into containers, each constrained to a single NUMA (non-uniform memory access) domain. This technique improves affinity management between processes and the hardware, which is crucial in optimizing performance for multi-core and multi-processor environments [6, 9].

In conjunction with orchestration tools like Kubernetes, the management of containerized HPC workloads becomes more efficient. Fine-grained scheduling within Kubernetes clusters has demonstrated improvements in execution time for HPC applications. Kubernetes not only maximizes resource utilization in multi-container environments but also enables dynamic workload distribution, ensuring scalable and flexible operations [10].

Empirical studies have further validated the minimal overhead of container runtimes. Container runtimes such as Singularity [11] and Charliecloud [12] exhibit negligible performance impact compared to traditional virtualization methods [13]. Containerization’s role in addressing the growing complexity of HPC frameworks is especially important as the field moves toward exascale computing [14, 15].

The combination of containerization, orchestration, and parallelization tools not only enhances the performance of HPC applications but also significantly promotes reproducibility and portability. By encapsulating applications and their dependencies within containers, consistent execution across diverse computing environments is ensured, which is essential for collaborative research [16]. Moreover, containerized deployment workflows have been developed to adapt to various performance and resource constraints, providing the flexibility needed to leverage cloud infrastructures while minimizing performance penalties [17, 18]. This adaptability allows HPC applications to be seamlessly deployed across different environments, including supercomputers and non-HPC systems, thereby enabling researchers to efficiently utilize a wide range of computational resources [19, 20].

In the context of evaluating virtualization and containerization performance, several studies have provided valuable insights. An analysis using Sysbench highlighted the performance variations of guest virtual machines on a VirtualBox hypervisor, demonstrating the impact of virtualization on system resources under identical conditions [21]. Another study assessed the performance of Docker-in-Docker (DinD) containers within microservice-based architectures, revealing that nested containers introduce measurable startup delays and increased memory consumption compared to standard Docker containers, though no significant differences were observed in disk and network input/output performance [22]. These findings emphasize the importance of understanding the performance trade-offs associated with different virtualization and containerization strategies, particularly in high-performance computing environments where resource optimization is critical.

From the perspective of parallelization, a variety of tools and frameworks have been developed to address the challenges of HPC by facilitating the transition from sequential to parallel workflows. OpenMP [23], as a high-level programming model, has become one of the most widely adopted approaches for parallelization. It has evolved to support task parallelism, allowing developers to express parallel computations with simple directives [24]. The integration of OpenMP with container orchestration systems, such as Kubernetes, further enhances the efficiency of parallel workloads by dynamically scaling resources based on computational demands [25]. Additionally, orchestration frameworks like CWL-PLAS facilitate parallel task execution by leveraging resources from multiple hosts, significantly reducing workflow duration [26].

Parallelization tools such as Cetus [27], Par4all [28], Pluto [29], Mallba [30], and DPSKEL [31] have also played a pivotal role in enhancing HPC performance. These tools achieve this by transforming sequential code into parallel code through source-to-source transformations [32, 33]. They simplify the complexities of parallel programming, enabling researchers to focus on their scientific work instead of low-level coding intricacies. By automating the process of parallelization, these tools have significantly accelerated the development cycles for high-performance applications [34].

In the realm of machine learning, several tools have been developed to streamline the orchestration and execution of complex workflows. Notably, Kubeflow pipelines has emerged as a popular solution for managing end-to-end machine learning workflows within containerized environments [35]. It leverages Kubernetes to efficiently scale tasks such as data preprocessing, model training, and evaluation. Similarly, Apache Airflow provides workflow scheduling capabilities for general-purpose pipeline orchestration [36], while MLflow offers experiment tracking and lifecycle management, supporting reproducibility and deployment of machine learning models [37]. Tools such as Nextflow [38], Argo Workflows [39], and Dask [40] further extend the capabilities for parallelizing workflows. Nextflow is widely used for scientific workflows, with built-in support for Kubernetes and containerized environments. Argo Workflows, a Kubernetes-native engine, allows for robust orchestration of parallel tasks, and Dask provides a flexible Python library for distributed parallel computing that can integrate with containerized systems. In addition, AWS SageMaker Pipelines delivers a managed solution for end-to-end machine learning on the Amazon Web Services (AWS) platform, offering integrated features for data preprocessing, model training, and deployment [41]. Azure Machine Learning Pipelines provides similar functionality within the Microsoft Azure ecosystem, enabling users to create and manage machine learning workflows with tight integration into Azure services [42]. Google’s Vertex AI offers a fully managed platform for building, training, and deploying machine learning models, including orchestration of workflows [43]. Furthermore, Domino Data Lab offers an enterprise-focused platform for managing data science workflows with an emphasis on automation and scalability [44]. While these tools are powerful and widely used, they often require significant expertise in infrastructure management, container orchestration, and integration with cloud-specific services.

In contrast, KubePipe simplifies these processes with a high-level Python API that automates the parallelization of machine learning pipelines. Leveraging Kubernetes’ scalability, it manages containerized workloads with minimal setup, making it accessible to non-expert users. Acting as a direct replacement for traditional pipeline execution methods, KubePipe integrates seamlessly into existing workflows, optimizing resource use without requiring deep infrastructure expertise. By bridging the gap between high-performance computing and everyday machine learning tasks, KubePipe offers a scalable, user-friendly solution for parallel execution.

3 KubePipe software architecture

This section outlines the design and implementation of KubePipe, a high-level Python API for parallelizing machine learning pipelines using Kubernetes. The methodology is divided into several subsections, each detailing specific aspects of KubePipe’s architecture, execution flow, integration with machine learning frameworks, and workload distribution. KubePipe is designed to abstract the complexities of Kubernetes, allowing users to run multiple pipelines concurrently without requiring detailed knowledge of container orchestration. As shown in Fig. 1, KubePipe follows a layered architecture, comprising a user layer, an application layer, a pipeline layer, an orchestration layer, and an infrastructure layer.

Each layer in this architecture provides services to the layer above it while hiding the implementation details of the layers below, allowing for a modular and scalable design. This multi-layered approach offers several advantages: it allows for the independent development and optimization of different layers, improving the performance of lower layers without affecting higher ones. Additionally, new elements can be added to any layer transparently, meaning, for example, that the user layer can be extended to accommodate new strategies or functionalities without disrupting the underlying architecture.

At the top of the architecture, the user layer represents the interaction point for machine learning engineers and data scientists. Users interact with KubePipe through a simple Python API, which abstracts the complexities of parallel execution, making it accessible even to non-expert users. This API, provided by the kube_pipe package, allows users to define and submit machine learning pipelines, configure hyperparameter grids, and manage the parallel execution of tasks. Users interact with the API through the KubePipe object, which acts as a reference to the KubePipe controller. Pipelines can be defined as lists or as objects of one of the pipeline implementations, such as PipelineMinio, PipelineHTTP, or PipelineTCP. Once defined, users submit the pipelines by calling the fit() method with the training data. While the technical details of parallelism and Kubernetes (such as resource allocation, container deployment, and scheduling) are handled automatically by KubePipe, advanced users retain the flexibility to customize and manage these aspects, offering greater control over the execution process.
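As a brief illustration, a minimal interaction with this API might look as follows; the import paths, constructor signatures, and method names here are assumed from the description above, and complete examples are given in Sect. 4.2.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

from kube_pipe import KubePipe, PipelineMinio, PipelineHTTP  # assumed import paths

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Each pipeline is submitted as an independent containerized workload.
kp = KubePipe(
    PipelineMinio(StandardScaler(), LogisticRegression()),
    PipelineHTTP(StandardScaler(), LogisticRegression(C=0.1)),
)
kp.fit(X, y)  # both pipelines are trained concurrently on the cluster
```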

The application layer contains the core KubePipe Controller, which acts as the central hub of the system. The KubePipe Controller receives pipeline definitions from the user via the Python API and interacts with Kubernetes to orchestrate the execution of these pipelines. It is responsible for submitting jobs to Kubernetes, managing resources, and tracking the status of pipeline executions. The controller ensures that machine learning pipelines are deployed across the Kubernetes cluster efficiently, with support for both CPU and GPU resources. Additionally, the controller enables users to execute multiple pipelines in parallel, which is crucial for tasks such as hyperparameter tuning or training multiple models simultaneously. The modular design allows KubePipe to execute machine learning workflows using different backends, ensuring flexibility and adaptability. The pipelines are executed in isolated containers, providing resource isolation and ensuring that parallel workflows do not interfere with each other. This layer abstracts the specifics of container execution, enabling the seamless parallelization of machine learning pipelines.

In the pipeline layer, KubePipe supports various pipeline execution strategies. These strategies are encapsulated within different classes such as PipelineMinio, PipelineTCP, and PipelineHTTP, each defining how the pipeline is executed. Additionally, KubePipe offers powerful tools for model and hyperparameter optimization, including KubeGridSearch and KubeRandomSearch. These tools enable users to efficiently search for the best model configurations by parallelizing the evaluation of multiple hyperparameter sets across Kubernetes nodes. KubeGridSearch performs an exhaustive search across all combinations, while KubeRandomSearch explores random subsets of the parameter space, making it suitable for larger search spaces. This extensible and modular architecture allows users to easily add new execution or search strategies, providing flexibility for advanced machine learning workflows.

Below the pipeline layer is the orchestration layer, which interfaces directly with Kubernetes through its API. This layer is responsible for submitting and managing containers within the Kubernetes environment. It works in close coordination with the pipeline layer, receiving execution strategies defined at the pipeline level and translating them into concrete resource management tasks. Kubernetes handles the scheduling and distribution of containers across the available nodes in the cluster, ensuring optimal resource usage and scalability. The orchestration layer abstracts the interaction with Kubernetes, allowing KubePipe to dynamically scale pipelines based on the available infrastructure while ensuring seamless communication with the pipeline layer to maintain execution flow.

Finally, the infrastructure layer represents the physical or virtual resources where the machine learning pipelines are executed. This layer includes Kubernetes nodes, pods, and containers that run the pipelines in a distributed environment. It interacts directly with the orchestration layer, which sends resource allocation and scheduling instructions based on the pipeline execution strategies. Kubernetes ensures that resources such as CPU, memory, and GPUs are efficiently allocated, and it provides fault tolerance and scaling as required. By managing the infrastructure layer entirely through Kubernetes, KubePipe focuses on orchestrating the execution of machine learning tasks without needing to interact directly with the underlying hardware.

Fig. 1 Layered architecture of KubePipe

3.1 KubePipe components

KubePipe, while functioning primarily as a Python library, integrates several Kubernetes components to effectively manage and execute machine learning pipelines. All of these components are automatically deployed within the Kubernetes cluster during the installation of KubePipe, ensuring a seamless setup for users. The following provides an overview of the essential components that KubePipe interacts with throughout the pipeline lifecycle.

  • KubePipe Image Creator: When a user submits a pipeline that includes dependencies not present in any of the existing images, KubePipe automatically launches the Image Creator pod to build the necessary Docker image. This pod dynamically generates a Dockerfile, installs the required dependencies, and pushes the image to the private registry for future use. The process is fully automated and requires no user intervention. Once the image is successfully created and pushed to the registry, the Image Creator pod is deleted, as it is ephemeral and only exists for the duration of the image-building process. A new image is typically created when the user changes the version of a library in their environment or introduces a new library (e.g., switching from scikit-learn to TensorFlow), as the image replicates the versions and dependencies from the user’s environment. A simplified sketch of this Dockerfile-generation step is given after this list.

  • Private Image Registry: KubePipe uses a private image registry within the Kubernetes cluster. When a pipeline is submitted, KubePipe analyzes its dependencies and, if necessary, automatically generates a Docker image using the Image Creator pod. The image is stored in the registry and reused for future executions, reducing redundancy. The private image registry is a persistent component that remains available throughout the lifecycle of the system.

  • Minio Object Storage: For pipelines handling large datasets or requiring intermediate storage, KubePipe integrates with Minio [45], an object storage service deployed within the cluster. Minio ensures efficient data transfer between stages and provides persistent storage for pipelines, particularly those handling data-intensive tasks. This component is continuously available and does not get deleted after execution.

  • Kubernetes Namespace: KubePipe operates within its own namespace (typically kubepipe), ensuring resource isolation and independent management of its components. This helps prevent interference with other workloads and simplifies resource cleanup after execution. The namespace persists throughout the operation of KubePipe and is not deleted after tasks are completed, maintaining the integrity of its isolated environment.
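As referenced in the Image Creator description, the Dockerfile-generation step can be pictured roughly as follows. This is a simplified sketch of the underlying idea, not KubePipe's actual implementation, and the base image name is illustrative.

```python
# Simplified sketch: pin the packages installed in the user's environment so the
# resulting container image replicates it (illustrative, not KubePipe's actual code).
from importlib.metadata import distributions

def build_dockerfile(base_image: str = "python:3.10-slim") -> str:
    pins = " ".join(
        sorted(f"{d.metadata['Name']}=={d.version}" for d in distributions())
    )
    return (
        f"FROM {base_image}\n"
        f"RUN pip install --no-cache-dir {pins}\n"
    )

print(build_dockerfile())
```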

3.2 Pipeline communication strategies

KubePipe is designed to handle various communication strategies between the machine learning pipelines and the Kubernetes pods that execute them. This flexibility is implemented using the Strategy design pattern, allowing KubePipe to easily extend its functionality by adding new types of communication pipelines. Currently, KubePipe supports several pipeline types, each handling the transfer of data and models between the pods and the storage systems in different ways. Each pipeline can be executed independently, and a single KubePipe execution can combine multiple communication strategies, allowing different pipelines to use different methods of data transfer within the same workflow.

  • PipelineMinio: This pipeline type uses Minio for data transfer between the pods and the storage system. In this approach, training data, models, and any intermediate outputs are stored in Minio, and the pods retrieve and save data by interacting with the Minio bucket. This method is particularly useful for handling large datasets and ensuring persistent storage throughout the pipeline execution.

  • PipelineHTTP: The HTTP-based pipeline sets up an HTTP server within each pod, allowing data to be transferred to and from the pods via HTTP requests. This method is lightweight compared to Minio, but may not be optimal for very large datasets. It is best suited for scenarios where quick and direct communication is needed between the pipeline execution environment and the storage.

  • PipelineTCP: The TCP pipeline communicates using direct TCP socket connections between the pods and the storage system. This method provides lower latency than HTTP and is more efficient for transferring large amounts of data. However, it requires more complex socket management. TCP is particularly beneficial for workloads where low-latency data transfer is critical.

Each pipeline type abstracts away the complexities of managing data transfer between the storage systems and the pods executing the machine learning pipelines. This design allows KubePipe to be easily extended with new communication strategies by simply implementing new pipeline classes that follow the established strategy pattern. By decoupling the communication logic from the core pipeline execution, KubePipe ensures that the system remains flexible and scalable. Furthermore, the ability to combine different pipeline types within a single execution enables users to tailor the communication strategy to the specific needs of each pipeline, optimizing resource usage and performance.
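As a rough illustration of this extension point, a new communication strategy could be added along the following lines; the class and method names shown here are assumptions rather than the actual kube_pipe internals.

```python
# Illustrative sketch of the Strategy pattern used for communication pipelines;
# the interface is assumed, not KubePipe's actual class hierarchy.
from abc import ABC, abstractmethod

class BasePipeline(ABC):
    """Strategy interface: how data and fitted models move between client and pods."""

    def __init__(self, *steps):
        self.steps = steps  # scikit-learn-style sequence of transformers and an estimator

    @abstractmethod
    def upload_data(self, X, y):
        """Make the training data reachable from the worker pod."""

    @abstractmethod
    def download_model(self):
        """Retrieve the fitted model once the worker pod finishes."""

class PipelineSharedVolume(BasePipeline):
    """Hypothetical new strategy that exchanges artifacts through a shared volume."""

    def __init__(self, *steps, mount_path="/data"):
        super().__init__(*steps)
        self.mount_path = mount_path

    def upload_data(self, X, y):
        ...  # serialize X and y to the shared volume

    def download_model(self):
        ...  # deserialize the fitted model from the shared volume
```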

3.3 Parameter search with KubePipe

One of the core capabilities of KubePipe is its support for model and parameter search, which is a crucial task in machine learning workflows. Optimizing model performance often involves tuning hyperparameters, and KubePipe simplifies this process by parallelizing the execution of multiple configurations using its built-in search strategies. Two of the primary tools for this task in KubePipe are KubeGridSearch and KubeRandomSearch.

  • KubeGridSearch: This method is based on the exhaustive search approach commonly found in machine learning libraries like scikit-learn. For a given model (or estimator), KubeGridSearch evaluates all combinations of hyperparameters defined in a parameter grid. KubePipe leverages Kubernetes to execute these combinations in parallel across multiple nodes, significantly reducing the overall time required to find the optimal hyperparameter configuration.

  • KubeRandomSearch: This tool implements a random search strategy for hyperparameter optimization. Instead of evaluating every possible combination of parameters, it samples a random subset of the parameter space. This method is more efficient in cases where the parameter space is large, allowing the user to explore a broader range of configurations without the computational cost of an exhaustive search.

Both KubeGridSearch and KubeRandomSearch take full advantage of KubePipe’s parallel execution framework, enabling users to scale their hyperparameter tuning process across multiple nodes. The flexibility of KubePipe’s architecture allows these searches to be combined with any pipeline type (such as PipelineMinio, PipelineHTTP, or PipelineTCP), further optimizing resource usage and speeding up the search process. Additionally, KubePipe is designed to support the incorporation of more advanced search strategies, such as heuristic or metaheuristic-based methods, allowing for intelligent optimization beyond traditional grid or random searches.
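As a brief illustration of the search interface, a random search over a discrete parameter space might be launched as follows; the constructor arguments and sampling semantics are assumptions modeled on scikit-learn's RandomizedSearchCV.

```python
from sklearn.datasets import load_digits
from sklearn.svm import SVC

from kube_pipe import KubePipe, KubeRandomSearch  # assumed import paths

X, y = load_digits(return_X_y=True)

param_space = {"C": [0.01, 0.1, 1, 10, 100], "gamma": [1e-4, 1e-3, 1e-2, 1e-1]}

# Sample 10 random configurations instead of evaluating all 20 combinations;
# each sampled configuration runs in its own container.
search = KubePipe(KubeRandomSearch(SVC(), param_space, n_iter=10))
search.fit(X, y)
```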

3.4 Cluster abstraction

KubePipe provides a unified interface that abstracts the underlying Kubernetes cluster infrastructure, allowing users to focus on defining and executing their workflows without needing to manage cluster-specific configurations. This abstraction simplifies the deployment process by ensuring that pipelines can be reused across different Kubernetes clusters with minimal modification.

One of the key advantages of Kubernetes is its standardized approach to configuring and creating clusters, which remains consistent regardless of the underlying hardware or architecture. Whether deploying to a high-performance computing environment with powerful nodes or to low-power devices such as Raspberry Pis, the fundamental steps to set up a Kubernetes cluster (initializing the control plane, joining worker nodes, and configuring networking) remain largely the same. This consistency across diverse environments means that users can rely on identical tools and commands, ensuring a familiar and reliable experience every time.

With KubePipe, users can write pipeline definitions that are portable across all of these environments, relying on Kubernetes to handle the distribution of tasks and the allocation of resources, including CPU, memory, and GPU, as long as those resources are available in the cluster. Because Kubernetes manages these resources uniformly across different architectures, KubePipe seamlessly integrates them into the pipeline once specified. This design streamlines execution, enabling workloads to run effectively on anything from high-performance computing systems to clusters of Raspberry Pis.

By abstracting the underlying Kubernetes mechanics, KubePipe reduces complexity and enhances usability. Users can maintain a single, consistent codebase and scale workflows or migrate them to new clusters as needed. This approach ensures that users can leverage the full potential of Kubernetes, providing a smooth path to efficiently execute workflows in a wide range of hardware configurations without requiring intricate knowledge of each cluster’s internal setup.

4 Functionality and usage

In this section, we explore the core functionality and practical usage of KubePipe, focusing on how it simplifies the management and orchestration of machine learning pipelines in distributed environments.

4.1 Execution flow

As shown in Fig. 2, the execution flow of KubePipe begins when a user defines and submits machine learning pipelines via the Python API using the KubePipe Controller. Each pipeline consists of a sequence of machine learning operations, such as data preprocessing and model training, which are executed concurrently across Kubernetes nodes. KubePipe handles the orchestration of these tasks by managing container deployment, monitoring, and resource allocation, abstracting the complexity from the user. Once the pipelines are submitted, KubePipe analyzes the dependencies required by the pipeline functions, ensuring that the same versions of the libraries used in the user’s environment are included. If necessary, KubePipe generates a Docker image that contains all required dependencies by dynamically creating a Dockerfile and launching an auxiliary pod to build the image. This image is then pushed to a private image registry within the Kubernetes cluster, ensuring it can be reused for future executions. After the image is successfully created or found in the local registry, KubePipe communicates with the Kubernetes API to deploy the pipeline as containers within pods across the available nodes. Each pipeline is executed in isolation, ensuring that resource usage is optimized and parallel execution is efficient. While the Kubernetes API handles the overall scheduling of pods, KubePipe allows for additional control by enabling users to specify custom scheduling strategies.

For example, KubePipe can direct pipelines to specific nodes, such as those equipped with GPUs, to ensure that computational resources are allocated according to the task’s requirements, maximizing efficiency and performance. During the execution, KubePipe continuously monitors the status of the running pipelines. The KubePipe Controller communicates with the Kubernetes API to gather updates on the progress of each pod, ensuring that the user is informed about the status of their jobs, whether they are still running, have completed successfully, or have failed. Once the execution is complete, KubePipe aggregates the results, such as model performance metrics or predictions, and returns them to the user. After the execution concludes, KubePipe performs a cleanup process, removing temporary files and containers created during the execution to ensure that the cluster remains free of unnecessary data. The generated images, however, remain in the private registry for future use, streamlining subsequent executions of similar pipelines.

Fig. 2 Execution flow of KubePipe

4.2 Example usage

The following example (Listing 1) demonstrates how KubePipe is used to train and evaluate machine learning models in parallel across Kubernetes nodes. This example highlights the tool’s flexibility in combining different communication strategies such as PipelineHTTP, PipelineTCP, and PipelineMinio, while showcasing its capability to perform hyperparameter tuning using KubeGridSearch. In this example, the Iris dataset [46] is loaded and split into training and testing sets using train_test_split from scikit-learn. The machine learning models being trained are AdaBoostClassifier, LogisticRegression, and RandomForestClassifier, each paired with preprocessing steps like StandardScaler, OneHotEncoder, and MinMaxScaler. Additionally, hyperparameter tuning is applied to AdaBoostClassifier using KubeGridSearch.

The first three pipelines in the listing are executed as independent containers within the Kubernetes cluster. Each container runs its assigned pipeline in isolation, with Kubernetes orchestrating their deployment and ensuring efficient resource allocation across the cluster. The fourth pipeline is a special case that uses KubeGridSearch to perform hyperparameter optimization. The provided param_grid_adaboost defines six unique parameter combinations, and for each combination, a separate container is launched.

This results in a total of nine containers running concurrently (three standalone pipelines and six from the grid search). Kubernetes dynamically schedules these containers across available nodes in the cluster, ensuring optimal utilization of computational resources such as CPUs, GPUs, and memory. Each container operates independently, executing its specific task without interference from others.

The configuration for each pipeline is straightforward and intuitive, as shown in the code below. Programmers can define pipelines and their corresponding configurations directly within the KubePipe object. KubePipe automates tasks such as the allocation of hardware resources, while still allowing manual specification for fine-grained control when needed.

Listing 1 Parallel execution of scikit-learn pipelines and hyperparameter search with KubePipe
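A minimal sketch of such a configuration is shown below; the kube_pipe constructor signatures, the score() call, and the concrete hyperparameter values are assumptions based on the API description in Sect. 3 rather than the exact contents of Listing 1.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.ensemble import AdaBoostClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression

from kube_pipe import KubePipe, KubeGridSearch  # assumed import paths
from kube_pipe import PipelineHTTP, PipelineTCP, PipelineMinio

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Six parameter combinations (3 x 2), matching the grid size described in the text;
# the concrete values are illustrative.
param_grid_adaboost = {
    "n_estimators": [50, 100, 200],
    "learning_rate": [0.5, 1.0],
}

pipelines = KubePipe(
    # Three standalone pipelines, each executed in its own container.
    PipelineHTTP(StandardScaler(), LogisticRegression()),
    PipelineTCP(OneHotEncoder(handle_unknown="ignore"), RandomForestClassifier()),
    PipelineMinio(MinMaxScaler(), AdaBoostClassifier()),
    # Grid search over AdaBoostClassifier: one container per parameter combination.
    KubeGridSearch(StandardScaler(), AdaBoostClassifier(), param_grid=param_grid_adaboost),
)

pipelines.fit(X_train, y_train)         # nine containers run concurrently
print(pipelines.score(X_test, y_test))  # aggregated evaluation results (assumed API)
```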

To further showcase KubePipe’s versatility, we also provide an example using TensorFlow to train a neural network (Listing 2). This example highlights how KubePipe can manage more complex machine learning workflows, such as deep learning models, within Kubernetes. The neural network is trained on the CIFAR-10 dataset, a common benchmark for image classification tasks, using hyperparameter tuning with KubePipe to optimize the model across different nodes.

Listing 2 Training and tuning a TensorFlow CNN on CIFAR-10 with KubePipe
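A comparable sketch for this TensorFlow example is given below; the kube_pipe signatures, layer sizes, epoch count, and parameter routing are assumptions rather than the exact contents of Listing 2.

```python
import tensorflow as tf
from tensorflow.keras import layers, models
from scikeras.wrappers import KerasRegressor
from sklearn.preprocessing import MinMaxScaler

from kube_pipe import KubePipe, KubeGridSearch  # assumed import paths

(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
# Flatten the images so MinMaxScaler can operate on 2D data; the model reshapes them back.
x_train = x_train.reshape(len(x_train), -1).astype("float32")
x_test = x_test.reshape(len(x_test), -1).astype("float32")

def build_model(learning_rate=0.001):
    """Simple CNN for CIFAR-10: two convolutional blocks, dense layers, softmax output."""
    model = models.Sequential([
        layers.Input(shape=(32 * 32 * 3,)),
        layers.Reshape((32, 32, 3)),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(10, activation="softmax"),
    ])
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=learning_rate),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )
    return model

# Nine parameter combinations (3 batch sizes x 3 learning rates); routing of
# "model__learning_rate" to build_model follows scikeras conventions and is
# assumed to be preserved by KubeGridSearch.
param_grid = {
    "batch_size": [32, 64, 128],
    "model__learning_rate": [0.01, 0.001, 0.0001],
}

estimator = KerasRegressor(model=build_model, epochs=5, verbose=0)

search = KubePipe(
    KubeGridSearch(MinMaxScaler(), estimator, param_grid=param_grid),
)
search.fit(x_train, y_train)         # nine pipelines run in parallel
print(search.score(x_test, y_test))  # score of the best configuration (assumed API)
```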

In Listing 2, the CIFAR-10 dataset is first loaded and split into training and test sets. The build_model function defines a simple convolutional neural network (CNN) model. The model is then wrapped in a KerasRegressor to integrate with scikit-learn, allowing for easy hyperparameter tuning using KubeGridSearch.

A grid search is performed to optimize the batch size and learning rate for the model, using a Kubernetes pipeline managed by KubePipe. The pipeline is defined with a parameter grid containing nine unique combinations of batch size and learning rate, so the grid search generates a total of nine pipelines, each corresponding to one combination from the grid. Each pipeline includes a MinMaxScaler to normalize the data before feeding it into the model. These pipelines are executed in parallel within the Kubernetes cluster, leveraging the distributed nature of KubePipe to efficiently allocate resources across available nodes.

Once the models are trained, KubePipe evaluates them on the test data and returns the scores, identifying the model configuration that achieved the best result. This example demonstrates KubePipe’s capability to seamlessly manage data preprocessing, hyperparameter tuning, and model evaluation for deep learning models in a distributed Kubernetes environment.

4.3 Integration with machine learning frameworks

KubePipe is designed to integrate seamlessly with machine learning frameworks, allowing users to parallelize and distribute machine learning pipelines with minimal modifications to their existing workflows.

At its core, KubePipe is compatible with scikit-learn-like estimators, making it an ideal choice for users already familiar with scikit-learn and its API. This ensures that pipelines can be executed in parallel across a Kubernetes cluster without the need for extensive changes. The tool’s integration with existing frameworks leverages the familiarity and widespread use of these APIs, allowing users to adopt it without significant overhead or the need to learn new paradigms.

In addition to supporting scikit-learn [47] estimators, KubePipe extends its capabilities to deep learning frameworks such as TensorFlow [48] and PyTorch [49]. It achieves this by utilizing community-maintained wrappers like Scikeras [50] and Skorch [51]. These wrappers enable neural network models from TensorFlow and PyTorch to function as scikit-learn estimators, allowing users to integrate deep learning tasks into their machine learning pipelines with ease. This design provides a consistent interface for both traditional machine learning and deep learning models, simplifying workflow development.

The use of containerization in KubePipe further enhances its adaptability to changes in these frameworks. Each container replicates the user’s environment, including specific versions of libraries and frameworks. This process is automated by the tool: when a change in the version of any library in the user’s environment is detected, KubePipe automatically creates a new container image with the updated library versions. As a result, updates or changes to frameworks like scikit-learn, TensorFlow, or PyTorch are managed seamlessly, without requiring manual intervention or modifications to the tool itself. By encapsulating dependencies within containers, KubePipe isolates workflows from system-level changes, ensuring consistent behavior and reproducibility even as underlying frameworks evolve.

This design ensures that KubePipe remains compatible with current machine learning and deep learning frameworks, supporting a broad range of models and tasks. The combination of integration with widely used APIs and adaptability through containerization allows users to benefit from a scalable and distributed pipeline execution environment while minimizing disruptions from framework updates or changes.

4.4 Deployment and installation

KubePipe has been designed for straightforward installation as a Python package and seamless integration with Kubernetes clusters. To begin, an operational Kubernetes cluster must be in place, and the local environment must reference a valid Kubernetes context. The installation process is as simple as executing the following command:

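```bash
pip install kube_pipe
```

The distribution name used here is an assumption inferred from the kube_pipe module used throughout this paper; the published package name may differ.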

Once installed, KubePipe automatically generates the necessary Kubernetes objects (e.g., namespaces, deployments, services) to orchestrate pipelines, requiring no additional manual configuration apart from a functional Kubernetes environment. Pipelines can be submitted immediately, leveraging the cluster’s computational resources for efficient parallel processing.

This streamlined setup contrasts with traditional pipeline execution tools like Dask, Apache Spark, or TensorFlow’s distributed runtime. For example:

  • Dask [40]: While Dask simplifies Python-native parallel computing, deploying a distributed Dask cluster typically requires configuring a scheduler, setting up worker nodes, and ensuring consistent environments across all nodes. Dependencies usually must be installed and maintained manually.

  • Apache Spark [52]: Spark is a powerful tool for distributed data processing, but its deployment often involves setting up a cluster manager, managing resource allocation, and ensuring all nodes have access to the same code and dependencies.

  • TensorFlow Distributed [48]: TensorFlow’s distributed runtime provides flexibility for machine learning tasks but demands careful orchestration of tasks across workers, parameter servers, and devices. This setup often includes configuring environment variables and manually specifying the cluster topology. Furthermore, TensorFlow Distributed is limited to TensorFlow-based models and workflows, making it unsuitable for other types of workflows.

In contrast, KubePipe abstracts much of this complexity by leveraging Kubernetes’ built-in capabilities for container orchestration and resource management. Users can focus on defining pipelines without worrying about the underlying deployment mechanics, making it particularly suitable for heterogeneous environments or workflows requiring minimal setup.

5 KubePipe performance evaluation

This section presents the results of our benchmark tests, evaluating KubePipe’s performance across different machine learning pipelines and parallel architectures. The experiments were conducted on two distinct computational setups: a cluster of Raspberry Pi Compute Module 4 units and a traditional high-performance computing (HPC) environment.

Both setups were used to assess the efficiency of KubePipe in managing containerized tasks, with a focus on energy consumption, resource utilization, execution time, and the overhead introduced by container orchestration and virtualization layers.

5.1 Experimental setup

Our benchmarking tests were carried out on two different hardware configurations: a Raspberry Pi-based cluster and an HPC machine equipped with multi-core CPUs and GPUs. These setups were chosen to evaluate the performance of KubePipe under resource-constrained conditions as well as high-performance environments. Table 1 summarizes the hardware specifications for both configurations.

Table 1 Hardware specifications for Raspberry Pi cluster and HPC machine

The Raspberry Pi cluster consisted of 8 Compute Module 4 (CM4) units distributed across two Turing Pi 2 boards, while the HPC machine was equipped with a multi-core Intel Xeon CPU and an NVIDIA Tesla GPU. In both setups, KubePipe was deployed to manage and orchestrate containerized machine learning tasks.

5.2 Kubernetes and container management

To manage container orchestration across both the Raspberry Pi cluster and the HPC environment, a lightweight Kubernetes distribution, K3s [53], was utilized. This distribution was chosen for its low resource overhead, making it particularly suitable for the Raspberry Pi setup, while still being robust enough to manage the larger, more powerful HPC machine. The lightweight nature of K3s allowed for efficient management of containerized tasks, facilitating parallel execution and optimizing resource utilization across nodes.

Each task in the pipeline was encapsulated within a container, enabling parallel execution across the 8 Raspberry Pi nodes and leveraging the multi-core CPU and GPU resources in the HPC environment. The scheduling strategy used was round-robin, ensuring even task distribution across available nodes, enhancing performance and resource utilization.

5.3 Energy monitoring and measurement

To accurately measure energy consumption, different techniques were employed for the two setups. For the Raspberry Pi cluster, the AccelPowerCape module, integrated with a BeagleBone Black [54], was used to measure power consumption directly from the Compute Modules. The AccelPowerCape employs INA219 sensors, which provide precise readings of current, voltage, and power consumption [55]. Power measurements were taken exclusively from the compute modules by bridging the 12V lines from the ATX power supply through the AccelPowerCape, ensuring accurate data [56].

In the HPC setup, energy consumption was monitored using the EML library [57], which provides access to energy metrics through the Running Average Power Limit (RAPL) interface for CPU energy and the NVIDIA Management Library (NVML) for GPU energy consumption. These tools provided a comprehensive view of energy usage across both setups.

Energy consumption data were collected continuously during the execution of tasks and were processed using the pmlib [58] library to provide real-time monitoring of power usage throughout the pipeline lifecycle.

5.4 Dataset and model specifications

The benchmark tests were conducted on several datasets, each representative of different machine learning tasks. All datasets were sourced from the TensorFlow Datasets library. The datasets used include:

  • CIFAR-10 [59]: 60,000 32x32 color images for 10-class classification.

  • MNIST [60]: 70,000 grayscale images of handwritten digits (0–9).

  • Fashion MNIST [61]: 70,000 grayscale images of fashion items for 10-class classification.

  • IMDB [62]: 50,000 movie reviews for binary sentiment classification.

  • AG News [63]: 120,000 text samples for 4-class news topic classification.

The models used in this study were tailored to the nature of the dataset. Convolutional neural networks (CNNs) were employed for image classification tasks (CIFAR-10, MNIST, and Fashion MNIST). These CNN models consisted of two convolutional layers followed by max-pooling layers, with fully connected dense layers and a softmax output for classification. For text classification tasks (IMDB and AG News), long short-term memory (LSTM) networks were utilized. The LSTM models included an embedding layer followed by LSTM units, dense layers, and a softmax output layer for classification.
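As a hedged sketch of these two model families, and with layer widths, embedding dimensions, and other values not stated above chosen purely for illustration, the model builders could look as follows.

```python
from tensorflow.keras import layers, models

def build_cnn(input_shape, num_classes):
    """CNN for the image datasets: two convolutional blocks, dense layers, softmax output."""
    return models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(64, (3, 3), activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])

def build_lstm(vocab_size, seq_length, num_classes):
    """LSTM for the text datasets: embedding layer, LSTM units, dense layers, softmax output."""
    return models.Sequential([
        layers.Input(shape=(seq_length,)),
        layers.Embedding(vocab_size, 64),
        layers.LSTM(64),
        layers.Dense(64, activation="relu"),
        layers.Dense(num_classes, activation="softmax"),
    ])
```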

5.5 Hyperparameter tuning and pipeline execution

The pipelines for each task were managed using KubePipe and executed across the two Kubernetes-managed clusters described earlier, each with a different architecture and characteristics. Hyperparameter tuning was performed using KubeGridSearch, which evaluated combinations of optimizer types, learning rates, and decay settings.

The following hyperparameters were evaluated:

  • Batch Size: Fixed at 64.

  • Optimizers: Adam, RMSprop, SGD.

  • Learning Rates: 0.01, 0.001, 0.0001.

  • Decay: 0.0, 0.01, 0.001, 0.0001.

This setup resulted in a total of 36 executions, as it involved evaluating all combinations of 3 optimizers, 3 learning rates, and 4 decay values, with the batch size fixed at 64.
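Expressed as a KubeGridSearch parameter grid, and with parameter names and optimizer routing assumed rather than taken from the original code, this search space is:

```python
# 1 x 3 x 3 x 4 = 36 configurations, each executed as an independent pipeline.
param_grid = {
    "batch_size": [64],                                  # fixed
    "model__optimizer": ["adam", "rmsprop", "sgd"],
    "model__learning_rate": [0.01, 0.001, 0.0001],
    "model__decay": [0.0, 0.01, 0.001, 0.0001],
}
```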

The pipelines were executed in parallel across the Raspberry Pi cluster and the HPC machine, with compute times, energy consumption, and resource utilization monitored during execution using a profiler that aggregates resource usage metrics, including CPU, GPU, and RAM. The results of the grid search were analyzed across datasets to evaluate the effectiveness of the pipeline configurations.

To further optimize the pipeline execution, we scaled the number of concurrent pipelines to find the optimal configuration for our problem and clusters. On the Turing Pi cluster, we scaled up to 16 concurrent pipelines due to memory limitations, while the HPC cluster allowed scaling up to 36, the maximum number of executions.

5.6 Results and discussion

This section presents a detailed analysis of the benchmark results comparing the performance of the Raspberry Pi and HPC clusters. These tests evaluate the efficiency of each cluster in processing various datasets, with a focus on the impact of KubePipe’s parallelization across different workloads and hardware configurations.

5.6.1 Elapsed time

The elapsed times obtained for both the Raspberry Pi and HPC clusters are shown in Fig. 3 and in more detail in Table 2. The figure illustrates the time required to process different machine learning datasets with varying numbers of concurrent pipelines.

Fig. 3 Comparison of elapsed time (seconds) between Raspberry Pi and HPC clusters

Table 2 Comparison of elapsed time (seconds) between Raspberry Pi and HPC clusters

The results demonstrate clear differences in processing times across datasets and clusters. The AG News dataset, being the largest and most complex, consistently took the longest to process due to the computational demands of text operations such as tokenization and embedding. In contrast, smaller text datasets like IMDB required less time, while image-based datasets (e.g., CIFAR-10, MNIST, Fashion MNIST) had significantly shorter processing times due to their simpler inputs and operations.

The differences in processing times can also be attributed to the underlying models used. Text classification tasks, which employed LSTM models, are inherently more computationally intensive compared to image classification tasks that utilized CNNs. LSTMs require sequential processing and rely heavily on embedding operations and memory retention across sequences, leading to increased resource demands. Conversely, CNNs process images in a parallelized manner, which is computationally efficient and aligns well with GPU acceleration, thereby reducing processing times.

KubePipe significantly improved performance by orchestrating parallel execution and optimizing resource utilization. On the Raspberry Pi cluster, where hardware resources are more constrained, KubePipe’s ability to manage multiple tasks simultaneously led to substantial reductions in elapsed time, especially for larger datasets like AG News. The introduction of more concurrent pipelines resulted in steep performance improvements, allowing even computationally intensive tasks to complete efficiently.

On the HPC cluster, the impact of KubePipe was less pronounced but still valuable. Equipped with advanced computational resources, the HPC cluster achieved lower processing times overall. KubePipe’s parallelism ensured efficient scaling, particularly for large datasets, although the incremental gains diminished as the hardware approached full resource utilization.

Across both clusters, increasing the number of concurrent pipelines steadily decreased elapsed times. However, as concurrency reached a threshold where resources were fully utilized, the marginal benefits began to plateau. This effect was more evident on the HPC cluster due to its optimized hardware, but it also appeared on the Raspberry Pi cluster as task parallelism reached its limit.

5.6.2 Resource usage metrics

To investigate the asymptotic behavior noted in Fig. 3, we collected detailed profiling data for CPU, GPU, RAM, and VRAM usage while processing the IMDB dataset on the HPC cluster under varying concurrency levels. This dataset serves as a medium-sized text workload, which involves enough preprocessing (tokenization, embedding) to stress system resources without being as large as AG News.

We used a Python-based monitoring script that periodically samples CPU utilization and RAM consumption (via psutil), and we relied on nvidia-smi polls at the same interval to track GPU utilization and VRAM usage.
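A minimal sketch of such a sampling loop is shown below; the sampling interval, output fields, and file name are illustrative rather than those of the actual script.

```python
import csv
import subprocess
import time

import psutil

INTERVAL = 1.0  # seconds between samples (illustrative)

def sample_gpu():
    """Query GPU utilization (%) and VRAM usage (MiB) through nvidia-smi."""
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
         "--format=csv,noheader,nounits"],
        text=True,
    )
    gpu_util, vram_mb = out.strip().split(", ")
    return float(gpu_util), float(vram_mb)

with open("resource_usage.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["timestamp", "cpu_percent", "ram_mb", "gpu_percent", "vram_mb"])
    while True:  # stopped externally once the benchmark run finishes
        cpu = psutil.cpu_percent(interval=None)
        ram_mb = psutil.virtual_memory().used / 2**20
        gpu_util, vram_mb = sample_gpu()
        writer.writerow([time.time(), cpu, ram_mb, gpu_util, vram_mb])
        f.flush()
        time.sleep(INTERVAL)
```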

Fig. 4 Resource usage over time for the IMDB dataset on the HPC cluster with varying numbers of concurrent pipelines (CP)

Figure 4a, b illustrates how these metrics evolve as concurrency changes. The x-axis in each figure corresponds to sampling intervals over time, while the left and right y-axes in each subplot show utilization percentages and memory usage (MB), respectively. Shaded regions in the background indicate different concurrency levels (e.g., concurrent_pipelines=1, concurrent_pipelines=8, etc.).

In Fig. 4a, CPU utilization increases steadily and begins to plateau around 16 concurrent pipelines, suggesting that the system’s cores and memory bandwidth are approaching full capacity. RAM usage similarly peaks in these higher-concurrency intervals, reflecting the memory-intensive nature of text tokenization and embedding.

Figure 4b shows that GPU utilization, while relevant for certain operations (e.g., embeddings or inference), is not the primary bottleneck in this setup. Given the limited VRAM capacity and configuration on our HPC system, only one pipeline is able to run on the GPU at any given time. Consequently, VRAM usage remains relatively stable, while overall throughput becomes primarily constrained by CPU availability and RAM bandwidth once concurrency exceeds roughly 8–16 pipelines. This resource saturation aligns with the asymptotic performance trends observed in Sect. 5.6.1.

5.6.3 Average power consumption

The average power consumption results for the Raspberry Pi and HPC clusters are shown in Fig. 5 and Table 3. The figure illustrates how the power consumption changes with an increasing number of concurrent pipelines for different machine learning datasets.

Fig. 5 Comparison of average power consumption (Watts) between Raspberry Pi and HPC clusters

Table 3 Comparison of average power consumption (Watts) between Raspberry Pi and HPC clusters

From the graphs, we observe that the average power consumption steadily increases as the number of concurrent pipelines rises for both the Raspberry Pi and HPC clusters. However, there are clear differences in scale and behavior between the two environments.

In the Raspberry Pi cluster, the average power consumption starts at around 12 W with 1 pipeline and increases to just over 20 W with 16 pipelines. This gradual increase in power consumption reflects the cluster’s limited processing capabilities and smaller node resources. The power consumption plateaus at higher numbers of pipelines, indicating that the cluster’s power draw saturates as its resources become fully utilized.

In the HPC cluster, the power consumption starts at a higher baseline of around 120 W and rises to over 190 W as the number of pipelines increases to 36. This sharp rise is due to the significantly more powerful hardware in the HPC setup, which consumes more power as it utilizes its available resources. The difference in power consumption between the datasets is more pronounced on the HPC cluster, likely because of the larger variations in computational requirements across the datasets.

For both clusters, the AG News and IMDB datasets show relatively higher power consumption, which can be attributed to the more complex text processing tasks involved. On the other hand, datasets like CIFAR-10 and MNIST show lower power consumption, reflecting the relatively lightweight nature of image-based classification tasks.

As the number of pipelines increases, we see that the power consumption scales in a more linear fashion on the HPC cluster, while on the Raspberry Pi cluster, the scaling is more modest. This suggests that the HPC cluster has more room to leverage its resources effectively as the workload increases.

5.6.4 Energy consumed

The energy consumption results for both the Raspberry Pi and HPC clusters are shown in Fig. 6 as well as in Table 4. These metrics represent the total platform energy consumption, including all components of the machine during the execution of the test.

Fig. 6 Comparison of energy consumed (megajoules) between Raspberry Pi and HPC clusters

Table 4 Comparison of energy consumed (joules) between Raspberry Pi and HPC clusters

From the graphs, it is evident that the AG News dataset consumes significantly more energy compared to the other datasets across both clusters. This is expected given the size and complexity of the AG News dataset, which involves large text processing tasks and takes the longest time to complete. The IMDB dataset follows a similar trend, though its smaller size leads to relatively lower energy consumption.

For the Raspberry Pi cluster, the energy consumption starts high with fewer pipelines, particularly for AG News, where the initial energy consumption exceeds \(1.0 \times 10^6\) joules. As more pipelines are added, the energy consumption decreases gradually, reflecting more efficient resource usage as concurrency increases. However, the Raspberry Pi cluster’s constrained resources limit the reduction in energy consumption, and it plateaus as the number of pipelines reaches 16. Despite taking more time to complete tasks, the Raspberry Pi cluster consistently consumed less energy than the HPC cluster due to its lower power usage.

On the HPC cluster, the energy consumption also starts high but becomes more efficient as the number of pipelines increases. The AG News dataset initially consumes over \(2.6 \times 10^6\) joules, but this drops sharply as concurrency increases. By the time 36 pipelines are running, the energy consumption has reduced significantly, indicating the HPC cluster’s ability to better distribute workload across its more powerful resources.

The other datasets, such as CIFAR-10, MNIST, and Fashion MNIST, consume far less energy on both clusters. This can be attributed to the combination of simpler data processing requirements and the efficiency of CNNs in handling image classification tasks. Text-based datasets, in contrast, require more resources due to the inherent complexity of LSTM-based models and the sequential nature of text processing.

This reduction in energy consumption is aided by KubePipe’s parallelization, which ensures efficient resource usage by distributing tasks across multiple nodes as the number of pipelines increases.

5.6.5 Energy efficiency

The energy efficiency results for both the Raspberry Pi and HPC clusters are shown in Fig. 7 and Table 5. The pipelines-per-joule metric measures how many pipelines can be executed per joule of energy consumed, providing a direct indicator of energy efficiency: a higher value means more pipelines can be executed for the same amount of energy.
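A minimal sketch of this metric follows; the function name and example figures are illustrative and are not taken from the reported results.

```python
# Minimal sketch: the pipelines-per-joule metric described above.
# n_pipelines and total_energy_joules are measured for each configuration.
def pipelines_per_joule(n_pipelines: int, total_energy_joules: float) -> float:
    # Higher values mean more pipelines completed for the same energy budget.
    return n_pipelines / total_energy_joules

# Example: 36 concurrent pipelines consuming 1.2e6 J in total.
print(pipelines_per_joule(36, 1.2e6))  # ~3.0e-05 pipelines per joule
```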

Fig. 7 Comparison of energy efficiency (pipelines per joule) between Raspberry Pi and HPC clusters

Table 5 Comparison of energy efficiency (pipelines per joule) between Raspberry Pi and HPC clusters

The data from the Raspberry Pi cluster show competitive pipelines-per-joule values compared to the HPC cluster, particularly at lower concurrency levels, highlighting the benefits of its lower-power ARM architecture. However, as concurrency increases, the HPC cluster surpasses the Raspberry Pi cluster in energy efficiency. This advantage may be attributed to the HPC cluster’s ability to leverage more powerful hardware, including GPUs, which can accelerate processing and improve energy efficiency for certain workloads.

The variation in pipelines-per-joule metrics is also influenced by the models employed for different datasets. For more complex datasets, such as AG News and IMDB, which are processed using LSTM models, pipelines-per-joule values are lower across both clusters due to the increased computational demands and longer execution times inherent to LSTMs. At lower concurrency levels, the Raspberry Pi cluster remains competitive, reflecting its energy-efficient design. However, as the number of concurrent pipelines grows, the HPC cluster demonstrates its ability to scale energy efficiency more effectively, likely due to its superior hardware capabilities and ability to distribute the computational load efficiently.

Simpler datasets, such as CIFAR-10 and MNIST, which are processed using CNN models, exhibit higher pipelines-per-joule values across both clusters. CNNs, optimized for parallel processing, require less computational power and consume less energy compared to LSTMs. Notably, the HPC cluster continues to improve its energy efficiency at higher concurrency levels, showcasing its scalability and enhanced performance with increased workloads.

By leveraging KubePipe to optimize resource usage across multiple pipelines, both clusters demonstrate significant improvements in energy efficiency as concurrency increases. The Raspberry Pi cluster reaches a plateau at moderate concurrency levels, likely due to hardware limitations. In contrast, the HPC cluster continues to scale effectively with higher concurrency, reflecting its capacity to handle larger workloads and its ability to utilize GPUs for additional efficiency. Nonetheless, both clusters showcase the effectiveness of KubePipe in enhancing resource utilization and energy efficiency within their respective operational limits.

5.6.6 Absolute overhead

The absolute overhead results for the Raspberry Pi and HPC clusters are shown in Fig. 8 and Table 6. Since each pipeline runs inside containers, the main source of overhead in KubePipe comes from the time required to create, initialize, and orchestrate these containers. Absolute overhead represents the time spent on container creation and orchestration during the execution of pipelines. To calculate it, we measured the time both inside and outside the container for each pipeline. We then identified the critical path, that is, the pipeline that took the longest and therefore determines the total duration under parallel execution. The difference between the actual total time and the time spent inside the container along that path gives the time used for container creation, which we refer to as the absolute overhead.
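The following sketch illustrates this calculation, assuming that for each pipeline both the wall-clock time observed from outside and the time spent inside the container are recorded; the data structure and function names are illustrative, not part of KubePipe’s API.

```python
# Minimal sketch of the absolute-overhead calculation described above.
# For each pipeline, two measured durations (in seconds) are assumed:
#   outside_s: wall-clock time from submission to completion, seen from outside
#   inside_s:  time spent executing inside the container
def absolute_overhead(pipeline_timings):
    # The critical path is the pipeline with the longest wall-clock time;
    # under parallel execution it determines the total elapsed time.
    critical = max(pipeline_timings, key=lambda t: t["outside_s"])
    # Time on that path not spent inside the container is attributed to
    # container creation and orchestration, i.e., the absolute overhead.
    return critical["outside_s"] - critical["inside_s"]

timings = [
    {"outside_s": 980.0, "inside_s": 955.0},
    {"outside_s": 1010.0, "inside_s": 990.0},  # critical path -> overhead = 20 s
]
print(absolute_overhead(timings))
```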

Fig. 8 Comparison of absolute overhead (seconds) between Raspberry Pi and HPC clusters

Table 6 Comparison of absolute overhead (seconds) between Raspberry Pi and HPC clusters

From the analysis, we can observe that the absolute overhead decreases significantly as the number of concurrent pipelines increases for both clusters. This trend is consistent across all datasets, as higher concurrency allows better resource utilization and reduces the time that container creation adds to the critical path.

One of the key strengths of KubePipe is its ability to reuse container images for pipelines that have the same dependencies, such as those being executed for the datasets used in this study. This results in similar absolute overhead across different datasets, as the container creation process is largely standardized and repeated for pipelines with the same environment setup. As a result, the absolute overhead is not strongly influenced by the specific nature of the dataset (text-based vs. image-based), but rather by the number of pipelines and the efficiency of container reuse.
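As an illustration of this idea (and not KubePipe’s actual implementation), a deterministic image tag could be derived from a pipeline’s dependency list, so that pipelines with identical dependencies resolve to, and therefore reuse, the same container image; all names below are hypothetical.

```python
# Illustrative sketch only: map a pipeline's dependency list to a stable image tag
# so that environments with the same dependencies build the image once and reuse it.
import hashlib

def image_tag_for(dependencies, base="ml-pipeline"):
    # Sorting ensures the same set of dependencies always yields the same tag.
    digest = hashlib.sha256("\n".join(sorted(dependencies)).encode()).hexdigest()[:12]
    return f"{base}:{digest}"

# Two pipelines with the same environment map to the same image tag.
print(image_tag_for(["tensorflow==2.15", "scikit-learn==1.4"]))
print(image_tag_for(["scikit-learn==1.4", "tensorflow==2.15"]))  # identical tag
```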

In the Raspberry Pi cluster, the initial overhead is quite high, especially for the AG News dataset, which exhibits the largest overhead among the datasets. However, as the number of pipelines increases, the overhead drops, indicating that the cluster becomes more efficient at managing container creation as more pipelines are executed in parallel. This improvement is further enhanced by KubePipe’s ability to parallelize not only task execution but also container creation, which is further streamlined by reusing existing container images.

In contrast, the HPC cluster starts with a lower absolute overhead than the Raspberry Pi cluster. For AG News, the initial overhead is around 200 s and decreases sharply as the number of pipelines increases; with 36 concurrent pipelines, it falls to approximately 20 s. This highlights the superior container management capabilities of the HPC cluster, which handles container creation more efficiently even at lower levels of concurrency. KubePipe’s reuse of container images in the HPC environment ensures that container creation overhead is minimized as the number of pipelines increases.

It is worth noting that while the absolute overhead can become significant when executing multiple pipelines, particularly due to the time required for container creation, the critical factor is how much of the total elapsed time is spent on this process. Higher absolute overhead can lead to longer overall execution times, but its impact on performance trends depends on its proportion relative to the total runtime. This relationship is explored further in the next section, which examines how overhead correlates with elapsed time and contributes to overall performance observations.

5.6.7 Percentage overhead

The percentage overhead results for both the Raspberry Pi and HPC clusters are shown in Fig. 9 and Table 7. The percentage overhead represents the ratio of the time spent in container creation relative to the total execution time, expressed as a percentage.
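A minimal sketch of this metric, using illustrative values consistent with the absolute-overhead example above:

```python
# Minimal sketch: percentage overhead as defined above.
# overhead_s is the absolute overhead (container creation and orchestration time)
# and total_s is the total elapsed execution time; names are illustrative.
def percentage_overhead(overhead_s: float, total_s: float) -> float:
    return 100.0 * overhead_s / total_s

# Example: 20 s of container creation within a 1010 s run -> ~2% overhead.
print(f"{percentage_overhead(20.0, 1010.0):.1f}%")
```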

Fig. 9 Comparison of percentage overhead between Raspberry Pi and HPC clusters

Table 7 Comparison of percentage overhead between Raspberry Pi and HPC clusters

For the Raspberry Pi cluster, we observe that the percentage overhead decreases steadily as the number of concurrent pipelines increases. Initially, with two pipelines, the overhead can be as high as 6% for datasets like MNIST and Fashion MNIST. As the number of pipelines increases to 16, the percentage overhead reduces to around 3–4%, indicating that the cluster is becoming more efficient as it handles more concurrent tasks.

This reduction in percentage overhead is particularly noticeable for text-based datasets, such as IMDB and AG News, which initially experience higher overhead due to the complexity of handling text data within containers. As the number of pipelines increases, these datasets see a significant drop in overhead, reflecting improved resource utilization in the Raspberry Pi cluster.

In contrast, image-based datasets like CIFAR-10 and MNIST exhibit consistently low percentage overheads throughout the tests. This is expected, as these datasets involve simpler operations that require less container setup time relative to the overall execution time.

For the HPC cluster, however, the percentage overhead tends to be higher compared to the Raspberry Pi cluster. This behavior can be attributed to the fact that tasks on the HPC cluster often run significantly faster, particularly when leveraging GPU acceleration. Since the percentage overhead represents the ratio of the time spent in container creation relative to the total execution time, a shorter execution time makes the container creation time comparatively larger. For example, text-based datasets such as IMDB and AG News exhibit higher overheads initially, and while the overhead decreases with more concurrent pipelines, it remains higher than on the Raspberry Pi cluster for most configurations.

It is important to note that the overhead is not particularly large and remains manageable, especially for larger datasets like AG News. In such cases, spending a small amount of time on container creation is negligible when compared to the overall compute time required to process the data. This makes the overhead introduced by KubePipe acceptable, even for more resource-intensive tasks.

6 Conclusion

In this study, we presented and evaluated KubePipe, a high-level parallelization tool for containerized machine learning workflows, across two distinct computing architectures: a resource-constrained Raspberry Pi cluster and a high-performance computing cluster. Our results demonstrate that KubePipe is a powerful and flexible tool, offering significant benefits in terms of ease of use, efficient resource management, and parallel task execution.

KubePipe simplifies the parallelization process through virtualization and container orchestration, allowing even non-expert users to harness the power of Kubernetes for managing complex machine learning pipelines. By abstracting the complexities of container creation, scheduling, and task distribution, KubePipe enables users to easily deploy parallel workloads without requiring deep knowledge of distributed computing or containerization. Moreover, KubePipe’s ability to handle heterogeneity between different computing environments allows users to switch between clusters without having to manually manage dependencies. This seamless transition between heterogeneous systems is a significant advantage, as dealing with dependencies across different environments often takes more time than the development of the models themselves.

From our results, several conclusions can be drawn. The absolute and percentage overheads associated with container creation were found to be manageable across both clusters. KubePipe’s ability to reuse container images for pipelines with the same dependencies minimized the time spent on container setup. This overhead became negligible for larger datasets like AG News, where the overall compute time far outweighed the container creation time.

On the Raspberry Pi cluster, KubePipe was instrumental in reducing elapsed time and improving efficiency as the number of concurrent pipelines increased. Despite the hardware limitations, KubePipe’s parallelization capabilities significantly improved resource utilization, particularly for more complex datasets. On the HPC cluster, KubePipe scaled effectively, further reducing overhead and energy consumption. The cluster’s superior resources, combined with KubePipe’s orchestration, ensured that even large workloads were handled efficiently.

KubePipe also contributed to energy efficiency. As the number of pipelines increased, energy consumption decreased due to more efficient workload distribution. This effect was more pronounced on the Raspberry Pi cluster, where KubePipe enabled better resource utilization, making the cluster more energy-efficient even when handling large datasets.

Text-based datasets, such as AG News and IMDB, showed higher energy consumption and overheads compared to image-based datasets like CIFAR-10 and MNIST. However, KubePipe’s parallelization helped to mitigate these impacts by reducing elapsed time and distributing the workloads more evenly.

In conclusion, KubePipe proves to be a robust and flexible tool for parallelizing machine learning workflows. It allows non-expert users to efficiently manage and scale containerized tasks across a variety of computing architectures, from small clusters such as Raspberry Pi to large-scale HPC systems, while handling complex workloads, minimizing overhead, improving energy efficiency, and simplifying the transition between heterogeneous environments.