1 Introduction

Cardiac diseases are the leading cause of death worldwide. The mechanical contraction of the cardiac tissue that ejects blood is preceded and triggered by a fast electrical wave, i.e., the propagation of the so-called action potential (AP). Abnormal changes in the electrical properties of cardiac cells, as well as in the structure of the heart tissue, can lead to life-threatening arrhythmias and fibrillation [1].

A widely used approach to study heart behavior is in silico experimentation, i.e., mathematical models that reproduce the heart's function through computational tools. Generally, these models are described by differential equations, with the cell's electrical and mechanical activity represented by ordinary differential equations (ODEs). Cardiac cells are connected to each other by gap junctions, which create channels between neighboring cells and allow the flow of electrical current in the form of ions. An electrically stimulated cell transmits the electrical signal to its neighbors, allowing an electrical wave to propagate through the whole heart and trigger contraction. The phenomena of electrical wave propagation and cardiac contraction are usually modeled using partial differential equations (PDEs) [2].

Although the PDEs used for cardiac modeling are able to perform realistic tissue simulations, they involve the simulation of thousands of cells, which makes their numerical solution quite challenging. This is an issue for clinical software, which may demand both accurate results and real-time simulations. Therefore, effort has been made to speed up PDE solvers with parallel computing [3], as well as to emulate PDE simulations at a lower computational cost with different techniques [4].

In previous works, we proposed a model of cardiac electromechanics based on cellular automata and mass–spring systems (MSS), named FisioPacer [5–7]. We then proposed a genetic algorithm (GA) to automatically tune the MSS parameters so that the model reproduces accurate simulations from partial differential equation (PDE) simulators [8]. In this work, our GA was adapted to find MSS parameters that reproduce a cardiac cycle of the left ventricle, using meshes generated from MRI images. Due to the high computational cost of the ventricle simulation, we also present a scheme for tackling the workload balance of the GA, developed for shared-memory parallel computers and implemented with C/C++ and OpenMP. The idea is to exploit two levels of parallelism in an automatic way to improve processor efficiency. Our method reduced the execution time by up to 12 %, which, in our case, represents a reduction of 36 h of computation. The parameters found by the GA were used to run FisioPacer and were analyzed against several ventricle measurements. The text is organized as follows: Sect. 2 presents how the left ventricle geometry was extracted from MRI data; Sect. 3 presents the electromechanical cardiac model based on mass–spring systems and cellular automata; Sect. 4 presents the GA used to determine parameters with which the simulator reproduces a cardiac cycle; Sect. 5 describes a scheme for adapting the GA parallelism level; and finally Sect. 6 discusses the results of the GA parallelization and the quality of the simulation obtained with the parameters found by the GA.

2 Methods part I: left ventricle mesh generation

In order to represent the left ventricle (LV) geometry, we used a simplified technique based on ellipsoid equations [9]. The equation parameters, such as the LV wall thickness and longitudinal height, were extracted from an open dataset provided by Koch et al. [11], which contains images of healthy human volunteers obtained by cine magnetic resonance imaging (MRI) of the heart. We used one image set that represents the LV geometry at complete relaxation (diastole), from which we extracted the necessary information and then applied the techniques presented by Rodrigues et al. [10], resulting in a volumetric mesh with 19,912 tetrahedral elements. The cardiac tissue has three axes of anisotropy, as shown in Fig. 1a, where the dark-gray lines are the three axes. The first direction is the fiber axis, the sheet direction is orthogonal to the fiber, and the third direction, called sheet-normal, is normal to the plane formed by the fiber and sheet axes. To represent the LV microstructure, the fiber f, sheet s and normal n directions were defined for each mesh element. Following [12], the sheet direction s was given by the gradient direction of the scalar field that defines the transmural distance. The fiber direction f was then defined to be orthogonal to s, and it rotates linearly from the endocardium to the epicardium surface. Different rotation angles can be found in the literature; in this work, we used \(\alpha _{\text {endo}}=-60^\circ \) and \(\alpha _{\text {epi}}=65^\circ \). The normal direction n was obtained by the cross product between f and s. The mesh is illustrated in Fig. 1b and the fiber configuration in Fig. 1c. For further details, the reader is encouraged to read [10].
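To make this rule concrete, the sketch below (a simplified illustration, not the exact procedure of [10, 12] or the FisioPacer code) assigns f, s and n to one element. It assumes that the gradient of the transmural distance field, an apex-to-base direction and a normalized transmural depth \(d \in [0,1]\) are already available per element; all names are illustrative.

```cpp
// Simplified sketch of a rule-based fiber assignment for one element.
// Assumed inputs (names are illustrative):
//   gradT    - gradient of the transmural distance field at the element
//   apexBase - apex-to-base (longitudinal) direction of the LV
//   d        - normalized transmural depth in [0,1] (0 = endo, 1 = epi)
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
const double kPi = 3.14159265358979323846;

static Vec3 cross(const Vec3& a, const Vec3& b) {
    return { a[1] * b[2] - a[2] * b[1],
             a[2] * b[0] - a[0] * b[2],
             a[0] * b[1] - a[1] * b[0] };
}

static Vec3 normalize(const Vec3& v) {
    const double n = std::sqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    return { v[0] / n, v[1] / n, v[2] / n };
}

struct Microstructure { Vec3 f, s, n; };

Microstructure elementDirections(const Vec3& gradT, const Vec3& apexBase, double d) {
    const double alphaEndo = -60.0 * kPi / 180.0;  // angles used in this work
    const double alphaEpi  =  65.0 * kPi / 180.0;
    const double alpha = alphaEndo + d * (alphaEpi - alphaEndo);  // linear variation

    const Vec3 s   = normalize(gradT);               // sheet: transmural gradient
    const Vec3 cir = normalize(cross(apexBase, s));  // circumferential, orthogonal to s
    const Vec3 lon = cross(s, cir);                  // local longitudinal, orthogonal to s

    // Fiber lies in the plane orthogonal to s, at angle alpha from the circumferential axis.
    const Vec3 f = normalize({ std::cos(alpha) * cir[0] + std::sin(alpha) * lon[0],
                               std::cos(alpha) * cir[1] + std::sin(alpha) * lon[1],
                               std::cos(alpha) * cir[2] + std::sin(alpha) * lon[2] });
    const Vec3 n = cross(f, s);                      // sheet-normal
    return { f, s, n };
}
```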

Fig. 1 Meshes and fibers. a Fiber directions in cardiac tissue. b LV diastolic mesh. c Fiber configuration

3 Methods part II: discrete model

This section presents the electromechanical model used by FisioPacer. The electrical model uses a cellular automaton (CA) to represent the action potential (AP) propagation over the tissue. The mechanical contraction is modeled by a mass–spring system (MSS) coupled with the electrical model.

3.1 Modeling action potential propagation with cellular automaton

A cellular automaton is a model of a spatially distributed process that can be used to simulate various real-world processes. A 2-D cellular automaton consists of a two-dimensional lattice of cells in which all cells are updated synchronously at each discrete time step. The cells are updated according to rules that depend on the local neighborhood of each cell.

In this work, the CA states are related to the action potential (AP) and to the force development in a cell. The CA is built on the idea that a single cell becomes excited if its electrical potential exceeds a given threshold. Once excited, it can trigger the excitation of neighboring cells. In this manner, an electrical stimulus propagates through the CA grid as the time steps are computed. The CA values of AP and force mimic previously computed data, obtained via simulation of accurate differential equation models. The AP was divided into five states that represent the different physiological stages of the AP [14], as shown in Fig. 2a. The active force was divided into four states, which depend on the AP automaton (Fig. 2b). In this manner, the force starts some milliseconds after the AP, reflecting the physiological delay between electrical activation and cell contraction. Further details were comprehensively described in our previous works [5–7].
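A minimal sketch of such a synchronous, threshold-style automaton is given below. The five states mirror the description above, but the state durations, the excitation rule and all names are placeholders rather than the values used by FisioPacer, which derives them from precomputed AP data [5–7].

```cpp
// Minimal sketch of a synchronous AP automaton (placeholder durations and rule).
#include <cstddef>
#include <vector>

enum class ApState { Resting, Stimulated, Upstroke, Plateau, Repolarizing };

struct Cell {
    ApState state = ApState::Resting;
    double timeInState = 0.0;        // ms spent in the current state
    std::vector<int> neighbors;      // indices of adjacent cells
};

void updateCA(std::vector<Cell>& cells, double dt) {
    std::vector<ApState> next(cells.size());
    for (std::size_t i = 0; i < cells.size(); ++i) {
        Cell& c = cells[i];
        c.timeInState += dt;
        next[i] = c.state;
        switch (c.state) {
        case ApState::Resting:
            // Excitation rule (simplified): a resting cell is excited when an
            // excited neighbor raises its potential above the threshold.
            for (int j : c.neighbors)
                if (cells[j].state == ApState::Upstroke ||
                    cells[j].state == ApState::Plateau)
                    next[i] = ApState::Stimulated;
            break;
        case ApState::Stimulated:
            next[i] = ApState::Upstroke;
            break;
        case ApState::Upstroke:                                  // placeholder durations
            if (c.timeInState > 2.0)   next[i] = ApState::Plateau;
            break;
        case ApState::Plateau:
            if (c.timeInState > 200.0) next[i] = ApState::Repolarizing;
            break;
        case ApState::Repolarizing:
            if (c.timeInState > 100.0) next[i] = ApState::Resting;
            break;
        }
    }
    // Synchronous update: all cells switch state at once.
    for (std::size_t i = 0; i < cells.size(); ++i)
        if (next[i] != cells[i].state) { cells[i].state = next[i]; cells[i].timeInState = 0.0; }
}
```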

Fig. 2 Automata and tetrahedron. a Action potential automaton. b Active force automaton. c Tetrahedron with one axis

3.2 Modeling mechanical contraction with anisotropic mass–spring systems

The active force is responsible for starting the contraction of the cardiac tissue. This force is coupled with the electrical potential of the cell: when the cell is stimulated, the concentration of calcium ions inside the cell increases, which triggers the cell contraction. The force development is delayed with respect to the stimulus because of this dependence on calcium ions. The passive force in our simulator is modeled as a mass–spring system, where masses are connected to neighboring masses by springs. The active force is applied to the system, deforming its spatial distribution, and the springs then try to bring the system back to its initial configuration. The cardiac tissue does not have a linear stress–strain relation; however, the linear model of Hooke's law can be used as a simplification.

The mechanical model was originally presented in [13]. In each tetrahedron, there are six springs. Three springs are placed along the anisotropy axes and are called axial springs. The other three springs are called angular springs and are placed between each pair of axial springs. Figure 2c illustrates a tetrahedron with one axis. The points where an axis intercepts the tetrahedron faces are called interception points. The active force is applied on each tetrahedron only along the fiber direction:

$$\begin{aligned} {\varvec{\mathcal {A}}}^t_{q0} = +p^t \mu a^t_j \widehat{{\varvec{\zeta }}}^t_j, \end{aligned}$$
(1)

where \(\mu \) is the active pressure parameter, given in N/\({\text {mm}}^2\), \(p^t \in [0,1]\) comes from the force automaton, \(\widehat{{\varvec{\zeta }}}^t_j\) is the normalized fiber direction and \(a^t_j\) is the area of the faces intercepted by the fiber. Finally, \({\varvec{\mathcal {A}}}^t_{q0}\) is the active force on an interception point.

Turning to the passive forces, we now present the equations that react to the active force and bring the system back to its initial geometry. The force that each axial spring applies on an interception point is based on Hooke's law:

$$\begin{aligned} {\varvec{f}}^h_{q0} = +\varDelta l_j \frac{E a^t_j}{l^0_j} \widehat{{\varvec{\zeta }}}^t_j, \end{aligned}$$
(2)

where \(\varDelta l_j\) is the displacement of the spring, \(l^0_j\) is the initial axial spring length, \(a^t_j\) is the area of the faces intercepted by the fiber and E is the elastic modulus associated with the spring on axis j. It is another system parameter, given in N/\(\mathrm{{mm}}^2\).

Next, the equations that control the angular forces (torsion springs) between a pair of axial springs are presented:

$$\begin{aligned} {\varvec{f}}^\alpha _{q0j} = k^{jk}_\alpha a^t_j\left( \widehat{{\varvec{\zeta }}}^t_j \cdot \widehat{{\varvec{\zeta }}}^t_k - \widehat{{\varvec{\zeta }}}^0_j \cdot \widehat{{\varvec{\zeta }}}^0_k\right) \widehat{{\varvec{\zeta }}}^t_j,\end{aligned}$$
(3)
$$\begin{aligned} {\varvec{f}}^\alpha _{q0k} = k^{jk}_\alpha a^t_j\left( \widehat{{\varvec{\zeta }}}^t_j \cdot \widehat{{\varvec{\zeta }}}^t_k - \widehat{{\varvec{\zeta }}}^0_j \cdot \widehat{{\varvec{\zeta }}}^0_k\right) \widehat{{\varvec{\zeta }}}^t_k. \end{aligned}$$
(4)

where \(k^{jk}_\alpha \) is the angular stiffness constant associated with axes j and k, and \(\widehat{{\varvec{\zeta }}}^t_j \cdot \widehat{{\varvec{\zeta }}}^t_k\) approximates the angle between axes \(\widehat{{\varvec{\zeta }}}^t_j\) and \(\widehat{{\varvec{\zeta }}}^t_k\).

The cardiac tissue is mostly composed of water and therefore its volume does not vary significantly. To implement this feature, we added a volume-preserving force:

$$\begin{aligned} {\varvec{f}}^\vartheta _{q0} = +\left( \vartheta _i^t-\vartheta _i^0\right) \frac{k_\vartheta a^t_j}{l_j^0} \widehat{{\varvec{\zeta }}}^t_j, \end{aligned}$$
(5)

where \(\vartheta _i^t\) and \(\vartheta _i^0\) are the current and initial volumes of tetrahedron i and \(k_\vartheta \) is a parameter associated with the volume preservation of the spring on axis j.

Finally, a damping force is added to prevent the system from oscillating indefinitely:

$$\begin{aligned} {\varvec{f}}^d_{q0} = +\left( {\varvec{v}}_{q0} - {\varvec{v}}_{q1}\right) \cdot \widehat{{\varvec{\zeta }}}^t_j \frac{\beta a^t_j}{l_0} \widehat{{\varvec{\zeta }}}^t_j, \end{aligned}$$
(6)

where \({\varvec{v}}_{q0}\) and \({\varvec{v}}_{q1}\) are the velocities of the interception points and \(\beta \) is the damping constant.

To conclude, the total force on every interception point of a tetrahedron is computed:

$$\begin{aligned} {\varvec{f}}^t_{q0} = {\varvec{\mathcal {A}}}^t_{q0}+{\varvec{f}}^h_{q0}+{\varvec{f}}^d_{q0}+ {\varvec{f}}^\alpha _{q0} + {\varvec{f}}^\vartheta _{q0}. \end{aligned}$$
(7)

Next, the forces on the interception points are distributed to the vertices of the tetrahedron faces via interpolation functions. Further details can be found in [13]. With the total force on each vertex in hand, the acceleration is found by Newton's second law:

$$\begin{aligned} {\varvec{a}}_i^t = \frac{{\varvec{f}}_i^t}{m_i}. \end{aligned}$$
(8)

The final step consists of integrating the system of equations for each vertex of the mesh to simulate the mechanical deformation of the tissue. We use the forward Euler method.
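A minimal sketch of this last step is given below, under the assumption that the per-vertex total force has already been gathered from the interception-point forces of Eq. (7); the containers and names are illustrative, not the FisioPacer data structures.

```cpp
// Minimal sketch of the explicit integration step (Eq. 8 + forward Euler).
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };

struct Vertex {
    Vec3 pos, vel, force;   // position, velocity, accumulated total force
    double mass = 1.0;
};

void forwardEulerStep(std::vector<Vertex>& vertices, double dt) {
    for (Vertex& v : vertices) {
        // Eq. (8): a = f / m
        const Vec3 acc = { v.force.x / v.mass, v.force.y / v.mass, v.force.z / v.mass };
        // Forward Euler: x(t+dt) = x(t) + v(t)*dt,  v(t+dt) = v(t) + a(t)*dt
        const Vec3 vOld = v.vel;
        v.pos.x += vOld.x * dt;  v.pos.y += vOld.y * dt;  v.pos.z += vOld.z * dt;
        v.vel.x += acc.x * dt;   v.vel.y += acc.y * dt;   v.vel.z += acc.z * dt;
        v.force = Vec3{};        // forces are re-accumulated at the next step
    }
}
```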

Fig. 3 Parallel FisioPacer implementation with OpenMP

Table 1 Parallel performance

3.3 Parallel FisioPacer

The FisioPacer simulator was adapted to perform parallel simulations with OpenMP. Inside the time integration loop, there is a loop over the elements that updates the CA states, the spring lengths, the forces, and so on. The element updates are independent of each other, so they can easily be executed in parallel. A static workload schedule was chosen for two reasons: (1) the execution time of each update is nearly the same, so the static division is enough to obtain a balanced workload; and (2) a dynamic schedule would imply an extra cost for handling threads, which is worsened by the fact that threads are forked and joined at every time step. Figure 3 shows the parallel algorithm.
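The element loop can be sketched as below; the update routines are placeholders standing in for the actual FisioPacer functions.

```cpp
// Minimal sketch of the parallel element loop of Fig. 3: the time loop is
// sequential, the element loop uses a static schedule, and threads are forked
// and joined at every time step.
void updateAutomaton(int element, double dt);  // CA states (AP and active force)
void updateSprings(int element);               // spring lengths and axis directions
void computeForces(int element);               // Eqs. (1)-(7) on interception points

void simulate(int nSteps, int nElements, double dt) {
    for (int step = 0; step < nSteps; ++step) {          // sequential time loop
        #pragma omp parallel for schedule(static)
        for (int e = 0; e < nElements; ++e) {            // independent element updates
            updateAutomaton(e, dt);
            updateSprings(e);
            computeForces(e);
        }
        // ...force gathering on vertices and the Euler step follow...
    }
}
```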

3.3.1 Results

In order to test the parallel performance, we simulated 500 ms of LV activity with the mesh presented in Sect. 2 and a 0.05 ms timestep. Tests were performed on an AMD Opteron 6272 processor running Linux Red Hat 4.4.7-4 with gcc 4.4.7. Table 1 shows the average execution time and speedup for each number of threads. The fastest executions occurred with 16 and 32 threads, taking less than 3 min with a speedup of 2.2. Increasing the number of threads does not necessarily decrease the execution time, for two main reasons: (1) the overhead to manage threads may consume a considerable percentage of the execution time, because the cost of forking and joining threads at every time step is high relative to the low cost of the MSS and CA simulator; and (2) when many threads run simultaneously, they may try to access memory at the same time, creating a bottleneck on this resource.

4 Methods part III: automatic tuning of parameters with genetic algorithm

Genetic algorithms (GA) are stochastic optimization methods inspired by the theory of evolution and natural selection. In summary, these techniques start with a random set of candidate solutions, and environmental pressure causes natural selection, in which the fittest individuals are more likely to survive and generate more offspring. Each individual is evaluated and assigned a value, the so-called fitness. The algorithm tries to maximize (or minimize) the population fitness by applying genetic operators (selection, recombination and mutation). In other words, at every generation a new population is created to replace the population of the previous generation, and the operators are applied to generate new individuals with better fitness. This is repeated for a limited number of generations or until an individual with a sufficiently small (or large) fitness appears.

The goal of our GA is to find the seven parameters of the mechanical model presented in Sect. 3. Therefore, each individual contains a candidate set of parameters, a floating-point vector \(\{p_i\}_{i=1}^{7}\) named chromosome, while a single parameter \(p_i\) is called a gene. The description of the parameters to be found is shown in Table 2.

Table 2 Genes description

The fitness comes from a comparison between the discrete model simulation and the MRI data. There are many different ways to select, mutate and recombine individuals; however, we describe only the operators relevant to this work.

4.1 Computing the fitness

The fitness is computed from a few LV measurements: (1) the ventricle heights (Fig. 4a), called longitudinal lengths, which are the distances between the apex and the base on both the epicardium and endocardium surfaces (\(a_1, a_2\)); (2) the base diameter of the epicardium surface (b); and (3) the base diameter of the endocardium surface (c), shown in Fig. 4b.

The measurements were done in the systolic and diastolic stages and the values are shown in Table 3.

The fitness is given by:

$$\begin{aligned} \Phi = \sum _{i\in \{s,d\}}\left[ e\left( a^i_1, \overline{a}^i_1\right) +e\left( a^i_2, \overline{a}^i_2\right) +e\left( b^i, \overline{b}^i\right) +e\left( c^i, \overline{c}^i\right) \right] , \end{aligned}$$
(9)

where e gives the relative error, \(i \in \{s,d\}\) indicates whether the computation refers to systole or diastole, \(a^i_1, a^i_2, b^i\) and \(c^i\) are the values obtained manually from the MRI data (Table 3), and \(\overline{a}^i_1, \overline{a}^i_2, \overline{b}^i\) and \(\overline{c}^i\) are the values extracted from the simulated mesh evaluated by the GA. The systole measurement was taken at 200 ms of simulation and the diastole measurement at 500 ms. Each individual simulated 500 ms of cardiac activity with a 0.05 ms timestep. The active pressure parameter (from Eq. 1) was \(\mu =100\,\mathrm{{Pa}}\).
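As an illustration, Eq. (9) can be coded as the sketch below, assuming the four measurements at systole and diastole are available as arrays; the extraction of the measurements from the deformed mesh is not shown.

```cpp
// Minimal sketch of Eq. (9). Each row holds the four measurements
// {a1, a2, b, c}; row 0 is systole, row 1 is diastole. 'target' comes from the
// MRI data (Table 3) and 'simulated' is extracted from the deformed mesh.
#include <cmath>

double relativeError(double simulated, double target) {
    return std::fabs(simulated - target) / std::fabs(target);
}

double fitness(const double target[2][4], const double simulated[2][4]) {
    double phi = 0.0;
    for (int i = 0; i < 2; ++i)          // i = 0: systole, i = 1: diastole
        for (int m = 0; m < 4; ++m)      // m: a1, a2, b, c
            phi += relativeError(simulated[i][m], target[i][m]);
    return phi;                          // smaller is better
}
```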

Fig. 4 Measurements for the fitness computation. a Longitudinal lengths. b Epicardium and endocardium bases

Table 3 LV measurements for fitness computation

4.2 GA operators

Selection consists of choosing individuals of the current population to be the parents of the new individuals inserted into the next offspring. It should favor the individuals with the best fitness; however, the other individuals must also have a chance. We used rank selection, which chooses individuals by their position (or rank) in a sorted population: the worst individual is assigned rank 1, the second worst rank 2, and so on, until the best individual has rank n, where n is the population size. The selection probability of an individual is proportional to its rank.
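A minimal sketch of this rank selection is shown below; the random-number handling is illustrative and not necessarily the one used in our implementation.

```cpp
// Minimal sketch of rank selection: individuals are sorted from worst to best,
// ranks 1..n are assigned, and a parent is drawn with probability proportional
// to its rank. The fitness of Eq. (9) is minimized, so larger Phi is worse.
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Returns the index (into 'fitness') of the selected parent.
int rankSelect(const std::vector<double>& fitness, std::mt19937& rng) {
    const int n = static_cast<int>(fitness.size());
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Sort indices from worst (largest Phi) to best (smallest Phi).
    std::sort(order.begin(), order.end(),
              [&](int a, int b) { return fitness[a] > fitness[b]; });
    // Individual order[k] receives rank k+1; total weight is n(n+1)/2.
    std::uniform_int_distribution<long long> dist(1, (long long)n * (n + 1) / 2);
    long long ticket = dist(rng);
    long long acc = 0;
    for (int k = 0; k < n; ++k) {
        acc += k + 1;
        if (ticket <= acc) return order[k];
    }
    return order[n - 1];   // safety fallback, not normally reached
}
```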

Once the parents are selected, they must generate new children. This is done by crossover; in this work, the blend crossover (BLX-\(\alpha \)) technique was used. It consists of randomly choosing a new gene \(o^i\) from an interval that depends on the parent genes \(p^i_1\) and \(p^i_2\) and on a user-defined parameter \(\alpha \), where \(i\in [1,7]\) is the index of the gene in the chromosome. Figure 5 shows the BLX-\(\alpha \) algorithm. This is repeated for each gene of the new individual until its chromosome is complete.

Fig. 5 Blended crossover algorithm
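The sketch below shows the standard BLX-\(\alpha \) rule for a single gene, in which the child gene is drawn uniformly from the parents' interval extended by a fraction \(\alpha \) on each side; Fig. 5 holds the exact variant used in this work, and clamping to gene bounds (if any) is omitted.

```cpp
// Minimal sketch of the standard BLX-alpha crossover for one gene.
#include <algorithm>
#include <random>

double blxAlpha(double p1, double p2, double alpha, std::mt19937& rng) {
    const double lo = std::min(p1, p2);
    const double hi = std::max(p1, p2);
    const double d  = hi - lo;
    // Child gene drawn uniformly from the extended interval.
    std::uniform_real_distribution<double> dist(lo - alpha * d, hi + alpha * d);
    return dist(rng);
}

// Applied gene by gene to build the child's chromosome, e.g.:
// for (int i = 0; i < 7; ++i) child[i] = blxAlpha(parent1[i], parent2[i], alpha, rng);
```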

In order to ensure genetic diversity in the population, mutation is applied: a perturbation of the genes of a new individual just after the crossover. This work uses non-uniform mutation, which produces a mutated gene \(\bar{o}_i\):

$$\begin{aligned} \bar{o}_i = \left\{ \begin{array}{ll} c_i+\varDelta (t,b_i-c_i) &amp; \quad \text {if } \tau =0\\ c_i+\varDelta (t,c_i-a_i) &amp; \quad \text {if } \tau =1 \end{array}\right. , \end{aligned}$$
(10)
$$\begin{aligned} \varDelta (t,y) = y\left( 1-r^\theta \right) ,\end{aligned}$$
(11)
$$\begin{aligned} \theta = \left( 1-\frac{g_t}{g_{\text {max}}} \right) ^b, \end{aligned}$$
(12)

where \(r\in [0,1]\) is a random number and \(\tau \) is a random number that can only be 0 or 1; \(g_t\) is the current generation, \(g_{\text {max}}\) is the last generation, and b is a user-defined parameter. The function \(\varDelta (t,y)\) generates a value in [0, y] that tends to 0 in the later generations.
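A minimal sketch of Eqs. (10)–(12) follows, under the assumption that \(c_i\) is the gene being mutated and \(a_i\) and \(b_i\) are its lower and upper bounds; the \(\tau =1\) branch is written with a minus sign, as in the standard non-uniform mutation, so that the gene remains inside its bounds.

```cpp
// Minimal sketch of non-uniform mutation (Eqs. 10-12). Assumes c is the gene,
// [a, b] are its bounds; the tau = 1 branch subtracts Delta (standard form).
#include <cmath>
#include <random>

double nonUniformMutation(double c, double a, double b,
                          int gen, int genMax, double bParam, std::mt19937& rng) {
    std::uniform_real_distribution<double> unif(0.0, 1.0);

    // Eq. (12): theta = (1 - g_t / g_max)^b
    const double theta = std::pow(1.0 - static_cast<double>(gen) / genMax, bParam);

    // Eq. (11): Delta(t, y) = y * (1 - r^theta), a value in [0, y] that
    // shrinks towards 0 in later generations.
    auto delta = [&](double y) { return y * (1.0 - std::pow(unif(rng), theta)); };

    const int tau = (unif(rng) < 0.5) ? 0 : 1;   // random binary choice
    return (tau == 0) ? c + delta(b - c)         // move towards the upper bound
                      : c - delta(c - a);        // move towards the lower bound
}
```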

When a new offspring takes the place of the previous one, it is not ensured that the best individual found so far will remain in the population. However, losing it can be catastrophic for the GA's success. To avoid this problem, the elitism technique is used: a certain number \(\lambda \) of the best individuals is kept in the next generation, where \(\lambda \) is a parameter of the algorithm.

Table 4 shows the GA parameters used in this work.

Table 4 GA parameters

5 Methods part IV: multilevel parallelism scheme

In our first implementation, the GA individuals were simply executed in parallel. Only the fitness computation was placed in the parallel region of the code, since the genetic operators are not computationally expensive and therefore are not worth parallelizing. Since some parameter sets may not finish the computation due to stability issues of the numerical method, a dynamic workload division was chosen, as shown in Fig. 6. This parallel loop over the individuals is inside an outer loop over the generations, which must be computed sequentially, so there is also an overhead due to thread forks and joins.

Fig. 6 OpenMP parallel genetic algorithm
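A minimal sketch of this first, outer-loop-only parallelization is given below; evaluate() stands in for a full FisioPacer run followed by Eq. (9).

```cpp
// Minimal sketch of the outer-loop parallelization of Fig. 6. A dynamic
// schedule is used because a parameter set may abort early due to numerical
// instability, making the per-individual workloads uneven.
#include <vector>

double evaluate(const std::vector<double>& chromosome);  // runs the simulator, returns Phi

void evaluateGeneration(const std::vector<std::vector<double> >& population,
                        std::vector<double>& fitness) {
    #pragma omp parallel for schedule(dynamic, 1)
    for (int i = 0; i < static_cast<int>(population.size()); ++i)
        fitness[i] = evaluate(population[i]);
    // Selection, crossover, mutation and elitism run sequentially after the join.
}
```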

This strategy works fine when all processors are busy. However, if the number of individuals is smaller than the number of available processors, the strategy loses efficiency. A simple idea to solve this is nested parallelism, since each individual executes a FisioPacer simulation that can itself run in parallel, as presented in Sect. 3.3. However, using nested parallelism arbitrarily may even worsen the performance.

To tackle this problem, we propose a scheme that automatically changes the parallelism level of our code. Preferably, the outer loop (GA individuals) is parallelized, because this lowers the overhead cost. At run time, however, our scheme starts the nested parallelism whenever there are idle processors, allowing the parallel execution of the inner loop (FisioPacer element updates) in order to maximize processor usage. Therefore, the tradeoff between overhead and nested parallelism is controlled automatically.

The thread distribution from the outer to the inner loop works as follows. Initially, only the outer loop runs in parallel. As the individuals finish their jobs, they inform the GA how many processors they were using. The GA waits until the number of individuals still computing is smaller than the number of available threads; when this happens, the inner loops start to execute in parallel. The distribution of free processors follows a greedy strategy, where every individual starts with one thread only. Every time an individual finishes its job, it signals the number of processors it was using, and these processors are then distributed, one by one, to the individuals that are still running; the individuals holding the fewest processors receive a new processor first.

The number of threads per individual is stored in a vector threadPool, whose size is the population size. Whenever a thread finishes, this vector is updated. Individuals that are still running read the vector every time the loop over mesh elements starts and are therefore able to run with an updated number of threads. This scheme is illustrated in Fig. 6.
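A minimal sketch of the greedy redistribution follows; the synchronization around the shared vectors, and the call each running individual makes before its element loop to pick up the updated thread count, are omitted.

```cpp
// Minimal sketch of the greedy redistribution through the threadPool vector.
// threadPool[i] is the number of threads individual i may use (initially 1 for
// all). When individual 'finished' ends, its threads are handed one by one to
// the running individuals with the fewest threads, capped at 16 (the best
// FisioPacer speedup, Sect. 3.3.1). Locking of the shared vectors is omitted;
// each running individual reads its entry before every element loop and sets
// its number of threads accordingly.
#include <vector>

const int kMaxThreadsPerIndividual = 16;

void releaseThreads(int finished, std::vector<int>& threadPool,
                    const std::vector<bool>& stillRunning) {
    int freed = threadPool[finished];
    threadPool[finished] = 0;
    while (freed > 0) {
        int target = -1;                       // running individual with fewest threads
        for (int i = 0; i < static_cast<int>(threadPool.size()); ++i)
            if (stillRunning[i] && threadPool[i] < kMaxThreadsPerIndividual &&
                (target < 0 || threadPool[i] < threadPool[target]))
                target = i;
        if (target < 0) break;                 // nobody can take more threads
        ++threadPool[target];
        --freed;
    }
}
```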

Since the FisioPacer parallel performance achieves its best speedup with 16 threads, our scheme sets 16 as the limit of threads per individual.

Figure 7 illustrates the differences among the three levels of parallel code. In the figure, a job represents the FisioPacer simulation run by an individual during the GA execution. Each job can be split into three parts, and when a job runs in parallel (inner loop), each part is represented by a small rectangle. In this example, we intend to share two jobs among three processors. In the first case, only the outer loop runs in parallel: two processors receive one job each, while the third processor is idle. In the second case, only the inner loop is parallel, which results in an even workload sharing but incurs a high thread-management cost at every iteration. The third case shows the scheme proposed in this paper, where the outer loop is preferably run in parallel but the inner loop can also be parallelized to avoid idle processors. In this way, one full job is assigned to a processor, while the other two processors share the parts of the second job.

Fig. 7 Possible levels of parallelism: outer, inner and multilevel

6 Results

In this section, the GA results are presented: the best set of parameters found and the LV simulation obtained with these parameters. Then the parallel performance is evaluated, where the multilevel parallel scheme is tested with different numbers of individuals and threads.

6.1 LV genetic algorithm

The GA was run three times with the parameters presented in Table 4. Figure 8a shows the behavior of the population fitness. In early generations there is more genetic diversity, and therefore the fitness values differ considerably among individuals. In later generations, the GA performs local exploitation, which means less genetic diversity, and the fitness values become more similar.

Fig. 8 Population fitness over generations and the LV behavior simulated with the GA best solution. a Fitness evolution. b Diastole compared with simulated systole

The best set of parameters found by the GA is shown in Table 5. FisioPacer was run with these parameters; the simulated systolic geometry can be compared with the diastolic geometry in Fig. 8b. We can see a decrease in the longitudinal heights and base diameters, while the LV wall thickness increased during systole in order to preserve the tissue volume. The evaluation measurements of the simulated LV at systole (200 ms) and diastole (500 ms) are listed in Table 6.

Table 5 Best set of parameters obtained by GA
Table 6 Simulated lengths and percentage relative errors of systole and diastole

6.2 Multilevel parallelism scheme

Tests were performed in the same environment as in Sect. 3.3.1. To test the multilevel parallelism scheme, we ran the GA with the new scheme and with the simple parallelization of the outer loop. Each scheme was executed with two population sizes, 128 and 150. The first size yields an even division for every number of threads tested (1, 2, 4, 8, 16, 32, 64). The second size does not yield an even division for every thread count, which harms the workload balance; therefore, the impact of the multilevel parallelism is expected to be more evident.

Every test was run three times; the average execution times are presented below, and the standard deviation was less than 4 % in all cases. Due to the long execution time, especially for the sequential code, we ran our GA for only one generation. All other GA parameters were the same as presented in the previous section.

Table 7 shows the average execution time, the speedup compared to the sequential code, and the gain, which indicates how many minutes faster the new scheme was and the corresponding percentage. The gains of the proposed scheme are modest with 128 individuals. This happens because the scheme only acts when there are idle processors, and since this population size yields an even division of individuals among threads, the scheme is underused. Nevertheless, the gain reached 9 % with eight threads. The tests with 150 individuals are presented in Table 8. The sequential code took more than 11 h to run. The simple (outer loop only) parallelism speedup varied from 2 to 28.9 with 2 and 64 threads, respectively. As expected, the gains of the proposed scheme were higher when the workload was unbalanced: its speedup varied from 2.1 to 29. The best gain happened with eight threads, where it was 11 min faster than the simple scheme, i.e., 12 % faster.

Table 7 Parallel performance comparison with 128 individuals
Table 8 Parallel performance comparison with 150 individuals

In all tests, there is no considerable increase in speedup from 32 to 64 threads, and the multilevel scheme shows no gain with 64 threads either. This is due to the processor architecture, which contains 64 processors but only 32 floating-point units. Since the simulator mainly performs floating-point operations, this limits the performance.

Moreover, there are other limitations, such as the memory bottleneck, the extra cost of forking and joining threads, and the additional overhead of the greedy strategy for thread distribution. These factors strongly limit the achievable speedup. Nevertheless, the multilevel parallelism resulted in up to 12 % of performance improvement; when the GA is executed for 200 generations with eight threads, it finishes 36 h earlier than the simple method.

It is important to highlight that this method is useful for balancing the workload regardless of the number of individuals. One may think that the balance issue could be tackled simply by choosing a population size divisible by the number of threads. Nevertheless, a parameter set may not be good enough for the numerical method, which may abort the simulation due to stability problems. Thus, it is not possible to ensure a fair workload only by setting up the population size.

7 Conclusions

We presented the parallel implementation of a cardiac tissue simulator based on mass–spring systems and cellular automata. We also used techniques available in the literature to extract the human left ventricle geometry from MRI data. To make our simulator reproduce the systolic and diastolic geometries, we used a genetic algorithm. The GA computational cost is high, and therefore we applied techniques to run it in parallel. A new scheme was proposed that automatically decides which loop to run in parallel. Preferably, the outer loop, which executes the GA individuals, is run in parallel. Due to the unbalanced workload, some processors may become idle; our scheme then uses the idle processors to run each instance of the simulator in parallel. This improved the performance, reducing the total execution time by up to 12 %.