
1 Motivation

As we are currently witnessing the end of Moore’s law, questions are being raised about our capacity to solve challenging computational problems in a reasonable amount of time [16]. This could appear to be a vain debate, since a wide range of problems are efficiently addressed by today’s computer technologies. However, such an argument is questionable because it neglects the fact that the usefulness of Computer Science in various scientific domains is constantly growing; as a consequence, the set of new and challenging computational problems becomes broader every day (for instance, Molecular Dynamics codes now include sophisticated visualization modules, multi-scale physical models coupling classical n-body simulations with chemistry or quantum physics, interactive processing, etc.). In addition, computer history gives many examples of technologies that, because they led to a major increase in computing power, supported new revolutions. In other words: developing more powerful platforms always creates more needs.

To continue speeding up the resolution of challenging computational applications, two classes of approaches are generally proposed. The first class deals with alternative machines and/or computing models in which fundamental concepts of current machines are replaced by other mechanisms (as when Von Neumann architectures moved to RISC, superscalar or VLIW designs [13]). Examples of such alternatives include quantum computers, dataflow machines or neural-network-based machines. In the second class, the idea is to enrich current computing models with new features in order to continue scaling. A good illustration is given by multicore architectures: since building ever more powerful single processors is no longer feasible, several cores are gathered on the same chip to obtain more power.

In this paper, we concentrate on the second class of approaches. In particular, we are convinced that there is a neglected model of parallelism, suggested by Flynn’s classification, that can break the limits observed in the resolution of several hard computational problems.

Parallel processing is usually presented late in the French academic curricula (it is only rarely addressed at the undergraduate level). The obvious consequence is that students are trained to think sequentially. Teaching some basic principles from a historical perspective is a good way to prepare students’ minds for the unfamiliar concepts of parallelism. The neglected model of parallelism discussed in this paper is easy to deploy on a cluster or a multicore system. It is also a good illustration of concurrent programming and of the synchronization of parallel processes.

1.1 A New Look at the Old Time

Historically, Flynn’s taxonomy [7] served as a clear construct for thinking about parallelism. He introduced a classification in the way the French savants of the Enlightenment did in the eighteenth century with the Encyclopedia [5], in their effort to classify and organize scientific knowledge. This taxonomy rests on two concepts for building parallel organizations: the stream of instructions and the stream of data. Depending on the multiplicity of these streams, Flynn defined all possible combinations of instructions and data, leading to four classes of organizations: SISD (Single Instruction Single Data), corresponding to the classical Von Neumann processor, SIMD (Single Instruction Multiple Data), MISD (Multiple Instructions Single Data) and MIMD (Multiple Instructions Multiple Data). In his original work [7], Flynn also discussed the effectiveness of the various organizations: he located existing computer technologies in his taxonomy and identified the fundamental problems raised by each organization.

Flynn’s taxonomy conceptualized parallelism at the level of machine instructions. This conceptualization inspired other models, in which the control of parallelism is placed at a higher level or layer. Thus, at the application level, the SPMD (Single Program Multiple Data) [4] and MPMD (Multiple Program Multiple Data) models were introduced.

Regarding MISD organizations, Flynn concluded that they were of little interest [8]. This opinion is still shared today by most parallel computing experts and students; beyond the model of systolic arrays (whose classification as MISD is debatable [13]) and replication systems, the community considers that there are only a few examples where an MISD architecture could be of interest.

This work aims at putting the emphasis back on MISD organizations. With the end of Moore’s law, we are convinced that increasing the degree of parallelism of large-scale parallel platforms is becoming the main lever for building powerful machines. To fully benefit from this parallelism, and in particular to reach significant speedups, MISD models could be the key. The model we propose to consider is the discrete resource sharing model (DRSM) [1]. This abstract model and its practical counterpart (the algorithm portfolio) are the missing brick in Flynn’s classification. It is detailed in the next section.

1.2 Informal Presentation of the Discrete Resource Sharing Model

We consider in this work the discrete resource sharing model (DRSM), in which the control of parallelism is done at the application level. We consider a set of parallel algorithms (denoted by \(\mathcal{A}\)) solving the same problem; these algorithms provide the exact solution (though one may also consider algorithms that provide an approximation of the optimum for optimization problems). Let us assume a parallel platform composed of homogeneous computing units, which can be processors, cores or virtual machines. Each algorithm can run on a subset or on the whole set of the computing units, with its own execution time (which may differ from one algorithm to another). DRSM defines a concurrent run of several algorithms in \(\mathcal{A}\) in which each computing unit is assigned to at most one algorithm. During the execution, any instance of the problem is processed concurrently by some algorithms in \(\mathcal{A}\), depending on the computing units assigned to each algorithm. The concurrent runs are stopped as soon as one algorithm finds a solution. The link between DRSM and MISD (or MPMD) organizations is natural if we consider that the algorithms assigned at least one computing unit are streams of instructions that operate on the data of the problem instance to solve.
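To make this execution scheme concrete, here is a minimal Python sketch of a DRSM-style concurrent run at the application level. The solver callables and the instance object are hypothetical placeholders, and a Unix fork-based multiprocessing start method is assumed:

```python
import multiprocessing as mp

def _worker(solve, instance, out):
    # One concurrent "stream of instructions" operating on the shared
    # problem data, in the spirit of an MISD organization.
    out.put((solve.__name__, solve(instance)))

def run_portfolio(solvers, instance):
    """Run every solver of `solvers` (hypothetical callables) on the same
    instance and return the first answer found; the remaining runs are
    interrupted as soon as one solver terminates."""
    out = mp.Queue()
    procs = [mp.Process(target=_worker, args=(s, instance, out)) for s in solvers]
    for p in procs:
        p.start()
    winner, solution = out.get()      # blocks until one algorithm answers
    for p in procs:
        p.terminate()                 # halting condition of the DRSM
        p.join()
    return winner, solution
```

Each solver can internally exploit its allocated \(s_i\) computing units (for instance by spawning threads); the sketch only captures the "first answer wins" halting condition.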

1.3 Contributions and Content

With DRSM, the resolution of a computational problem (denoted by \(\varPi \)) is formulated as a cooperative execution of multiple algorithms that concurrently solve the same instances of \(\varPi \). To demonstrate the interest of thinking about parallelism under this vision, the paper is organized in three parts. First, we propose a formal model for building cooperative executions of algorithms in DRSM (Sect. 3). Our model assumes a statistical (context-aware) modeling of the instances of \(\varPi \); the main challenge is to decide on the best allocation of computing units to the parallel algorithms. In the second part (Sect. 4), we analyze the runtime gains of DRSM when applied to the resolution of the classical SAT decision problem. The analysis is based on a performance evaluation that uses data from the SAT competition. Finally, Sect. 5 discusses some perspectives opened by DRSM for the development of parallel processing systems and machines.

It is worth noting that DRSM was already considered in our prior work [1, 2]; here we go one step further. In this previous work, we introduced the discrete Resource Sharing Scheduling Problem (DRSSP) for deciding on the resource allocation in DRSM. This paper introduces new variants of the initial formulation that are better suited to specific resource allocation situations depending on both machine architectures and algorithms (namely, the portfolio of teams, the equal sharing portfolio and the QoS portfolio, detailed in Sect. 3). Moreover, in comparison to our past work, we provide a better proof of concept of the runtime gain induced by DRSM on the SAT problem: whereas our past evaluations [1, 2] assumed a known theoretical model for the speedup of SAT solvers, the experiments proposed here do not rely on such an assumption.

2 Related Works

As already mentioned, Flynn’s work inspired further parallel computing taxonomies, which can be grouped into the following two categories.

  • The first category corresponds to studies that extend Flynn’s classification, for instance by putting the parallelism control at the application level instead of at the hardware level. This is the case of the well-known SPMD and MPMD classes, where programs generalize instructions. This is also the case when considering memory access patterns in MIMD machines, with shared memory, distributed memories, and uniform or non-uniform memory accesses (UMA and NUMA) [12].

  • The second category of works proposed alternative taxonomies that use other foundations to distinguish between parallel machines. In this spirit, Feng [6] introduced a taxonomy in which machines are distinguished by the number of bits processed in parallel in a word. Despite its interest, this proposal never had the impact of Flynn’s taxonomy; one recurring criticism is that it does not make a clear distinction between pipelining and parallelism [10]. Another alternative taxonomy was proposed by Händler [10], in which parallel machines are categorized according to their number of control units, of arithmetic and logical units, and of elementary logic circuits. With this model, Händler classified several actual computer machines. However, as recognized by the author himself, one of its limitations is that it is specifically tied to a given Von Neumann architecture. This differs from Flynn’s taxonomy, which is formulated over a more abstract model that may or may not be realized by a Von Neumann machine.

The work proposed in this paper is related to algorithm portfolios. Indeed, DRSM was inspired by the concept of algorithm portfolio introduced by Huberman et al. [11] for the resolution of hard computational problems. The original motivation for algorithm portfolios can be summarized as follows: hard computational problems are often solved with heuristics based on randomization, and usually, for the same problem, several randomized heuristics can be used. However, the quality of their results will certainly differ and, despite randomization, their runtimes may remain expensive. The question then is how to use these heuristics to solve the problem. For this purpose, Huberman et al. proposed to follow practices developed in finance to minimize the risks of investments: given an initial capital that could be invested in several assets, financial agents generally prefer to distribute the capital among the assets instead of investing in a single one. Viewing the various assets as randomized heuristics, Huberman proposed to solve hard computational problems by launching several randomized heuristics that each solve the problem. As soon as one heuristic finds a solution, the execution is interrupted.

Huberman promoted an economic approach whose idea is to invest in several algorithms to solve a computational problem. A critical question is to determine how much to invest in each heuristic in order to ensure an economy of time. In their introductory paper, Huberman et al. discussed investments expressed as fractions of processor clock cycles. For instance, given two heuristics \(h_1\) and \(h_2\) and a single CPU, one can run \(h_2\) every other clock cycle (attributed to the portfolio of algorithms) and \(h_1\) on the remaining cycles. Such a proposition, however, supposes that the execution can be controlled at the clock-cycle level, which might be challenging with parallel randomized heuristics in a multiprocessor context. Other authors proposed to define the execution of the algorithm portfolio based on time slots (time sharing) [14] or on the number of CPUs or cores (resource sharing) [2]. The interest of these latter models is that the execution can be controlled at the application level. For instance, in time sharing, an internal counter can be used to aggregate the cumulative running time allocated to each individual heuristic. In this paper, we mainly focus on the resource sharing model that we introduced in our prior work [2]. Nonetheless, let us observe that our contribution can be extended to other algorithm portfolio models.
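As an illustration of the time-sharing variant, the following sketch (our own illustration, not the formulation of [11] or [14]) keeps a cumulative time counter per heuristic and runs the heuristics in a round-robin fashion; the incremental `step(budget)` interface of the heuristics is a hypothetical assumption:

```python
import time

def time_sharing_portfolio(heuristics, shares, slice_s=0.1):
    """Round-robin time sharing between (hypothetical) incremental heuristics.
    `heuristics[i].step(budget)` is assumed to run heuristic i for at most
    `budget` seconds and return a solution or None; `shares[i]` is the
    fraction of each round allocated to heuristic i."""
    spent = [0.0] * len(heuristics)           # cumulative time counters
    while True:
        for i, h in enumerate(heuristics):
            budget = slice_s * shares[i]
            start = time.perf_counter()
            solution = h.step(budget)
            spent[i] += time.perf_counter() - start
            if solution is not None:          # halting condition
                return i, solution, spent
```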

3 The Discrete Resource Sharing Model

We formally define a DRSM by a triple \(\varGamma = (\mathcal{A}, \mu , S)\) where \(\mathcal{A}\) is an ordered set of parallel algorithms, \(\mu \in \mathbb {N}^+\) is the number of parallel computing units to use and \(S = [s_1,\dots ,s_k]\) (\(k = |\mathcal{A}|\)), with \(s_i \in \{ 0,\dots ,\mu \}\), defines a resource allocation of the algorithms of \(\mathcal{A}\) to the computing units. In this definition, the \(i^{th}\) algorithm \(A_i\) of \(\mathcal{A}\) is associated with \(s_i\), the number of computing units allocated to it. We must also have \( \sum _{i=1}^k s_i \le \mu \) at any time slot.

\(\varGamma \) is associated with a halting condition expressed by the economic gain targeted in the execution. In this paper, we will focus on the economy of time. Thus, \((\mathcal{A}, \mu , S)\) implies that on each problem instance \(I_j\), each algorithm \(A_i\) runs on \(I_j\) using \(s_i\) computing units until one algorithm finds a solution.

This definition is restricted to the homogeneous setting, where the computing units typically correspond to identical CPUs, cores, etc. The halting condition above targets the economy of time, but other objectives are possible. For instance, distributed computing systems are nowadays associated with a pricing model in which users pay depending on the CPU time, memory size or any other resource that they consume. In such a context, the halting condition can be defined as follows: stop when the execution price exceeds a given threshold. Such a criterion is particularly meaningful if the execution of the algorithms in \(\mathcal{A}\) generates intermediate solutions (e.g. local search or anytime algorithms). As already mentioned, an important question in DRSM is how to determine the \(s_i\). We propose a basic formulation model for this purpose in the next section and give some examples of its variants.

3.1 The Discrete Resource Sharing Problem (DRSSP)

Base Formulation. To decide on the resource sharing, we consider a context in which a computational problem \(\varPi \) is represented by a finite set \(\mathcal{I}\) of n instances. Each instance \(I_j\) has an individual representativity, modeled as a weight \(w_j \in [0,1]\). We assume that the running times \(C(A_i, I_j, s_i)\), \(j = 1 \ldots n\), spent by \(A_i\) to process \(I_j\) with \(s_i\) computing units are known a priori. The objective in DRSSP is to choose the vector S that minimizes \(\sum _{j=1}^n w_j \cdot \mathcal{C}(S, I_j)\) where

$$\begin{aligned} \mathcal{C}(S, I_j) = \underset{A_i \in \mathcal{A}}{\min } C(A_i, I_j, s_i) \end{aligned}$$

One can consider the representativity of an instance I as an estimation of the probability that the instance we want to solve (at any given time) is I or has a runtime close to that of I. In this case, the optimization function in the above definition can also be associated with average-case complexity: as defined, it minimizes the average runtime over the processing of \(\mathcal{I}\). Other objective functions are possible, such as worst-case complexity or energy minimization. In the former case, the objective becomes to minimize \(\underset{I_j}{\max }~\mathcal{C}(S, I_j)\).
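The following sketch makes the objective explicit: it evaluates \(\sum _{j} w_j \cdot \mathcal{C}(S, I_j)\) (or its min-max variant) for a candidate allocation and searches for the best allocation by brute force. The cost data `C[i][j][s]` are assumed to be given, with `C[i][j][0]` set to infinity by convention; the exhaustive search is for illustration only and does not scale:

```python
from itertools import product

def portfolio_cost(S, C, weights, worst_case=False):
    """Cost of allocation S. C[i][j][s] is the runtime of algorithm i on
    instance j with s computing units; C[i][j][0] is assumed to be
    float('inf'), meaning algorithm i is not run."""
    per_instance = [min(C[i][j][S[i]] for i in range(len(S)))
                    for j in range(len(weights))]
    if worst_case:
        return max(per_instance)                               # min-max objective
    return sum(w * c for w, c in zip(weights, per_instance))   # weighted average

def best_allocation(C, weights, mu):
    """Exhaustive search over all S with sum(S) <= mu (exponential in k;
    DRSSP is NP-hard, so this is for illustration only)."""
    k = len(C)
    feasible = (S for S in product(range(mu + 1), repeat=k) if sum(S) <= mu)
    return min(feasible, key=lambda S: portfolio_cost(S, C, weights))
```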

We introduced this base formulation (without weights) in our prior work [1]. For a hard combinatorial problem like the classical satisfiability problem SAT [9], \(\mathcal{I}\) can be chosen as one benchmark, or a union of benchmarks, of \(\varPi \). Each instance \(I_j\) is associated with a weight \(w_j \in [0,1]\); the weights account for the individual representativity of instances. For example, NP-completeness proofs are usually based on worst-case analyses over specific classes of instances; as a result, benchmarks for NP-complete problems often distinguish between instances that are really hard and instances that are easy to solve. Thus, if a user aims at solving more hard instances than easy ones, he/she can adjust the weights accordingly. Weights can also be used to express a preference between algorithms: one algorithm may be preferred to the others because it provides faster solutions, in which case it is worth introducing weights to put emphasis on it. By default, we recommend the uniform distribution (\(w_j = \frac{1}{n}\)), which in finance corresponds to an equally weighted portfolio.

DRSSP proposes an explicit cost function for the portfolio execution time expressed by the individual runtime of algorithms. It is important to notice that this formulation does not consider the runtime overhead induced by the concurrent run of the algorithms. Let us now derive several variants showing how to adapt the basic formulation to concrete examples.

Portfolio of Teams. The first example is to build a portfolio of algorithm portfolios. Such a situation can be motivated as follows. Assume that the computing platform consists of 16 cores belonging to 2 identical CPUs (8 cores per CPU), and that \(A_1\) is run on 8 cores. According to the formulation of Sect. 3.1, we should expect the same running time from \(A_1\) whether it is deployed on a single CPU or on cores of both CPUs. Unfortunately, this is not realistic once communication costs are taken into account. For a more realistic portfolio formulation, it is possible to avoid combinations in which an algorithm is deployed on distinct CPUs. A portfolio of teams can be used for this purpose: the algorithms are grouped into teams, each associated with a DRSM defined over a subset of the resources. In the case of two teams whose resource sharings are defined by \(Q = [q_1,\dots , q_k]\) and \(R = [r_1, \dots , r_k]\) such that \(\sum _{i=1}^k q_i \le 8\) and \(\sum _{i=1}^k r_i \le 8\), the runtime of the portfolio on \(I_j\) is \(\min \{\mathcal{C}(Q, I_j), \mathcal{C}(R, I_j)\}\).

Equal Sharing Portfolio. The second example is when DRSSP serves to build DRSMs that consist of the execution of several sequential algorithms (the equal sharing portfolio). This variant allows one to derive parallel algorithms from sequential ones in a simple way; it has been used in several winning solvers of the SAT competition. We formally define it as follows: a resource allocation is a vector \(S = [s_1, \dots , s_k]\) where \(s_i \in \{0,1\}\) and \(\sum _{i=1}^k s_i \le \mu \). The DRSSP question is then to find an allocation that minimizes \(\sum _{j=1}^n w_j \cdot \mathcal{C}(S, I_j)\).

It is important to notice that when \(\mu \ge k\), the question is straightforward and the optimal solution is the vector \(S = [1,1,\dots ,1]\). If instead \(\mu < k\), then there are at least \(\left( {\begin{array}{c}k\\ \mu \end{array}}\right) \) potential portfolio executions.
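For small values of \(k\) and \(\mu \), this enumeration can be written directly; the sketch below assumes measured sequential runtimes `C[i][j]` (hypothetical data) and simply scans the candidate subsets:

```python
from itertools import combinations

def best_equal_sharing(C, weights, mu):
    """Pick the subset of at most `mu` sequential algorithms minimizing the
    weighted average runtime; C[i][j] is the runtime of algorithm i on
    instance j, measured on one computing unit."""
    k, n = len(C), len(weights)
    best_subset, best_cost = None, float("inf")
    for size in range(1, min(mu, k) + 1):
        for subset in combinations(range(k), size):
            cost = sum(weights[j] * min(C[i][j] for i in subset) for j in range(n))
            if cost < best_cost:
                best_subset, best_cost = subset, cost
    return best_subset, best_cost
```

Note that, since adding an algorithm can only decrease the per-instance minimum, only subsets of maximal size actually need to be examined; the loop over smaller sizes is kept for clarity.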

As already said, the equal sharing portfolio captures the situation where a portfolio is built by combining sequential algorithms, for instance on a cluster of identical CPUs on which the sequential algorithms \(A_i\) can be run. The equal sharing portfolio remains interesting even in the case of parallel algorithms. Indeed, to build the optimal solution of the base DRSSP formulation, we need a cost estimation \(C(A_i, I_j, s_i)\) for each algorithm, instance and number of processors. To avoid the large overhead of collecting all these values, one can instead consider, for each algorithm \(A_i\), a single number of processors \(s_i^*\) on which the instances are evaluated. We then obtain a formulation close to the equal sharing portfolio: each algorithm \(A_i\) runs on \(s^*_i\) processors and \(\sum _{i=1}^k s^*_i \le \mu \).

Portfolio with Quality of Service. The last DRSSP variant we present is the case where the algorithms \(A_i\) are heuristics solving an optimization problem (such as the Traveling Salesman Problem). In this case, each instance \(I_j\) and algorithm \(A_i\) can be associated with an instance performance guarantee \(\rho _{i,j}\), defined as the ratio between the tour length found by \(A_i\) and a lower bound on the optimal tour length. The smaller \(\rho _{i,j}\), the better the solution found by \(A_i\). Now, with the above DRSSP formulations, on an instance \(I_u\) the algorithm \(A_l\) that causes the interruption of the portfolio could be the one whose ratio \(\rho _{l,u}\) is maximal among the \(\rho _{i,u}\), \(1 \le i \le k\). This means that the result returned by a DRSM generated from DRSSP could be the one whose quality is the worst with respect to the instance performance guarantee. Thus, an important question is how to extend DRSSP to optimize the quality of the results. A simple solution is to change the halting condition: for instance, the execution of the portfolio is interrupted when \(k'\) algorithms (\(1 < k' \le k\)) have found a solution, and the best of these \(k'\) results is the solution of the portfolio.
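A possible sketch of this relaxed halting condition is given below; the `result_stream` iterator, which yields `(algorithm, tour_length, lower_bound)` tuples as the concurrent runs complete, is a hypothetical interface:

```python
def best_of_k_prime(result_stream, k_prime):
    """Consume (algorithm, tour_length, lower_bound) tuples as they are
    produced by the concurrent runs, stop after the first k' answers and
    return the one with the smallest ratio rho."""
    answers = []
    for algo, length, lower_bound in result_stream:
        answers.append((length / lower_bound, algo, length))
        if len(answers) == k_prime:        # relaxed halting condition
            break
    rho, algo, length = min(answers)       # best quality among the k' answers
    return algo, length, rho
```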

From our prior work, it is easy to establish that all the described DRSSP variants remain NP-hard. An interesting question is thus the design of efficient heuristics for their resolution; however, this will not be discussed in this paper. Instead, the next section proposes a performance evaluation whose goal is to demonstrate the interest of building DRSMs (based on DRSSP) for the SAT problem.

4 Application to SAT

We illustrate the power of the portfolio approach with two series of experiments. Each series considers a particular scenario for creating a parallel solver for the SAT problem: the first one uses the base parallel portfolio, the second one the equal sharing portfolio.

In the first series of experiments, we consider the construction of a portfolio of solvers built by combining several parallel SAT solvers in a multicore context. The portfolio was built from the running time distributions of 6 existing parallel SAT solvers, with uniform weights in the objective function. The resource sharing problem to solve in this series is a base DRSSP in which the running times are only defined for solvers run on 1, 8 and 32 cores; these running times come from a public database of SAT solvers. We distinguish two cases in these experiments. In the first case, we simulate a portfolio with the 6 parallel solvers and 300 instances; the results clearly show that there is a dominant solver. We then ran another simulation in which the dominant solver was excluded. In both cases, we compared the running time of the portfolio of solvers with the time of the best parallel solver on 32 cores.

The experimental results are depicted in Fig. 1 and lead to several conclusions. The first lesson learned is that we can effectively benefit from parallelism by combining several SAT solvers according to a resource sharing obtained by solving a DRSSP instance. The second lesson is that we are able to build a portfolio of solvers that outperforms existing parallel algorithms: from Fig. 1, one can notice that on 32 cores the optimal portfolio was better than the best parallel algorithm available for this number of cores. The third lesson is that the greater the number of resources, the better the portfolio.

Fig. 1. Runtime of the base parallel portfolio

Table 1. Experimental plan for the second series

Fig. 2. Experimental evaluation

In the second series of experiments, we consider a portfolio of solvers built by combining sequential solvers according to DRSSP. The resource sharing problem to solve in this setting corresponds to the equal sharing portfolio with uniform weights presented in the previous section. We measured the runtime gain induced by the portfolio of solvers (over the best sequential one) and the number of SAT instances that were solved. Indeed, as the resolution of some SAT instances may be highly time-consuming, we introduced in practice a maximal cutoff time: if the solver answers before the cutoff time, then we know whether the SAT instance is satisfiable or not; otherwise, we conclude that the solver was not able to provide an answer. In these experiments, the equal sharing portfolio problem was built from the running time data of 3 sessions of the 2013 SAT competition. The chosen sessions are: (1) Core solvers, Sequential, Random SAT+UNSAT (Random SAT+UNSAT session), (2) Core solvers, Sequential, Hard-combinatorial certified UNSAT (Hard Certified UNSAT session), (3) Core solvers, Sequential, Application certified UNSAT (Application Certified UNSAT session). The data of our experimental plan are summarized in Table 1.

Figure 2 depicts the running time of the built portfolio and the number of instances we were able to solve. As one can notice, we clearly benefit from parallelism by combining several sequential algorithms. In addition, the built portfolio also increases the number of SAT instances solved. The speedup observed in these experiments was not linear and did not change significantly between 4 and 8 cores. These results suggest a leadership phenomenon similar to the one observed in team sports, where a subgroup gives the whole team a boost: here, a subset of complementary solvers dominates the others. To improve the speedup further, one should create new leaderships by considering a more diversified base of sequential SAT solvers.

Both experiments support the approach proposed in this paper: algorithm portfolios can be used to design efficient SAT solvers that are better than those obtained with usual approaches. We hope that the reader is convinced by this proof of concept. However, as the previous results are limited to focused examples and are based on simulations, an effective implementation evaluated in a more systematic and larger campaign would be important to consolidate them.

5 Discussion

With DRSSP, the parallel execution is decided on the basis of a statistical model that is contextualized to the execution environment in which the algorithms are run. Thus, while classical parallel processing models only focus on the way the concurrency is expressed (threads, processes, fork-join, SPMD, etc.), DRSSP goes further by introducing an optimization model that defines the optimal parallel execution. Such a model has several advantages, in particular for users, who no longer have to choose the most adequate algorithm for solving their instances (optimally or not). Another advantage is the flexibility of the objectives: for instance, we can optimize the parallel execution for energy consumption by redefining DRSSP with this new target.

The experimental results obtained on the two SAT case studies confirm the interest of building algorithm portfolios. They not only provide a concrete application for the MISD class, but they also open new research directions for the design of parallel algorithms. One of the most important directions consists in building a library for automating the design of algorithm portfolios according to the theoretical models discussed in this paper. In our view, such a library could be based on a generative programming model similar to the one used in the implementation of remote procedure calls [15]. First, the user describes the input of the portfolio of algorithms to be constructed, according to a language model proposed by the library. Then, an optimization engine (included in the library) generates the optimal DRSM and returns it to the user. Finally, the user launches the generated program. The importance of this research direction is that it can lead to an implementation with an impact comparable to the one that PVM/MPI had in the promotion of the SPMD model.
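Purely as an illustration of this workflow (none of the names below correspond to an existing library), a front-end could look like the following sketch, with the optimization engine stubbed by a naive placeholder:

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class PortfolioSpec:
    """Hypothetical user-level description of the portfolio to generate."""
    algorithms: List[str]          # names of the available solvers
    units: int                     # number of computing units (mu)
    costs: Dict[str, List[float]]  # measured runtimes per solver on a benchmark

def generate_drsm(spec: PortfolioSpec) -> Dict[str, int]:
    """Stub optimization engine: a placeholder for a real DRSSP solver, it
    simply gives one unit to each of the `units` solvers with the best
    average runtime on the benchmark."""
    avg = {a: sum(c) / len(c) for a, c in spec.costs.items()}
    chosen = sorted(spec.algorithms, key=avg.get)[: spec.units]
    return {a: int(a in chosen) for a in spec.algorithms}
```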

To end this discussion, let us come back to Flynn’s classification. A contribution of this paper was to show that, when the control of parallelism is considered at the application layer, the (extended) MISD class is efficient for the resolution of hard combinatorial problems and can even outperform parallel programs built upon the other Flynn classes. This efficiency was demonstrated through the DRSM model. An important question is then how to translate this model (DRSM) to the operating system and hardware levels. We will not discuss what can be done at the hardware level, but at the operating system level we believe that it makes sense to introduce a new type of process group [3] that supports time and resource sharing and is aware of the concept of portfolio. Roughly speaking, in an operating system, a process group refers to a collection of one or several processes. In a DRSM-aware process group, one could balance the time slots allocated to each process according to a resource sharing specified at the user level. The automatic interruption of all processes of the group is initiated as soon as one process finds a solution. The execution of such a process group must also isolate the different processes from one another as much as possible; this is important to guarantee that the resource allocation is respected. Finally, the group mechanism can also be enriched to handle various halting conditions.
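As a rough illustration of what such a DRSM-aware process group could look like with today's POSIX primitives, the following Unix-only sketch places hypothetical solver command lines in a single process group and signals the whole group as soon as the first member terminates; a real implementation would of course live in the kernel or in a system library:

```python
import os
import signal

def run_solver_group(commands):
    """Launch each (hypothetical) solver command line in a child process,
    place all children in one process group, and signal the whole group as
    soon as the first member terminates (Unix only)."""
    pids, leader = [], None
    for argv in commands:
        pid = os.fork()
        if pid == 0:                              # child
            try:
                os.setpgid(0, leader or 0)        # join (or create) the group
            except OSError:
                pass
            os.execvp(argv[0], argv)              # become the solver process
        try:
            os.setpgid(pid, leader or pid)        # parent sets it too, avoiding a race
        except OSError:
            pass
        leader = leader or pid
        pids.append(pid)

    first_pid, status = os.wait()                 # first solver to answer
    os.killpg(leader, signal.SIGTERM)             # interrupt the rest of the group
    for pid in pids:                              # reap the interrupted children
        if pid != first_pid:
            try:
                os.waitpid(pid, 0)
            except ChildProcessError:
                pass
    return first_pid, status
```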

6 Conclusion

The end of Moore’s law is a great opportunity for the renewal of “parallel thinking” and the design of parallel systems. The thesis of this paper is that, historically, a model of parallelism (MISD) has been neglected and deserves to be invested in; in particular, we propose an extended MISD model where parallelism is formulated as a cooperation of concurrent algorithms solving the same problem. The proposed concurrency model is associated with an optimization model that defines optimal parallel executions. This paper showed how efficient parallel algorithms can be built according to this model at the application layer. The portfolio approach is easily accessible and allows one to introduce fundamental concepts of parallelism such as concurrency and synchronization. As shown in this paper, the approach sheds new light on Flynn’s classification and on the formulation of optimal parallel algorithms. For these reasons, we believe that the notion deserves to be taught in undergraduate classes on concurrent programming, synchronization, and models of parallelism.