Parallelization issues of a code for physically-based simulation of fabrics☆
Introduction
Fabric and flexible material simulation is an essential topic in computer animation of realistic virtual humans, dynamic sceneries and computer games, among other applications. Emerging technologies, such as interactive digital TV and multimedia products, make it necessary to develop powerful tools capable of (near) real-time simulation. To reach real time, each stage of the simulator should be optimized and executed on a high-performance platform, which frequently includes multiple processors. In such cases, optimization implies parallelization as an effective tool for real-time execution.
Codes resulting from physically-based fabric simulators typically belong to the class of irregular applications: computations are organized around complex data structures, and data access patterns are unknown until runtime. This fact makes such codes difficult to optimize, and in particular difficult to parallelize.
In addition to their irregular nature, codes for this kind of application commonly spend a significant portion of their execution time in reduction operations. These are accumulative operations based on commutative and associative operators (described in detail in Section 3.1). In our simulation code, these operations correspond to the computation of physical magnitudes, such as the forces acting on fabric particles. This process determines the vector and matrix coefficients needed to solve the differential equations that model the fabric behavior.
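The kind of loop described above can be illustrated with a minimal sketch (the function and variable names are hypothetical, not taken from the simulator): forces are accumulated onto a particle force array through an indirection (edge) array, so the write pattern is unknown until runtime.

```python
import numpy as np

def accumulate_forces(num_particles, springs, spring_force):
    """Irregular reduction sketch: springs is a list of (i, j) particle
    index pairs; spring_force(s) returns the contribution of spring s."""
    f = np.zeros(num_particles)
    for s, (i, j) in enumerate(springs):
        fs = spring_force(s)
        f[i] += fs  # reduction writes go through an indirection array,
        f[j] -= fs  # so the compiler cannot prove loop independence
    return f

# Three particles connected by three springs, unit contributions.
f = accumulate_forces(3, [(0, 1), (1, 2), (0, 2)], lambda s: 1.0)
```

Because `+=` is commutative and associative, the iterations may be reordered or distributed among threads, provided concurrent updates to the same entry of `f` are handled correctly; this is exactly what the parallelization techniques discussed later address.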
This paper discusses the parallelization of a physically-based fabric simulator that we have developed, focusing on those procedures that carry out irregular computations, in particular irregular reductions. We have chosen a ccNUMA shared memory architecture as the target platform. Several issues must be taken into account to achieve high code performance: in addition to parallelism, these include the exploitation of data locality and the careful use of memory resources (memory overhead).
We analyze the computational structure of the selected procedures of the simulator and, accordingly, adapt existing irregular reduction parallelization techniques to obtain the maximum efficiency from the code. The analyzed techniques include some of our own proposals, derived from the concept of data write affinity, which we developed to design efficient irregular reduction parallelization methods that exploit data locality. All the studied techniques were implemented and experimentally tested in order to obtain comparative data on efficiency and overheads.
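The write-affinity idea can be sketched as follows (an illustrative reconstruction, not the authors' exact algorithm): the reduction array is block-partitioned among threads, and each thread traverses the iterations but commits only the writes that fall in its own block, so no synchronization on the reduction array is needed and each thread writes to a contiguous, cache-friendly region.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def write_affinity_reduction(n, edges, contrib, num_threads=4):
    """Owner-computes sketch: thread t owns f[t*block : (t+1)*block] and
    commits only the updates landing in that range."""
    f = np.zeros(n)
    block = (n + num_threads - 1) // num_threads

    def worker(t):
        lo, hi = t * block, min((t + 1) * block, n)
        for e, (i, j) in enumerate(edges):
            c = contrib(e)
            if lo <= i < hi:
                f[i] += c  # owned write: race-free by construction
            if lo <= j < hi:
                f[j] -= c

    with ThreadPoolExecutor(num_threads) as ex:
        list(ex.map(worker, range(num_threads)))
    return f
```

The price of this scheme is replicated iteration work: an edge whose endpoints belong to different owners is visited by both threads. Refinements (such as reordering iterations by owner) reduce that overhead, which is part of what the techniques studied in the paper compare.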
The rest of the paper is organized as follows. Section 2 introduces the physical foundations of our fabric simulator and analyzes the characteristics of its force computation loops. Section 3 describes the irregular computational structure of the force loops, focusing on locality and reduction operations. Section 4 discusses the different parallelization techniques applicable to the reduction loops in the fabric simulator and their effects on performance; it also introduces the write affinity concept for parallelizing irregular reductions while exploiting memory access locality. Section 5 provides an experimental evaluation of the discussed techniques. Finally, Section 6 concludes the paper.
Section snippets
Overview of the fabric simulation problem
In a physical approach, fabrics and other non-rigid objects are usually represented by interacting discrete components (finite elements, spring–mass systems, patches), each one numerically modeled by an ordinary differential equation. In most physically-based formulations, the equations contain non-linear components, which are linearized to generate a linear system of algebraic equations in which the positions and velocities of the masses are the unknowns. Such positions and
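The display equation referenced by this snippet is not reproduced here; as an assumption, a common formulation of this kind (in the style of Baraff and Witkin's implicit cloth integration scheme) writes Newton's law per particle system and discretizes it implicitly, yielding the linear system mentioned above:

```latex
% Assumed standard form, not reproduced from the paper:
\ddot{\mathbf{x}} = \mathbf{M}^{-1}\,\mathbf{f}(\mathbf{x}, \dot{\mathbf{x}})
% A backward Euler step of size h linearizes f around the current state,
% giving a linear system in the velocity update \Delta\mathbf{v}:
\Bigl(\mathbf{M} - h\,\frac{\partial \mathbf{f}}{\partial \dot{\mathbf{x}}}
      - h^{2}\,\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\Bigr)\,\Delta\mathbf{v}
  = h\Bigl(\mathbf{f}_{0} + h\,\frac{\partial \mathbf{f}}{\partial \mathbf{x}}\,\mathbf{v}_{0}\Bigr)
```

Here M is the mass matrix and f the force vector; assembling the system matrix and right-hand side from per-element force contributions is precisely the irregular reduction analyzed in the paper.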
Irregular properties of the simulator code
The aforementioned force computation procedure belongs to the class of codes known as irregular. This property is commonly found in many data organizations used in numerical applications derived from models based on irregular domain discretization methods. Irregular codes are characterized by the way their memory accesses are carried out: data are not accessed directly but through indirections. Approaching the optimization, like parallelization, of irregular
Irregular reduction parallelization
Owing to its importance, reduction parallelization has been a field of intense effort since the first multiprocessors appeared. In the context of shared memory machines, the reduction parallelization methods found in the literature can be classified into two broad categories. The first is based on the privatization of the reduction arrays and focuses only on how to partition the iteration space among the cooperating threads. The second category is based on the
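The privatization-based category can be sketched as follows (an illustrative example, not code from the paper): each thread accumulates into a private copy of the reduction array, and the copies are combined afterwards, trading memory overhead and a final combination pass for synchronization-free updates.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def privatized_reduction(n, edges, contrib, num_threads=4):
    """Each thread gets a private reduction array and a chunk of the
    iteration space; results are merged in a final cross-thread pass."""
    privates = [np.zeros(n) for _ in range(num_threads)]
    chunk = (len(edges) + num_threads - 1) // num_threads

    def worker(t):
        p = privates[t]  # private copy: no races, no locks
        for e in range(t * chunk, min((t + 1) * chunk, len(edges))):
            i, j = edges[e]
            c = contrib(e)
            p[i] += c
            p[j] -= c

    with ThreadPoolExecutor(num_threads) as ex:
        list(ex.map(worker, range(num_threads)))
    # Combination pass: O(num_threads * n) extra time and memory, which is
    # the main overhead of this category on large reduction arrays.
    return np.sum(privates, axis=0)
```

Note that the memory overhead grows with both the number of threads and the reduction array size, which is why this family of methods can scale poorly on ccNUMA machines with large arrays.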
Experimental evaluation
The following experimental results were obtained from the parallelization of the force computation loop of the discussed fabric simulator. This loop corresponds to the code shown in Fig. 4, which is the body of the evaluateForces() procedure (see Fig. 2). In this reduction loop, the system matrix A and the right-hand side vector b of the equation system to be solved are built. Input data are obtained from the discretization of a piece of fabric, yielding a total of 218 272 nodes, 653 100
Conclusions
Among the available techniques for optimizing applications that require (near) real-time performance, parallelism is a good candidate. Fabric simulation is one such case, as interactive and multimedia products (dynamic virtual sceneries, computer games, …) impose tight time constraints.
The computational core of our fabric simulator contains irregular reductions, in which a significant portion of the total execution time is spent. This situation led us to search for efficient solutions to parallelize
☆ This work was supported by the Ministry of Education and Culture (CICYT), Spain, through grant TIC2003-06623.