Parallelization issues of a code for physically-based simulation of fabrics

https://doi.org/10.1016/j.cpc.2004.07.003

Abstract

The simulation of fabrics, clothes, and flexible materials is an essential topic in computer animation of realistic virtual humans and dynamic scenes. Emerging technologies, such as interactive digital TV and multimedia products, require powerful tools for real-time simulation. Parallelism is one such tool. A computational analysis of fabric simulation codes shows that they belong to the complex class of irregular applications. Such codes frequently include reduction operations in their core, so that a significant fraction of the computation time is spent on them. In fabric simulators these operations appear when evaluating forces, which gives rise to the equation system to be solved. For this reason, this paper discusses only this phase of the simulation. It analyzes and evaluates different irregular reduction parallelization techniques on ccNUMA shared memory machines, applied to a real, physically-based fabric simulator we have developed. Several issues are taken into account in order to achieve high code performance, such as exploitation of data access locality and parallelism, as well as careful use of memory resources (memory overhead). We use the concept of data affinity to develop various efficient algorithms for reduction parallelization that exploit data locality.

Introduction

Fabric and flexible material simulation is an essential topic in computer animation of realistic virtual humans, dynamic scenes and computer games, among others. Emerging technologies, such as interactive digital TV and multimedia products, require powerful tools for (near) real-time simulation. To reach real time, each simulator stage should be optimized and executed on a high-performance platform, which frequently includes multiple processors. In that case, optimization implies parallelization as an effective tool for real-time execution.

Codes resulting from fabric simulators that use a physical model typically belong to the class of irregular applications, as computations are organized using complex data structures and data access patterns are unknown until runtime. This makes such codes very difficult to optimize, and in particular to parallelize.

In addition to their irregular nature, codes for this kind of application commonly spend a significant portion of their execution time in reduction operations. These are accumulative operations based on commutative and associative operators (described in detail in Section 3.1). In our simulation codes, these operations correspond to the computation of physical magnitudes, such as forces acting on fabric particles. This process determines the vectors and matrix coefficients needed to solve the differential equations that model the fabric behavior.
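As a minimal illustration of such an accumulative operation, the sketch below sums one-dimensional spring forces into a per-particle array. It is not taken from the simulator: the function accumulate_spring_forces, the arguments springs, rest_len, and stiffness, and the simple Hooke force law are all assumptions chosen to keep the example small.

```python
# Hypothetical 1-D mass-spring example; all names here are illustrative
# assumptions, not identifiers from the fabric simulator.
def accumulate_spring_forces(positions, springs, rest_len, stiffness):
    """Sum Hooke spring forces into a per-particle force array."""
    forces = [0.0] * len(positions)
    for (i, j), length0 in zip(springs, rest_len):
        stretch = (positions[j] - positions[i]) - length0
        f = stiffness * stretch
        # Reduction: '+' is commutative and associative, and the entries
        # written (i and j) are known only once the spring list is read.
        forces[i] += f
        forces[j] -= f
    return forces


positions = [0.0, 1.5, 2.0]
springs = [(0, 1), (1, 2)]      # each spring connects two particle indices
rest_len = [1.0, 1.0]
forces = accumulate_spring_forces(positions, springs, rest_len, stiffness=2.0)
print(forces)
```

Because several springs may update the same particle, running the loop iterations concurrently without care would produce conflicting writes to the force array.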

This paper discusses the parallelization of a physically-based fabric simulator that we have developed, focusing on those procedures that carry out irregular computations, particularly, irregular reductions. We have chosen a ccNUMA shared memory architecture as the target platform. Several issues should be taken into account in order to achieve high code performance. These factors include, in addition to parallelism, exploitation of data locality as well as careful use of memory resources (memory overhead).

We analyze the computational structure of the selected procedures of the simulator and, consequently, adapt existing irregular reduction parallelization techniques in order to obtain the maximum efficiency from the code. Among the analyzed techniques some of our own proposals are included. These proposals are derived from the concept of data write affinity that we have developed to design efficient irregular reduction parallelization methods exploiting data locality. All the studied techniques were implemented and experimentally tested, in order to obtain comparative data about efficiency and overheads.

The rest of the paper is organized as follows. Section 2 introduces the physical foundations of our fabric simulator and analyzes the characteristics of its force computation loops. Section 3 describes the irregular computational structure of the force loops, focusing on locality and reduction operations. Section 4 discusses the different parallelization techniques applicable to the reduction loops in the fabric simulator and their effects on performance. It also introduces the write affinity concept for parallelizing irregular reductions, which exploits memory access locality. Section 5 provides an experimental evaluation of the discussed techniques. Finally, Section 6 concludes the paper.

Section snippets

Overview of the fabric simulation problem

In a physical approach, fabrics and other non-rigid objects are usually represented by interacting discrete components (finite elements, springs-masses, patches) each one numerically modeled by an ordinary differential equation, as $\ddot{x} = M^{-1} f(x, \dot{x})$, $\frac{d}{dt}\begin{pmatrix} x \\ v \end{pmatrix} = \begin{pmatrix} v \\ M^{-1} f(x, v) \end{pmatrix}$. In most physically-based formulations, equations contain non-linear components, that are linearized generating a linear system of algebraic equations where positions and velocities of the masses are the unknowns. Such positions and
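The first-order form of the equation of motion can be evaluated as in the following minimal sketch. It assumes a diagonal (lumped) mass matrix stored as a list of inverse masses; the names state_derivative, inv_mass, and damped_spring, and the force law itself, are hypothetical and not taken from the simulator.

```python
def state_derivative(x, v, inv_mass, force):
    """Return d/dt of the state (x, v), i.e. (v, M^{-1} f(x, v)).

    Assumption: M is diagonal, so M^{-1} is applied entry by entry.
    """
    f = force(x, v)
    return list(v), [w * fi for w, fi in zip(inv_mass, f)]


def damped_spring(x, v):
    # Hypothetical force law, used only to exercise the function.
    return [-4.0 * xi - 0.5 * vi for xi, vi in zip(x, v)]


dx, dv = state_derivative([1.0], [2.0], [0.5], damped_spring)
print(dx, dv)
```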

Irregular properties of the simulator code

The aforementioned force computation procedure belongs to the class of codes known as irregular. This property is common in data organizations used in numerical applications derived from models based on irregular domain discretization methods. Irregular codes are characterized by their memory access patterns: data are not accessed directly but through indirections. Approaching the optimization, like parallelization, of irregular

Irregular reduction parallelization

Due to its importance, the parallelization of reduction operations has received considerable effort since the first multiprocessors appeared. In the context of shared memory machines, the reduction parallelization methods found in the literature can be classified into two broad categories. The first one is based on the privatization of the reduction arrays and focuses only on how to partition the iteration space among the cooperating threads. The second category is based on the
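The privatization-based category can be sketched as follows. This is a generic illustration rather than the implementation evaluated in the paper: the function reduce_privatized, the block partition of the iteration space, and the input arrays springs and contrib are assumptions made for the example.

```python
from concurrent.futures import ThreadPoolExecutor


def reduce_privatized(num_nodes, springs, contrib, num_threads=2):
    """Irregular reduction with one private reduction array per thread."""
    # Block partition of the iteration space (the list of springs).
    chunk = (len(springs) + num_threads - 1) // num_threads
    blocks = [range(t * chunk, min((t + 1) * chunk, len(springs)))
              for t in range(num_threads)]

    def work(block):
        private = [0.0] * num_nodes   # private copy: no write conflicts
        for k in block:
            i, j = springs[k]         # indirection into the reduction array
            private[i] += contrib[k]
            private[j] -= contrib[k]
        return private

    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        partials = list(pool.map(work, blocks))

    # Final merge: add the private copies element by element.
    return [sum(col) for col in zip(*partials)]


springs = [(0, 1), (1, 2), (2, 3), (0, 3)]
contrib = [1.0, 2.0, 3.0, 4.0]
total = reduce_privatized(4, springs, contrib)
print(total)
```

The sketch makes the cost of this approach visible: memory grows with the number of threads, since each one holds a full copy of the reduction array, and the final merge adds extra work. In CPython the threads mainly illustrate the structure, because the global interpreter lock prevents a real speedup for this kind of loop.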

Experimental evaluation

The following experimental results have been obtained from the parallelization of the force computation loop of the discussed fabric simulator. This loop corresponds to the code shown in Fig. 4, which is the body of the evaluateForces() procedure (see Fig. 2). In such a reduction loop the system matrix A and the right-hand side vector b of the equation system to solve are built. Input data is obtained from the discretization of a piece of fabric yielding a total amount of 218 272 nodes, 653 100

Conclusions

Among the available techniques for optimizing applications that require (near) real-time execution, parallelism is a good candidate. Fabric simulation is one of these cases, as interactive and multimedia products (dynamic virtual scenes, computer games, …) impose tight time constraints.

The computational core of our fabric simulator contains irregular reductions, in which a significant portion of the complete execution time is spent. This situation leads us to search for efficient solutions to parallelize


This work was supported by the Ministry of Education and Culture (CICYT), Spain, through grant TIC2003-06623.
