Differential coding scheme for efficient parallel image composition on a PC cluster system
Introduction
Computer graphics techniques, including volume rendering, have become major tools for visualizing and exploring vast 3D numerical data [1], [2], [3], [4], [5]. However, their computational cost has limited their practical use on huge data sets. Parallel processing on a distributed-memory parallel computer is one solution that scales with data size, and several parallel rendering algorithms have been proposed to accelerate the rendering of 3D data [6], [7], [8], [9], [10], [11], [12].
A useful taxonomy for parallel rendering algorithms was proposed by Molnar et al. [13], [14]. In a typical rendering process, both geometric processing and the subsequent rasterization processing can be parallelized. Primitives in 3D data are sorted from object-space to image-space at some point: before the geometric processing, after the geometric processing but before the rasterization processing, or after the rasterization processing. Based on where primitives are sorted, the taxonomy classifies parallel rendering algorithms into three categories: sort-first, sort-middle and sort-last algorithms.
For distributed-memory parallel computers, sort-last algorithms are superior to the others in terms of parallel processing granularity and load balancing. In a sort-last algorithm, a 3D data set is divided into subsets, which are rendered independently in parallel to obtain local images. The processing elements (PEs) of the parallel computer then merge their local images into the final image by communicating with each other. This merging process is generally referred to as image composition. In parallel rendering, image composition is carried out with an efficient composition algorithm, e.g. the binary-swap algorithm [9], [10], [11], [12], [15], which is the most efficient in terms of PE and network utilization. Since each PE performs both geometry processing and rasterization processing on a subset of the 3D data, a large granularity of parallel processing is obtained. Moreover, since the division of 3D data can be adjusted independently of the sorting, which takes place during image composition, geometry and rasterization processing can be well balanced among PEs. These advantages result in high parallel processing efficiency.
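As a rough illustration of how binary-swap spreads the composition work, the sketch below tracks only which part of the image each PE is responsible for after every exchange stage. It assumes a power-of-two number of PEs, pairs PEs by flipping one bit of the rank per stage, and lets the lower-ranked PE of each pair keep the first half of the shared region. The function name and these pairing conventions are illustrative assumptions, not details taken from this paper.

```python
def binary_swap_regions(num_pes, num_pixels):
    """Return the (start, end) pixel range each PE owns after all stages.

    Hypothetical helper: only the bookkeeping of image regions is modelled,
    not the actual pixel exchange or blending.
    """
    if num_pes < 1 or num_pes & (num_pes - 1):
        raise ValueError("num_pes must be a power of two")
    regions = [(0, num_pixels)] * num_pes
    stage = 1
    while stage < num_pes:
        updated = []
        for rank in range(num_pes):
            start, end = regions[rank]
            mid = (start + end) // 2
            partner = rank ^ stage  # partner differs in exactly one rank bit
            # Assumed convention: the lower rank keeps the first half and
            # would send the second half to its partner (and vice versa).
            updated.append((start, mid) if rank < partner else (mid, end))
        regions = updated
        stage *= 2
    return regions


regions = binary_swap_regions(num_pes=4, num_pixels=8)
print(regions)  # [(0, 2), (4, 6), (2, 4), (6, 8)]
```

After log2(P) stages every PE holds a disjoint 1/P share of the image, so at each stage every PE sends and receives half of its current region. That is why the amount of data per exchange matters so much for the composition time.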
However, composition of local images is a bottleneck in sort-last parallel rendering. Although the computation time of the parallelized part decreases as the number of PEs increases, the composition time does not shrink accordingly because it is dominated by communication among PEs. The total rendering time therefore cannot fall below the composition time, which bounds the peak performance of parallel rendering and restricts its speedup. Accordingly, the composition time has to be reduced in order to achieve higher rendering performance.
One way to decrease the total composition time is to reduce the communication time for image-data transfer. However, in some cases it is difficult to reduce the communication time by improving the bandwidth of the interconnection network. A larger number of PEs requires a wider bandwidth, resulting in a more expensive interconnection network. A less expensive network with narrower bandwidth is preferable in most cases, including coarse-grain parallel computing environments such as grid computing. For parallel computing environments where bandwidth improvement is not promising, one approach to reducing the communication cost is data compression. If the transferred data can be compressed with only a little additional computation, the total communication time can be reduced.
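This trade-off can be written as a simple inequality. The symbols are introduced here for illustration only and do not come from the paper: $S$ is the uncompressed size of the transferred image data, $S_c$ the compressed size, $B$ the effective network bandwidth, and $T_{\mathrm{enc}}$, $T_{\mathrm{dec}}$ the encoding and decoding times. Under a simplified model that ignores latency and any overlap of computation with communication, compression pays off when

```latex
T_{\mathrm{enc}} + \frac{S_c}{B} + T_{\mathrm{dec}} < \frac{S}{B}
\quad\Longleftrightarrow\quad
T_{\mathrm{enc}} + T_{\mathrm{dec}} < \frac{S - S_c}{B}
```

so the narrower the bandwidth $B$, the more coding time can be tolerated for a given reduction in size.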
So far, several lossless compression algorithms have been proposed to encode the images communicated during parallel composition by exploiting their coherence [10], [16], [17], [18], [19]. One simple approach is the bounding rectangle [20]. In computer graphics techniques such as volume rendering, non-transparent objects are projected onto an image as opaque pixels. We refer to a minimal area containing all such opaque pixels as a projection. Pixels other than the opaque pixels of projections remain transparent (blank). Since blank pixels in local images are unnecessary for composition, they do not have to be communicated; the bounding-rectangle method eliminates the blank pixels outside the rectangle that tightly encloses the projections in an image. Ma et al. [10], Lee et al. [19] and Sano et al. [12] applied this method to parallel image composition. The bounding rectangle has a small overhead, requiring only the coordinates of the top-left and bottom-right corners of the rectangle, and is effective in removing blank pixels around clustered projections. However, it does not work well when projections are spread all over an image, even if the image contains many blank pixels.
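The behaviour described above can be reproduced with a minimal sketch. It assumes an image stored as a list of rows of opacity values, where 0 means blank; the function names and the two toy images are hypothetical and chosen only to contrast a clustered projection with a scattered one.

```python
def bounding_rectangle(alpha):
    """Return (top, left, bottom, right), inclusive, or None if all blank."""
    rows = [y for y, row in enumerate(alpha) if any(a > 0 for a in row)]
    if not rows:
        return None
    cols = [x for row in alpha for x, a in enumerate(row) if a > 0]
    return rows[0], min(cols), rows[-1], max(cols)


def crop(image, rect):
    """Keep only the pixels inside the inclusive rectangle."""
    top, left, bottom, right = rect
    return [row[left:right + 1] for row in image[top:bottom + 1]]


# Clustered projection: a 2x2 block of non-blank pixels in a 6x6 image.
clustered = [[0] * 6 for _ in range(6)]
for y in (2, 3):
    for x in (1, 2):
        clustered[y][x] = 1

# Scattered projection: two non-blank pixels in opposite corners.
scattered = [[0] * 6 for _ in range(6)]
scattered[0][0] = 1
scattered[5][5] = 1

for name, img in (("clustered", clustered), ("scattered", scattered)):
    rect = bounding_rectangle(img)
    kept = sum(len(row) for row in crop(img, rect))
    print(name, rect, kept, "of 36 pixels kept")
# clustered (2, 1, 3, 2) 4 of 36 pixels kept
# scattered (0, 0, 5, 5) 36 of 36 pixels kept
```

The scattered image has only two non-blank pixels, yet the rectangle removes nothing, which is the weakness noted in the text.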
Ahrens et al. applied run-length (RL) encoding [20] to parallel image composition [16], which is effective for sparse projections containing many blank pixels. In RL encoding, successive pixels with the same value are regarded as a run, and each run is encoded as a pair of its value and its length. Local images are first RL-encoded. Then, after they are transferred among PEs, image composition is performed efficiently while the images are being decoded. RL encoding not only compresses the data but also allows the redundant composition of blank pixels, which contribute nothing to the final image, to be skipped. This approach reduced the total time by decreasing both the communication and the composition time for sparse projections. However, it has a serious disadvantage: the encoded data can be larger than the original data when the average run length in local images is small. This is likely to occur when complicated 3D objects are densely projected onto an image, as in volume rendering.
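Both the benefit and the expansion problem can be seen in a small sketch of generic run-length encoding. This is a textbook formulation with hypothetical function names and toy data, not the exact data layout used by Ahrens et al.; each run is stored as a (value, length) pair, so every run costs two numbers.

```python
def rl_encode(pixels):
    """Encode a pixel sequence as a list of (value, run_length) pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1
        else:
            runs.append([p, 1])
    return [(value, length) for value, length in runs]


def rl_decode(runs):
    """Expand (value, run_length) pairs back into the pixel sequence."""
    pixels = []
    for value, length in runs:
        pixels.extend([value] * length)
    return pixels


# Sparse scanline: long blank runs, so three runs replace 20 pixels.
sparse = [0] * 12 + [7, 7] + [0] * 6
# Dense scanline: neighbouring values differ, so every run has length 1.
dense = [3, 5, 4, 6, 5, 7, 6, 8]

for name, line in (("sparse", sparse), ("dense", dense)):
    runs = rl_encode(line)
    assert rl_decode(runs) == line  # lossless round trip
    print(name, len(line), "pixels ->", 2 * len(runs), "stored numbers")
# sparse 20 pixels -> 6 stored numbers
# dense 8 pixels -> 16 stored numbers
```

The dense scanline doubles in size, which is the situation the text associates with densely projected volume-rendered images.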
Yang et al. [17], [18] introduced both a bounding rectangle and RL encoding of blank pixels into volume rendering with binary-swap image composition. Their method is designed for sparse projections. Their RL encoding encodes only blank pixels instead of encoding both blank and non-blank pixels. We refer to Yang's RL encoding as binary run-length (BRL) encoding to distinguish it from that of Ahrens et al. After the blank pixels outside a bounding rectangle are eliminated, the blank pixels within the rectangle are encoded by the BRL encoding. As a result, only opaque pixels are communicated and processed for composition.
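One way to picture the idea is the sketch below, which separates a scanline into alternating blank/non-blank run lengths and a payload holding the non-blank pixel values unchanged. The concrete layout (starting with a possibly empty blank run, one payload list per scanline) and the function names are assumptions made for illustration; the source does not give Yang et al.'s exact format.

```python
def brl_encode(pixels, blank=0):
    """Return (run_lengths, payload).

    run_lengths alternates blank, non-blank, blank, ... and always starts
    with a blank run, which may have length 0. payload holds the non-blank
    pixel values in scan order, uncompressed.
    """
    lengths, payload = [], []
    expecting_blank, count = True, 0
    for p in pixels:
        is_blank = (p == blank)
        if is_blank != expecting_blank:
            lengths.append(count)  # close the current run
            count = 0
            expecting_blank = is_blank
        count += 1
        if not is_blank:
            payload.append(p)
    lengths.append(count)
    return lengths, payload


def brl_decode(lengths, payload, blank=0):
    """Rebuild the scanline from run lengths and the non-blank payload."""
    pixels, values, is_blank = [], iter(payload), True
    for n in lengths:
        if is_blank:
            pixels.extend([blank] * n)
        else:
            pixels.extend(next(values) for _ in range(n))
        is_blank = not is_blank
    return pixels


scanline = [0, 0, 5, 6, 0, 7]
lengths, payload = brl_encode(scanline)
print(lengths, payload)  # [2, 2, 1, 1] [5, 6, 7]
assert brl_decode(lengths, payload) == scanline
```

Because non-blank values are carried verbatim, a run of differing non-blank pixels costs one length entry rather than one entry per pixel. The payload itself, however, is not compressed at all.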
The BRL encoding achieves further compression than a bounding rectangle alone. In addition, the encoded data are very unlikely to be larger than the original data, because the method encodes only runs of blank pixels, which tend to be much longer than runs of non-blank pixels. However, Yang's method does not work well for non-sparse projections. For dense projections, it cannot compress images sufficiently because there are fewer blank pixels, and with insufficient compression the composition time increases because the encoding time cannot be amortized. Furthermore, the authors did not apply a load balancing scheme to the combination of the BRL encoding with a bounding rectangle, despite their assertion that the non-blank pixels exchanged in the binary-swap composition should be balanced. Even if images are sufficiently compressed, an uneven distribution of encoded data limits the reduction in communication time. In addition, their work contains no evaluation for different viewing directions, which strongly influence the performance of parallel image composition.
In this paper, we present a new lossless compression scheme for image composition to achieve faster parallel rendering on distributed-memory parallel computers, such as recent PC clusters consisting of high-performance microprocessors and a relatively slow commodity interconnection network. To compress the data of non-blank pixels, we introduce differential coding in addition to Yang's method. The proposed coding scheme is designed on the basis of the statistical characteristics of the differences in value between adjacent pixels. Moreover, we apply a load balancing scheme that interleaves scanlines to even out the amount of data exchanged in the binary-swap image composition. We demonstrate the approach with an implementation on a PC cluster system.
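The intuition behind coding differences between adjacent pixels can be illustrated with a small sketch. It shows only the differencing step on 8-bit values, with wrap-around arithmetic so the transform stays lossless. The actual codeword assignment of the proposed scheme is not given in this excerpt, so none is attempted here; the function names and the sample scanline are hypothetical.

```python
from collections import Counter


def delta_encode(values):
    """Replace each 8-bit value by its difference from the previous one."""
    deltas, prev = [], 0
    for v in values:
        deltas.append((v - prev) % 256)  # wrap so each delta fits in a byte
        prev = v
    return deltas


def delta_decode(deltas):
    """Invert delta_encode by accumulating the differences modulo 256."""
    values, prev = [], 0
    for d in deltas:
        prev = (prev + d) % 256
        values.append(prev)
    return values


# A smoothly varying run of non-blank pixel values (illustrative data).
scanline = [100, 101, 103, 104, 104, 105, 107, 108]
deltas = delta_encode(scanline)
assert delta_decode(deltas) == scanline

print(deltas)                # [100, 1, 2, 1, 0, 1, 2, 1]
print(Counter(deltas[1:]))   # Counter({1: 4, 2: 2, 0: 1})
```

The raw values are almost all distinct, whereas the differences after the first pixel take only three small values. A skewed distribution of this kind is what a statistically designed code can exploit with short codewords.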
This paper is organized as follows. Section 2 describes sort-last parallel rendering and the binary-swap image composition algorithm. Section 3 clarifies our motivation by describing related work in detail. In Section 4, we introduce the proposed differential coding scheme with load balancing after showing the coherence of differential pixel values in volume rendering images. Section 5 discusses the performance improvement achieved by the proposed scheme on the basis of experimental results on a PC cluster system. We give concluding remarks and describe future work in Section 6.
Section snippets
Sort-last parallel rendering
According to Molnar's taxonomy [13], [14], which is based on where primitives are sorted from object-space to image-space, parallel rendering algorithms can be classified into three categories: sort-first, sort-middle and sort-last algorithms. Fig. 1 depicts an overview of a sort-last parallel rendering algorithm. In sort-last parallel rendering, the object space is divided into subspaces in pre-processing. These subspaces are rendered in parallel to generate local images containing contributions of the
Conventional coding schemes for parallel image composition
So far, several lossless image coding schemes have been proposed to reduce the size of communicated data in parallel image composition [10], [16], [17], [18], [19]. To decrease the total time for parallel image composition by data compression, the coding time must be short enough to be amortized by the reduction in communication time. Consequently, coding schemes that exploit the coherence of images with only a little additional computation are required. In the following subsections, we explain some of
Differential image coding scheme with load balancing
To compress images without much additional computation, we have to make good use of the coherence in the local images of parallel rendering. Furthermore, a load balancing method should be introduced to even out the amounts of exchanged data, which become unbalanced as a result of compression. In this section, we present a coding scheme to compress foreground pixels, which cannot be compressed by the BRL encoding. Then, we describe a load balancing method combined with the proposed coding scheme.
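The effect of interleaving scanlines on the balance of exchanged data can be sketched as follows. The example compares a contiguous top/bottom split with an alternating even/odd split of an image whose foreground sits entirely in its upper half. The even/odd pattern, the function names and the toy image are assumptions for illustration, since the exact interleaving used by the scheme is not given in this excerpt.

```python
def split_contiguous(scanlines):
    """Split into top half and bottom half."""
    mid = len(scanlines) // 2
    return scanlines[:mid], scanlines[mid:]


def split_interleaved(scanlines):
    """Split into even-numbered and odd-numbered scanlines."""
    return scanlines[0::2], scanlines[1::2]


def foreground_pixels(scanlines, blank=0):
    """Count non-blank pixels, a proxy for the data left after BRL coding."""
    return sum(1 for row in scanlines for p in row if p != blank)


# 8 scanlines of width 4: the upper four are foreground, the rest blank.
image = [[1] * 4 for _ in range(4)] + [[0] * 4 for _ in range(4)]

for name, split in (("contiguous", split_contiguous),
                    ("interleaved", split_interleaved)):
    first, second = split(image)
    print(name, foreground_pixels(first), foreground_pixels(second))
# contiguous 16 0
# interleaved 8 8
```

With the contiguous split, one PE of a pair would send all of the foreground data while its partner sends none. With the interleaved split, both halves carry the same amount after blank pixels are removed.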
Performance evaluation
To evaluate the effectiveness of the proposed coding scheme, we conducted parallel image composition experiments on a PC cluster with 8 PEs. Each PE of the PC cluster has a Pentium 4 processor running at 1.8 GHz and 512 MB of main memory, which is sufficient to store all of the data used in the experiments. The PEs are interconnected by a 100 Mbit Ethernet switching hub. The binary-swap composition with image coding schemes was implemented by using the C language and the MPI (message passing
Conclusions
In this paper, we have proposed an image coding scheme to reduce the time for parallel image composition. The proposed scheme exploits the statistical characteristics of differential pixel values in local images. In addition, we have combined the coding scheme with a load balancing method that interleaves scanlines. We have conducted experiments on a PC cluster system with 8 PEs. The experimental results showed that the proposed coding scheme with load balancing achieved much better
Acknowledgements
This research was partially supported by Grant-in-Aid for Young Scientists (B) KAKENHI (#13780183) and Grant-in-Aid for Scientific Research (B) KAKENHI (#14380131).
References (21)
- Display of surfaces from volume data, IEEE Computer Graphics and Applications (1988)
- et al., V-BUFFER: visible volume rendering, Computer Graphics (1988)
- et al., Volume rendering, Computer Graphics (1988)
- Footprint evaluation for volume rendering, Computer Graphics (1990)
- P. Lacroute, M. Levoy, Fast volume rendering using a shear-warp factorization of the viewing transformation, in:...
- et al., Data parallel volume-rendering algorithms for interactive visualization, The Visual Computer (1993)
- W.M. Hsu, Segmented ray casting for data parallel volume rendering, in: Proceedings of 1993 Parallel Rendering...
- Analysis of a parallel volume rendering system based on the shear-warp factorization, IEEE Transactions on Visualization and Computer Graphics (1996)
- K.-L. Ma, J.S. Painter, C.D. Hansen, M.F. Krogh, A data distributed, parallel algorithm for ray-traced volume...
- et al., Parallel volume rendering using binary-swap compositing, IEEE Computer Graphics and Applications (1994)