Differential coding scheme for efficient parallel image composition on a PC cluster system

doi:10.1016/j.parco.2003.11.001

Parallel Computing

Volume 30, Issue 2, February 2004, Pages 285-299

https://doi.org/10.1016/j.parco.2003.11.001

Abstract

Although sort-last parallel rendering is a promising approach to accelerating large-scale computer graphics applications that handle huge data sets, parallel image composition is a bottleneck that limits performance improvement. Several image coding schemes have been proposed to achieve fast image composition by compressing the communicated data. These schemes mainly encode blank pixels, that is, pixels of a rendered image onto which no object is projected. However, they cannot achieve sufficient compression when rendered images have few blank pixels. This paper presents an image coding scheme that reduces the communication time in parallel image composition through effective compression of non-blank pixels and load balancing. The coding scheme exploits the coherence of differential pixel values at the cost of a few additional computations, which do not spoil the reduction in communication time. Experiments on a PC cluster with eight processing elements connected by a 100 Mbit Ethernet switching hub show that the proposed coding scheme greatly improves the worst frame rate over all viewing parameters.

Introduction

Computer graphics techniques, including volume rendering, have been major tools for visualizing and exploring vast 3D numerical data [1], [2], [3], [4], [5]. However, their computational cost has limited their practical use for huge data sets. Parallel processing on a distributed-memory parallel computer is one solution that scales with data size. So far, several parallel rendering algorithms have been proposed to accelerate the rendering of 3D data [6], [7], [8], [9], [10], [11], [12].

A useful taxonomy for parallel rendering algorithms was proposed by Molnar et al. [13], [14]. In a typical rendering process, both geometric processing and the subsequent rasterization can be parallelized. Primitives in the 3D data are sorted from object space to image space at some point: before the geometric processing, after the geometric processing but before the rasterization, or after the rasterization. Based on where primitives are sorted, the taxonomy classifies parallel rendering algorithms into three categories: sort-first, sort-middle and sort-last algorithms.

For distributed-memory parallel computers, sort-last algorithms are superior to the others in terms of parallel processing granularity and load balancing. In a sort-last algorithm, a 3D data set is divided into subsets, which are rendered independently and in parallel to obtain local images. The processing elements (PEs) of the parallel computer then merge their local images into the final image by communicating with each other. This merging process is generally referred to as image composition. In parallel rendering, image composition is carried out with an efficient composition algorithm, e.g., the binary-swap algorithm [9], [10], [11], [12], [15], which is the most efficient in terms of PE and network utilization. Since each PE performs both geometry processing and rasterization on its subset of the 3D data, a large granularity of parallel processing is obtained. Moreover, since the division of the 3D data can be adjusted independently of the sorting, which takes place during image composition, geometry and rasterization processing can be well balanced among PEs. These advantages result in high parallel processing efficiency.
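To make the binary-swap idea concrete, the following Python sketch simulates it inside a single process. It is an illustration under stated assumptions, not the implementation used in this paper: the number of PEs is assumed to be a power of two, each pixel is a hypothetical `(depth, color)` pair, and the compositing operator simply keeps the nearer pixel (volume rendering would normally use an ordered blending operator instead). The function names `composite` and `binary_swap` are chosen for this example only.

```python
def composite(a, b):
    """Merge two equal-length pixel lists; each pixel is (depth, color).
    The nearer pixel (smaller depth) wins. Assumption: opaque surfaces."""
    return [p if p[0] <= q[0] else q for p, q in zip(a, b)]


def binary_swap(local_images):
    """Simulate binary-swap composition for P = 2**k PEs in one process.

    local_images[r] is the full-size local image of PE r (a flat pixel list).
    Returns one (start, pixels) pair per PE: the part of the final image
    that the PE owns after log2(P) exchange stages.
    """
    P = len(local_images)
    assert P > 0 and P & (P - 1) == 0, "P must be a power of two"
    n = len(local_images[0])
    regions = [(0, n)] * P              # every PE starts with the whole image
    images = [list(img) for img in local_images]
    stage = 1
    while stage < P:
        new_images, new_regions = [None] * P, [None] * P
        for r in range(P):
            partner = r ^ stage         # partner differs in one bit of the rank
            lo, hi = regions[r]         # partners always hold the same region
            mid = (lo + hi) // 2
            # The lower-ranked PE keeps the first half, the other the second;
            # the half that is not kept is what would be sent to the partner.
            keep_lo, keep_hi = (lo, mid) if r < partner else (mid, hi)
            mine = images[r][keep_lo:keep_hi]
            received = images[partner][keep_lo:keep_hi]
            img = list(images[r])
            img[keep_lo:keep_hi] = composite(mine, received)
            new_images[r], new_regions[r] = img, (keep_lo, keep_hi)
        images, regions = new_images, new_regions
        stage *= 2
    return [(lo, images[r][lo:hi]) for r, (lo, hi) in enumerate(regions)]
```

In each stage a PE exchanges half of its current region with one partner, so the amount of pixel data sent per PE halves from stage to stage, and every PE ends up owning a 1/P share of the final image.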

However, the composition of local images is a bottleneck in sort-last parallel rendering. Although the computational time of the parallel processing part decreases as the number of PEs increases, the composition time cannot be reduced in the same way because it involves communication among PEs. This means that the total rendering time cannot be less than the composition time, which therefore determines the peak performance of parallel rendering. In other words, image composition limits the speedup of parallel rendering. Accordingly, the composition time has to be reduced in order to achieve higher rendering performance.

One way to decrease the total composition time is to reduce the communication time for image-data transfer. However, in some cases it is difficult to reduce the communication time by improving the bandwidth of the interconnection network. A larger number of PEs requires a wider bandwidth, resulting in a more expensive interconnection network. A less expensive network with a narrower bandwidth is preferable in most cases, including coarse-grain parallel computing environments such as grid computing. For parallel computing environments where bandwidth improvement is not promising, one approach to reducing the communication cost is data compression. If the transferred data can be compressed with only a little additional computation, the total communication time can be reduced.
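The trade-off in the last sentence can be written down as a simple break-even check. The sketch below is a back-of-the-envelope model, not a measurement: it ignores latency and protocol overhead, and the parameter names (`ratio`, `coding_time_s`) as well as the image size used in the example are assumptions made for illustration. Only the 100 Mbit/s link speed corresponds to the network used in this paper.

```python
def transfer_time(size_bytes, bandwidth_bps):
    """Seconds needed to send size_bytes over a link of bandwidth_bps bits/s.
    Latency and protocol overhead are ignored in this simple model."""
    return size_bytes * 8 / bandwidth_bps


def compression_pays_off(size_bytes, bandwidth_bps, ratio, coding_time_s):
    """True if coding plus compressed transfer is faster than raw transfer.

    ratio         -- original size / compressed size (hypothetical parameter)
    coding_time_s -- total encoding and decoding time in seconds (hypothetical)
    """
    raw = transfer_time(size_bytes, bandwidth_bps)
    compressed = coding_time_s + transfer_time(size_bytes / ratio, bandwidth_bps)
    return compressed < raw


# Assumed example: a 512 x 512 image with 4 bytes per pixel on 100 Mbit/s Ethernet.
size = 512 * 512 * 4
print(transfer_time(size, 100e6))                     # about 0.084 s uncompressed
print(compression_pays_off(size, 100e6, 2.0, 0.010))  # True: 2x compression wins
print(compression_pays_off(size, 100e6, 1.1, 0.010))  # False: coding time dominates
```

Under these assumed numbers, 10 ms of coding is easily amortized by a 2:1 compression ratio but not by a ratio of 1.1:1, which is why both a good ratio and a cheap coder are needed.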

So far, several lossless compression algorithms have been proposed to encode the images communicated for parallel composition by exploiting their coherence [10], [16], [17], [18], [19]. One simple approach is a bounding rectangle [20]. In computer graphics applications such as volume rendering, non-transparent objects are projected onto an image as opaque pixels. We refer to a minimum area containing all such opaque pixels as a projection. The pixels outside the projections remain transparent (or blank). Since these blank pixels in local images are unnecessary for composition and therefore do not have to be communicated, the bounding rectangle eliminates the blank pixels outside the rectangle that tightly encloses the projections in an image. Ma et al. [10], Lee et al. [19] and Sano et al. [12] applied this method to parallel image composition. The bounding rectangle has a small overhead, requiring only a small amount of additional data to record the top-left and bottom-right corners of the rectangle, and is effective in removing blank pixels around clustered projections. However, it does not work well when projections are spread all over an image, even if the image contains many blank pixels.
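The bounding rectangle described above can be sketched in a few lines of Python. This is a minimal illustration, assuming an image stored as a list of rows and a hypothetical `blank` value of 0 for pixels without projection; it is not taken from any of the cited implementations.

```python
def bounding_rectangle(image, blank=0):
    """Return (top, left, bottom, right), all inclusive, of the tightest
    rectangle enclosing every non-blank pixel, or None for a blank image.
    image is a list of rows; 'blank' marks a pixel with no projection."""
    rows = [y for y, row in enumerate(image) if any(p != blank for p in row)]
    if not rows:
        return None
    cols = [x for row in image for x, p in enumerate(row) if p != blank]
    return rows[0], min(cols), rows[-1], max(cols)


def crop(image, rect):
    """Keep only the pixels inside rect; only these need to be communicated."""
    top, left, bottom, right = rect
    return [row[left:right + 1] for row in image[top:bottom + 1]]


clustered = [
    [0, 0, 0, 0, 0],
    [0, 0, 5, 0, 0],
    [0, 7, 0, 0, 0],
    [0, 0, 0, 0, 0],
]
scattered = [
    [1, 0, 0],
    [0, 0, 0],
    [0, 0, 2],
]
print(bounding_rectangle(clustered))  # (1, 1, 2, 2): 4 of 20 pixels remain
print(bounding_rectangle(scattered))  # (0, 0, 2, 2): nothing is removed
```

The second example shows the weakness noted above: two opaque pixels in opposite corners force the rectangle to cover the whole image, although seven of its nine pixels are blank.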

Ahrens et al. applied run-length (RL) encoding [20] to parallel image composition [16], which is effective for sparse projections containing many blank pixels. In RL encoding, successive pixels with the same value are regarded as a run, and each run is encoded as a pair of its value and its length. First, local images are encoded by RL encoding. Then, after they are transferred among PEs, image composition is performed efficiently while the images are decoded. RL encoding of images not only compresses the data but also allows skipping the redundant composition of blank pixels, which make no contribution to the final image. This approach reduced the total time by decreasing both the communication time and the composition time in the case of sparse projections. However, it has a serious disadvantage: the encoded data can be larger than the original data when the average run length in the local images is small. This is likely to occur when complicated 3D objects are densely projected onto an image, as in volume rendering.
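A generic run-length coder makes both the benefit and the disadvantage easy to see. The sketch below is a textbook version written for this illustration, with each run stored as a `[value, length]` pair; the exact data layout used by Ahrens et al. is not reproduced here.

```python
def rl_encode(pixels):
    """Run-length encode a pixel sequence into [value, length] pairs."""
    runs = []
    for p in pixels:
        if runs and runs[-1][0] == p:
            runs[-1][1] += 1          # extend the current run
        else:
            runs.append([p, 1])       # start a new run
    return runs


def rl_decode(runs):
    """Expand [value, length] pairs back into the original pixel sequence."""
    out = []
    for value, length in runs:
        out.extend([value] * length)
    return out


sparse = [0] * 6 + [9, 9] + [0] * 4   # mostly blank scanline, 12 pixels
dense = [3, 4, 5, 6]                  # every pixel differs, 4 pixels
print(rl_encode(sparse))  # [[0, 6], [9, 2], [0, 4]]: 6 numbers instead of 12
print(rl_encode(dense))   # [[3, 1], [4, 1], [5, 1], [6, 1]]: 8 numbers instead of 4
```

For the dense scanline every run has length one, so the encoded form is twice as large as the input, which is the expansion problem described above.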

Yang et al. [17], [18] introduced both a bounding rectangle and RL encoding of blank pixels into volume rendering with binary-swap image composition. Their method is designed for sparse projections. Their RL encoding encodes only blank pixels instead of both blank and non-blank pixels. We refer to Yang’s RL encoding as binary run-length (BRL) encoding to distinguish it from that of Ahrens et al. After the blank pixels outside a bounding rectangle are eliminated, the blank pixels within the rectangle are encoded by BRL encoding. As a result, only opaque pixels are communicated and processed for composition.
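The idea of run-length coding only the blank/non-blank pattern can be sketched as follows. The data layout is an assumption made for this example (alternating run lengths that start with a possibly empty blank run, plus a separate list of non-blank pixel values); the actual format of Yang et al. may differ. The input is taken to be the pixels that remain inside the bounding rectangle.

```python
def brl_encode(pixels, blank=0):
    """Return (run_lengths, values).

    run_lengths alternates blank / non-blank run lengths and always starts
    with a blank run (of length 0 if the first pixel is non-blank).
    values holds the non-blank pixels, uncompressed, in scan order.
    This layout is an assumption for illustration only.
    """
    lengths, values = [0], []
    in_blank = True
    for p in pixels:
        is_blank = (p == blank)
        if is_blank != in_blank:      # the run type changes: open a new run
            lengths.append(0)
            in_blank = is_blank
        lengths[-1] += 1
        if not is_blank:
            values.append(p)
    return lengths, values


def brl_decode(lengths, values, blank=0):
    """Rebuild the pixel sequence from run lengths and non-blank values."""
    out, pos = [], 0
    for i, n in enumerate(lengths):
        if i % 2 == 0:                # even positions are blank runs
            out.extend([blank] * n)
        else:                         # odd positions are non-blank runs
            out.extend(values[pos:pos + n])
            pos += n
    return out


print(brl_encode([0, 0, 3, 4, 5, 0, 0, 0, 6]))  # ([2, 3, 3, 1], [3, 4, 5, 6])
print(brl_encode([3, 4, 5, 6]))                 # ([0, 4], [3, 4, 5, 6])
```

In the second call the scanline is entirely non-blank: the encoding adds only two run lengths, so it barely expands the data, but it does not shrink it either, because the non-blank pixel values are passed through unchanged.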

BRL encoding achieves further compression compared with applying only a bounding rectangle. In addition, the encoded data are almost never larger than the original data, because the method encodes only blank pixels, which tend to form longer runs than non-blank pixels. However, Yang’s method does not work well for non-sparse projections. In the case of dense projections, it cannot sufficiently compress images because there are fewer blank pixels. With insufficient compression, the composition time increases because the encoding time cannot be amortized. Furthermore, Yang et al. did not apply a load balancing scheme to the combination of BRL encoding with a bounding rectangle, despite their assertion that the non-blank pixels exchanged in binary-swap composition should be balanced. Even if images are sufficiently compressed, an uneven distribution of encoded data among PEs limits the reduction in communication time. In addition, their work does not contain any evaluation for different viewing directions, which seriously influence the performance of parallel image composition.

In this paper, we propose a new lossless compression scheme for image composition to achieve faster parallel rendering on distributed-memory parallel computers, such as recent PC clusters consisting of high-performance microprocessors and a relatively slow commodity interconnection network. In order to compress the data of non-blank pixels, we introduce differential coding in addition to Yang’s method. The proposed coding scheme is designed based on the statistical characteristics of differential pixel values between adjacent pixels. Moreover, we apply a load balancing scheme that interleaves scanlines to even out the amount of data exchanged in binary-swap image composition. We demonstrate the concept of our approach with an implementation on a PC cluster system.
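To illustrate what "differential pixel values between adjacent pixels" means, the following sketch computes the differences along a scanline and shows that they can be inverted exactly. It covers only this first step. The code that the proposed scheme assigns to the differences is not described in this excerpt and is not reproduced here; the 8-bit channel and the modulo-256 wrap-around are assumptions made for this example.

```python
def differences(scanline):
    """Difference of each pixel from its left neighbour, modulo 256.
    The first pixel is differenced against 0, so it is stored as-is.
    Assumption: one 8-bit channel, so every difference fits in one byte."""
    prev, out = 0, []
    for p in scanline:
        out.append((p - prev) % 256)
        prev = p
    return out


def reconstruct(diffs):
    """Invert differences(): a running sum modulo 256 (lossless)."""
    prev, out = 0, []
    for d in diffs:
        prev = (prev + d) % 256
        out.append(prev)
    return out


line = [100, 101, 101, 103, 102, 102, 104, 105]   # smoothly varying pixels
print(differences(line))  # [100, 1, 0, 2, 255, 0, 2, 1]
```

Apart from the first entry, the differences of this smooth scanline take only the values 0, 1, 2 and 255 (that is, -1), whereas the original pixels are spread over a wider range. A coder that spends fewer bits on such frequent small differences can therefore compress non-blank pixels that run-length coding cannot.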

This paper is organized as follows. Section 2 describes sort-last parallel rendering and the binary-swap image composition algorithm. Section 3 clarifies our motivation by describing related work in detail. In Section 4, we introduce the proposed differential coding scheme with load balancing after showing the coherence of differential pixel values in volume rendering images. Section 5 discusses the performance improvement achieved by the proposed scheme based on experimental results on a PC cluster system. We give concluding remarks and describe future work in Section 6.

Section snippets

Sort-last parallel rendering

According to Molnar’s taxonomy [13], [14], which is based on where primitives are sorted from object space to image space, parallel rendering algorithms can be classified into three categories: sort-first, sort-middle and sort-last algorithms. Fig. 1 depicts an overview of a sort-last parallel rendering algorithm. In sort-last parallel rendering, an object space is divided into subspaces in pre-processing. These subspaces are rendered in parallel to generate local images containing contributions of the

Conventional coding schemes for parallel image composition

So far, several lossless image coding schemes have been proposed to reduce the size of communicated data in parallel image composition [10], [16], [17], [18], [19]. To decrease the total time for parallel image composition by data compression, the coding time should be short enough to be amortized by the reduction in communication time. Consequently, coding schemes are required that exploit the coherence of images with only a few additional computations. In the following subsections, we explain some of

Differential image coding scheme with load balancing

To compress images without many additional computations, we have to carefully exploit the coherence in the local images of parallel rendering. Furthermore, a load balancing method should be introduced to even out the amount of exchanged data, which becomes unbalanced among PEs after compression. In this section, we present a coding scheme to compress foreground pixels, which cannot be compressed by BRL encoding. Then, we describe a load balancing method combined with the proposed coding scheme.
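The effect of interleaving scanlines on the balance of exchanged data can be illustrated with a small comparison. The sketch below is an assumption-laden toy, not the method as specified in the full text: it splits a region into two halves either contiguously or by alternating scanlines, and uses the number of non-blank pixels as a stand-in for the encoded data size. All function names are chosen for this example.

```python
def contiguous_halves(rows):
    """Split a region into its top half and its bottom half."""
    mid = len(rows) // 2
    return rows[:mid], rows[mid:]


def interleaved_halves(rows):
    """Split a region into even-numbered and odd-numbered scanlines."""
    return rows[0::2], rows[1::2]


def payload(rows, blank=0):
    """Number of non-blank pixels: a rough proxy for the encoded data size."""
    return sum(1 for row in rows for p in row if p != blank)


# Toy local image: the object covers the top four of eight scanlines.
image = [[1] * 4 for _ in range(4)] + [[0] * 4 for _ in range(4)]

top, bottom = contiguous_halves(image)
even, odd = interleaved_halves(image)
print(payload(top), payload(bottom))  # 16 0: one partner sends everything
print(payload(even), payload(odd))    # 8 8: both partners send the same amount
```

Once blank pixels are no longer transmitted, a contiguous split leaves one PE of a pair with all of the data in this example, while the interleaved split gives both PEs equal amounts, because each half samples the whole region.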

Performance evaluation

To evaluate the effectiveness of the proposed coding scheme, we conducted experiments on parallel image composition on a PC cluster with 8 PEs. Each PE of the PC cluster has a Pentium 4 processor running at 1.8 GHz and a main memory of 512 MB, which is sufficient to store the entire data for the experiments. The PEs are interconnected by a 100 Mbit Ethernet switching hub. The binary-swap composition with image coding schemes was implemented by using the C language and the MPI (message passing

Conclusions

In this paper, we have proposed an image coding scheme in order to reduce the time for parallel image composition. The proposed scheme exploits the statistical characteristics of differential pixel values in local images. In addition, we have combined the coding scheme with a load balancing method that interleaves scanlines. We have conducted experiments on a PC cluster system with 8 PEs. The experimental results showed that the proposed coding scheme with load balancing achieved much better

Acknowledgements

This research was partially supported by Grant-in-Aid for Young Scientists (B) KAKENHI (#13780183) and Grant-in-Aid for Scientific Research (B) KAKENHI (#14380131).

References (21)

M. Levoy, Display of surfaces from volume data, IEEE Computer Graphics and Applications (1988)
C. Upson et al., V-BUFFER: visible volume rendering, Computer Graphics (1988)
R.A. Drebin et al., Volume rendering, Computer Graphics (1988)
L.A. Westover, Footprint evaluation for volume rendering, Computer Graphics (1990)
P. Lacroute, M. Levoy, Fast volume rendering using a shear-warp factorization of the viewing transformation, in:...
P. Schröder et al., Data parallel volume-rendering algorithms for interactive visualization, The Visual Computer (1993)
W.M. Hsu, Segmented ray casting for data parallel volume rendering, in: Proceedings of 1993 Parallel Rendering...
P. Lacroute, Analysis of a parallel volume rendering system based on the shear-warp factorization, IEEE Transactions on Visualization and Computer Graphics (1996)
K.-L. Ma, J.S. Painter, C.D. Hansen, M.F. Krogh, A data distributed, parallel algorithm for ray-traced volume...
K.-L. Ma et al., Parallel volume rendering using binary-swap compositing, IEEE Computer Graphics and Applications (1994)

There are more references available in the full text version of this article.

