High performance JPEG 2000 and MPEG-4 VTC on SMPs using OpenMP
Introduction
Image and video coding methods that use wavelet transforms have been successful in providing high rates of compression while maintaining good image quality and have generated much interest in the scientific community as competitors to DCT-based compression schemes. With the finalization of the wavelet-based JPEG 2000 standard [1] and the inclusion of a wavelet algorithm for synthetic/natural hybrid coding in MPEG-4 (MPEG-4 VTC) [2], [3] there is no doubt that wavelet image compression has to be considered state of the art nowadays.
In this work we show how to improve the runtime performance of MPEG-4 VTC and JPEG 2000. First, we improve the wavelet decomposition by reorganizing the order in which the data is processed so that the cache is used more effectively. Second, we exploit parallelism within the two major coding stages of each algorithm to further speed up execution: wavelet lifting and code-block processing in JPEG 2000, and convolution-based wavelet filtering and zerotree coding in MPEG-4 VTC.
The reference software used in our experiments is the MPEG-4 MoMuSys (Mobile Multimedia Systems) Verification Model of August 1999 (ISO/IEC JTC1/SC29/WG11 N2805) and the Jasper JPEG 2000 reference implementation (by Michael D. Adams, available at http://www.ece.ubc.ca/mdadams), both of which are written in C. We use OpenMP (http://www.openmp.org) to implement our parallel concept for execution on shared-memory multiprocessors, which are known to be interesting hardware platforms for image processing applications [4]. Parallel results are presented for two multiprocessor platforms: an SGI Power Challenge (20 IP25 RISC CPUs, running at 195 MHz) and an SGI Origin3800 (128 MIPS RISC R12000 CPUs, running at 400 MHz). Note that the following paragraphs focus on the two considered official JPEG 2000 reference implementations. Other software packages (like the JPEG 2000 VM 6.0 or the Kakadu software) already contain some of the proposed or similar techniques for cache behavior optimization.
Lucka and Sorevik [5] propose an OpenMP-based parallelization of a first-generation wavelet compression scheme. Message-passing-based parallelizations of second-generation wavelet image coding systems (i.e. tree-based or EBCOT-like schemes) are discussed by Feil and Uhl [6], [7] and Kutil [8]. In this work we apply OpenMP-based parallelization techniques to the second-generation wavelet image coding systems JPEG 2000 and MPEG-4 VTC. In Section 2, we briefly review JPEG 2000 and MPEG-4 VTC and compare the two standards in terms of compression and execution performance. Section 3 discusses and resolves cache organization problems of the considered reference implementations. Parallelization strategies and corresponding experimental results are covered in Section 4. Section 5 concludes the paper.
Wavelet-based image compression standards
Here we present the basic features and techniques implemented in the two still image coding standards JPEG 2000 and MPEG-4 VTC. Firstly, both algorithms are discussed. Secondly, the coding performance of the two algorithms is compared and the results are related to the performance of the well-known JPEG image coding standard. The last part of this section gives a runtime analysis of JPEG 2000 and MPEG-4 VTC.
Cache issues
In Fig. 5a, the runtime of the first decomposition level of the MoMuSys MPEG-4 VTC DWT is subdivided into vertical and horizontal filtering. We see a significant difference between the vertical and horizontal filtering performance, and the gap widens with increasing image dimensions. For large images, vertical filtering is 5–8 times slower than horizontal filtering.
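The gap between the two filtering directions can be illustrated with a minimal C sketch. It assumes a row-major image buffer and a hypothetical 3-tap averaging filter with clamped borders; neither the filter nor the function names `vfilter_columnwise` and `vfilter_aggregated` are taken from MoMuSys or Jasper. The first function walks one column at a time, so consecutive accesses are `width` elements apart. The second processes a strip of adjacent columns together, which is one plausible reading of aggregating the vertical filtering, so each fetched cache line serves several columns.

```c
#include <assert.h>
#include <stddef.h>

/* Column-at-a-time vertical filtering of a row-major image.
 * Successive accesses in the inner loop are `width` floats apart,
 * so for wide images almost every access touches a new cache line. */
void vfilter_columnwise(const float *in, float *out,
                        size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    for (size_t x = 0; x < width; ++x) {
        for (size_t y = 0; y < height; ++y) {
            size_t up = (y == 0) ? 0 : y - 1;           /* clamp at top    */
            size_t dn = (y + 1 == height) ? y : y + 1;  /* clamp at bottom */
            out[y * width + x] = (in[up * width + x] +
                                  in[y * width + x] +
                                  in[dn * width + x]) / 3.0f;
        }
    }
}

/* Aggregated variant (hypothetical): `strip` adjacent columns are
 * filtered together, so the innermost loop runs along a row and reuses
 * the cache lines that were just loaded. */
void vfilter_aggregated(const float *in, float *out,
                        size_t width, size_t height, size_t strip)
{
    assert(width > 0 && height > 0 && strip > 0);
    for (size_t x0 = 0; x0 < width; x0 += strip) {
        size_t x1 = (x0 + strip < width) ? x0 + strip : width;
        for (size_t y = 0; y < height; ++y) {
            size_t up = (y == 0) ? 0 : y - 1;
            size_t dn = (y + 1 == height) ? y : y + 1;
            for (size_t x = x0; x < x1; ++x) {
                out[y * width + x] = (in[up * width + x] +
                                      in[y * width + x] +
                                      in[dn * width + x]) / 3.0f;
            }
        }
    }
}
```

Both functions compute the same output; only the traversal order differs. A suitable value for `strip` depends on the cache line size and associativity of the target machine and would have to be determined experimentally.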
A very similar runtime gap can be observed in the JPEG 2000 reference implementation Jasper (see Fig. 5
Parallelization using OpenMP
OpenMP [19] (http://www.openmp.org) is an efficient tool for programming in parallel shared-memory environments. It can be seen as a programming interface that generalizes the use of threads: thread creation and handling, as well as the synchronization between threads, are hidden behind compiler directives, so-called pragmas. These pragmas provide constructs for executing sections of a sequential program (e.g. loops) in parallel.
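As a minimal illustration of such a pragma, the following C sketch parallelizes a loop over image rows. The 3-tap averaging filter and the function name `hfilter_rows_parallel` are hypothetical and are not taken from either reference implementation. Each iteration writes only to its own row, so the iterations are independent and no explicit synchronization is needed.

```c
#include <assert.h>
#include <stddef.h>

/* Horizontal 3-tap averaging of a row-major image, one row per loop
 * iteration. The pragma asks OpenMP to distribute the rows over the
 * available threads; a compiler without OpenMP support ignores it and
 * the loop runs sequentially with the same result. */
void hfilter_rows_parallel(const float *in, float *out,
                           int width, int height)
{
    assert(width > 0 && height > 0);
    #pragma omp parallel for schedule(static)
    for (int y = 0; y < height; ++y) {
        const float *src = in + (size_t)y * (size_t)width;
        float *dst = out + (size_t)y * (size_t)width;
        for (int x = 0; x < width; ++x) {
            int l = (x == 0) ? 0 : x - 1;          /* clamp at left edge  */
            int r = (x + 1 == width) ? x : x + 1;  /* clamp at right edge */
            dst[x] = (src[l] + src[x] + src[r]) / 3.0f;
        }
    }
}
```

OpenMP support has to be enabled at compile time, with GCC for example via `-fopenmp`. The static schedule is a reasonable default here because every row requires the same amount of work.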
Conclusion
The runtime performance of the MoMuSys MPEG-4 VTC and the JPEG 2000 reference implementation Jasper is improved significantly by implementing aggregated vertical filtering. Depending on certain parameters, a parallel version can further speed up the execution of both algorithms to some extent. The aggregated version’s parallel efficiency is better, which is mainly due to cache and bus phenomena. However, the scalability is very limited due to the relatively large amount of inherently sequential
Acknowledgment
This work has been partially supported by the Austrian Science Fund (project FWF-13903).
References
- et al., Motion-compensated wavelet packet zerotree video coding on multicomputers, Journal of Systems Architecture (2003)
- Approaches to zerotree image and video coding on MIMD architectures, Parallel Computing (2002)
- Parallelizing Mallat algorithm for 2-D wavelet transforms, Information Processing Letters (1993)
- et al., JPEG2000: Image Compression Fundamentals, Standards and Practice (2002)
- et al., Scalable wavelet coding for synthetic/natural hybrid coding, IEEE Transactions on Circuits and Systems for Video Technology (1999)
- ISO/IEC 14496-2, Information technology - coding of audio-visual objects - Part 2: Visual, December...
- C. Rothlübbers, R. Orglmeister, Parallel image processing using a Pentium based shared-memory multiprocessor system,...
- M. Lucka, T. Sorevik, Parallel wavelet-based compression of two-dimensional data, in: A. Handlovicova, M. Komornikova,...
- et al., Wavelet packet zerotree image coding on multicomputers
- et al., JPEG: Still Image Compression Standard (1993)