Elsevier

Parallel Computing

Volume 28, Issues 7–8, August 2002, Pages 1079-1093

Parallel codebook design for vector quantization on a message passing MIMD architecture

https://doi.org/10.1016/S0167-8191(02)00109-6

Abstract

Vector quantization (VQ) is a widely used algorithm in speech and image data compression. One of the problems of the VQ methodology is that it requires a large computation time, especially for large codebook sizes. This paper addresses two issues. The first deals with the parallel construction of the VQ codebook, which can drastically reduce the training time. A master/worker parallel implementation of a VQ algorithm is proposed. The algorithm is executed on the DM-MIMD Alex AVX-2 machine using a pipeline architecture. The second issue deals with the ability to accurately predict the machine performance. Using communication and computation models, a comparison between expected and real performance is carried out. Results show that the two models can accurately predict the performance of the machine for image data compression. An analysis of metrics normally used in parallel realizations is also conducted.

Introduction

Vector quantization (VQ) [1] is an efficient technique for the compression of speech and image signals. A vector quantizer can be defined as a mapping Q of a d-dimensional Euclidean space R^d into a finite subset Y of R^d. Thus, Y = {y_i; i = 1, 2, …, N} is a set of reproduction vectors (called codevectors), and N is the number of vectors in Y. The set Y is called the codebook and N the codebook size. A VQ can be decomposed into two components, the encoder and the decoder. The encoder, E, is the mapping from R^d to the index set I = {1, 2, …, N}, and the decoder, D, maps the index set I back into the codebook Y, i.e., E: R^d → I and D: I → R^d.
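As an illustration, the encoder and decoder mappings E and D can be sketched as follows (a minimal Python sketch; the function names and the toy codebook are ours, not from the paper):

```python
import numpy as np

def encode(x, codebook):
    """Encoder E: R^d -> I, the index of the nearest codevector to x."""
    dists = np.sum((codebook - x) ** 2, axis=1)  # squared Euclidean distances
    return int(np.argmin(dists))

def decode(i, codebook):
    """Decoder D: I -> R^d, the reproduction codevector for index i."""
    return codebook[i]

# Toy codebook with N = 4 codevectors of dimension d = 2
Y = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
idx = encode(np.array([0.9, 0.1]), Y)   # nearest codevector is [1, 0]
xhat = decode(idx, Y)                   # reproduction vector
```

Only the index `idx` needs to be transmitted or stored; the decoder recovers the reproduction vector by a table lookup.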

The code rate, or bit rate, is defined as r=(log2N)/d, which measures the number of bits per vector component used to represent the input vector and gives an indication of the precision that is achievable with a vector quantizer if the codebook is well designed. The distortion of the VQ is defined as the average quantization error between the input source signals and their reproduction codevectors. Increasing the number of codevectors in the codebook can decrease the distortion of the VQ, and, normally, will increase the bit rate. One major concern of a VQ design is the trade-off between distortion and the bit rate.
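The code-rate formula can be evaluated directly; the following small sketch uses a 256-codevector codebook over 16-dimensional vectors as an example (these particular figures are illustrative, not results from the paper):

```python
import math

def bit_rate(N, d):
    """Code rate r = log2(N) / d, in bits per vector component."""
    return math.log2(N) / d

# A 256-codevector codebook over 16-element vectors (e.g. 4x4 image blocks)
r = bit_rate(256, 16)   # 8 index bits spread over 16 components -> 0.5
```

Doubling the codebook size N adds one bit to each transmitted index, so r grows only logarithmically in N while the distortion typically decreases.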

Construction of the codebook is achieved through a learning process. Traditional learning algorithms include the LBG algorithm [2] and the K-means algorithm [3]. Many neural codebook techniques have also been proposed for the training of vector quantizers. These include competitive learning [4], self-organization maps [5], and frequency sensitive competitive learning [6].

During training using the K-means algorithm, a distortion measure J(x, y) is viewed as the cost of representing an input vector x using the codevector y, where J is given by

J(x, y) = ∑_{i=1}^{d} (x_i − y_i)^2

The training process can be summarized as follows. Each of the data vectors is compared to each of the codevectors. The codevector that most closely matches the data vector, i.e., the codevector which has the minimum distortion with the data vector, is selected and the codeword is modified to reflect the inclusion of this new data vector. After the codebook has been constructed, input data are encoded in a similar manner. Each data vector is compared with each of the codevectors (called full search), and the index of the codevector with minimum distortion is transmitted or stored. In decoding, the index is used to access the codevector which is then used to represent the data vector. Obviously, the full search VQ is a computationally expensive algorithm.
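The full-search training loop described above can be sketched as follows (a simplified batch K-means variant in Python; the random initialization and fixed iteration count are our assumptions, not the paper's exact procedure):

```python
import numpy as np

def kmeans_vq(data, N, iters=10, seed=0):
    """Full-search K-means codebook training (minimal sketch).

    Each iteration assigns every training vector to its nearest
    codevector (full search) and then replaces each codevector by the
    centroid of the vectors assigned to it.
    """
    rng = np.random.default_rng(seed)
    # Initialize the codebook with N distinct training vectors
    codebook = data[rng.choice(len(data), size=N, replace=False)].astype(float)
    for _ in range(iters):
        # Full search: squared distance from every data vector to every codevector
        d2 = ((data[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for j in range(N):
            members = data[nearest == j]
            if len(members) > 0:
                codebook[j] = members.mean(axis=0)
    return codebook
```

The full search costs O(D · N · d) distance evaluations per iteration for D training vectors, which is exactly the computational burden that motivates the parallel implementation in this paper.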

This has led to some attempts at parallel realizations of the VQ algorithm. There are two approaches to the parallel implementation of VQ. The first is concerned with the VQ encoding process, where the codevectors are distributed over the processors and the nearest neighbor of the input vector is searched for in parallel. The second approach deals with the parallel construction of the codebook. The first approach was adopted by Manohar and Tilton [7], [8] in their parallel implementation of a progressive vector quantization compression approach, which decomposes image data into a number of levels using full search VQ. The computational difficulties are addressed by implementation on the SIMD MasPar machine. In this method, each processor element (PE) is assigned one codevector. Data vectors are individually transmitted to all PEs and compared against all the codevectors at the same time. The method requires the codebook size to equal the number of PEs (16,384) in order to utilize the machine to its full capacity. The parallel algorithm was not evaluated using metrics normally used in the parallel literature.

In another paper [9], an evaluation of the parallel implementation of full search VQ on the MasPar machine is carried out. The MFLOPS metric is used to evaluate the performance of the method, where the MasPar is compared against a Sun SPARCstation 10 workstation. Different codebook sizes were tested. Performance improves with codebook size because increasing the codebook size at a constant codevector size increases the computational load while the communication time remains constant. On the other hand, increasing the vector size at a constant codebook size adds no extra computation per index but incurs greater communication overhead, resulting in a decrease in performance. The peak performance achieved by the algorithm is close to 3.6 GFLOPS, which is considered very good on the MasPar MP-2 system (peak rating 6.2 GFLOPS).

Another parallel implementation on the MasPar is also reported in [10]. In this implementation, the codevector size was fixed while the codebook size and the number of training patterns varied. The algorithm was implemented in such a way that neither communication nor synchronization among the PEs is required.

Kobayashi et al. [11] proposed a memory-based processor called a functional memory type parallel processor (FMPP) for VQ encoding. The architecture has as many PEs as code vectors, all connected by a shared bus, and the nearest vector is searched for exhaustively in parallel. Each PE has conventional memories to store code vectors and an arithmetic logic unit to compute the distance between a code vector and the input vector. The nearest vector is obtained using a parallel search done in O(k) time, where k is the vector dimension; the number of code vectors does not affect the computation time. In the nearest neighbor search, only input vectors are broadcast to the PEs over the shared data bus. Code vectors can be easily updated since they are stored in conventional memories, and all PEs can be laid out in a memory-like regular-array structure, which minimizes circuit area. The authors implemented a VLSI chip including four PEs operating at 25 MHz. The architecture is compared with a conventional serial processor in which all instructions complete in one clock cycle; the number of clock cycles required on the serial processor is much greater than on the FMPP-VQ. The architecture has proved its efficiency in sending a video sequence through a low bit-rate line with such simple hardware.

Contrary to the previously outlined methods, in this paper we investigate the second approach to the parallel implementation of the VQ algorithm, in which the VQ codebook is constructed in parallel. Here we use the distributed-memory (DM) MIMD architecture known as the Alex AVX-2 machine, and the parallel algorithm employs the master/worker method.

The paper is organized as follows. Section 2 describes the architecture of the AVX machine and the message passing mechanism used for communication. Section 3 explains the master/worker parallel implementation of the VQ codebook construction algorithm. The single-processor computation model is analyzed in Section 4 while the communication model is introduced in Section 5. In Section 6, the actual performance for different codebook sizes and different number of processors is then compared with the predicted performance. Finally, conclusions of the paper are given in Section 7.

Section snippets

The AVX-2 MIMD machine

The parallel VQ algorithm proposed here is mapped onto the MIMD machine known as the Alex AVX-2. The machine is a fairly new parallel system. It is a RISC-based distributed-memory multi-processor with 60 Mflop/s per processor and 20 MB memory per node. Its maximal theoretical peak performance is 3.84 Gflop/s. Its communication bandwidth is <10 MB/s. The Alex machine under consideration contains 64 compute nodes and eight root nodes. Each node employs an Intel i860 microprocessor for computation

The parallel vector quantization (PVQ) codebook generation algorithm

In our simulation of the PVQ, we intended to build codebooks of different sizes using a pool of training data extracted from nine 512×512 gray images. To generate the data, the images were divided into blocks of size 4×4, so that each training data vector comprises 16 elements. Using this large amount of data to build the vector quantizer, the AVX machine was configured to use the master/worker
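The block extraction described above can be sketched as follows (the helper name is ours; it assumes non-overlapping 4×4 tiles, which matches the 16-element vectors stated in the text):

```python
import numpy as np

def image_to_vectors(img, block=4):
    """Split a gray image into non-overlapping block x block tiles,
    flattening each tile into a training vector of block*block elements."""
    h, w = img.shape
    return (img[:h - h % block, :w - w % block]
            .reshape(h // block, block, w // block, block)
            .swapaxes(1, 2)                 # group the two tile axes together
            .reshape(-1, block * block))    # one row per flattened tile

# A 512x512 image yields (512/4)^2 = 16384 vectors of 16 elements each
vecs = image_to_vectors(np.zeros((512, 512)))
assert vecs.shape == (16384, 16)
```

Nine such images would therefore yield 9 × 16384 = 147,456 training vectors, consistent with the "large number of data" the text refers to.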

The computation model

In order to predict the performance of the machine when p processors are arranged to build the codebook, it is required first to find out a model of the computation time when only one processor is used. Assuming a perfect load balance, it is then possible to project the computation time needed for the p processors to do the task.

To achieve this goal, we have experimented with four different codebooks with the number of codevectors N equal to 64, 128, 256, 512. In training each codebook, we have
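One way to estimate a per-pattern time constant from such single-processor runs can be sketched as follows. Note that with p = 1 and fixed D, the model T_calc = D·τ_pat·N + τ_itr·N collapses to a single slope in N, so only the combined constant is identifiable; the timing values below are purely illustrative, not measurements from the paper:

```python
import numpy as np

# Hypothetical single-processor training times (seconds) for four codebook
# sizes N, with D training vectors. Illustrative values only.
D = 9 * (512 // 4) ** 2                  # nine 512x512 images in 4x4 blocks
N = np.array([64.0, 128.0, 256.0, 512.0])
T = np.array([12.0, 24.1, 48.3, 96.9])

# Least-squares slope of T = a*N, where a = D*tau_pat + tau_itr
a = float(N @ T / (N @ N))
# Approximate per-pattern, per-codevector cost, neglecting tau_itr
tau_pat_est = a / D
```

When the per-iteration overhead τ_itr is comparatively negligible, a ≈ D·τ_pat, giving a per-pattern cost on the order of microseconds per codevector for these sample numbers.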

The communication model

In a previous implementation [14] of the backpropagation network on the AVX-2 machine running Trollius, we found that the communication time greatly exceeds the computation time when transport-level message passing was used. At this level, synchronization is enforced, i.e., a sending process cannot transmit a message that may never be received due to a programming error or a deadlock situation. The sending process sends a "ready-to-send" message to the receiver and blocks until the

Comparison between the actual and the predicted performance

In this section, we draw a comparison between the predicted and the actual machine performance, using the computation and communication models developed earlier to calculate the predicted time required to train codebooks of different sizes. Combining the two models, the total predicted time is

Tpred(D, p, Y) = Tcalc(p, Y) + Tcomm(p, Y) = (D/p) τpat N + τitr N + tsetup + tpack d N + tcomm d N p

Since data slicing parallelism normally assumes a perfect load balance, each worker will be calculating the
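The combined model can be evaluated programmatically; the sketch below follows the term-by-term structure of the predicted-time expression, with all constants understood as machine-specific values to be measured (the function name and any numbers passed to it are ours):

```python
def predicted_time(D, p, N, d, tau_pat, tau_itr, t_setup, t_pack, t_comm):
    """Total predicted time T_pred = T_calc + T_comm under perfect load
    balance: D training vectors split over p workers, codebook size N,
    vector dimension d. A sketch of the combined model only."""
    t_calc = (D / p) * tau_pat * N + tau_itr * N          # computation model
    t_cm = t_setup + t_pack * d * N + t_comm * d * N * p  # communication model
    return t_calc + t_cm
```

The computation term shrinks as 1/p while the communication term grows with p, so the model predicts an optimal number of workers beyond which adding processors no longer pays off.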

Conclusions

In this paper, a master/worker parallel implementation of a VQ algorithm to train a codebook on a gray-image database has been evaluated using the Alex AVX-2 parallel computer. Models estimating computation and communication time have been developed. Using these models, it has been shown that the machine performance can be predicted within a very small margin of error. Therefore, the machine performance for any number of workers, training-data size or codebook size can be easily found. The same method

Acknowledgements

This research has been supported by Infolytica Inc., Montreal, Quebec, Canada.


1. Present address: Mentor Graphics Corp., Egypt.
