A new algorithm for translating psycho-acoustic information to the wavelet domain

doi:10.1016/S0165-1684(00)00230-9

Signal Processing

Volume 81, Issue 3, March 2001, Pages 519-531

https://doi.org/10.1016/S0165-1684(00)00230-9

Abstract

A desirable characteristic of a filter bank for subband audio coding is that both the filter bank and its inverse maintain a high degree of frequency selectivity. This design goal is not always met when the audio coder is based on the wavelet transform, which makes it difficult to translate psycho-acoustic information from the Fourier domain to the wavelet domain. In this paper, we present a new method for translating psycho-acoustic information to the wavelet domain. It can be applied to subband audio coders based on the orthonormal wavelet transform, provided that the subband decomposition approximates the frequency decomposition of sounds in the inner ear. A simple, improved psycho-acoustic model is also described. Both the psycho-acoustic model and the translation algorithm have been included in a wavelet-packet-based audio coder, which operates on CD-quality signals and allows transparent coding at bit rates of about 1.5 bits/sample. We show an inverse relation between the number of vanishing moments of the mother wavelet and the minimum bit rate that ensures transparent coding: decreasing the number of vanishing moments increases the bit rate slightly, without loss of subjective quality in the decoded audio signals.

Introduction

Since the 1980s, CD quality has been the main reference for audio-signal representation. In a CD-quality signal, each channel is sampled at 44.1 kHz and linearly coded with 16 bits/sample, which results in very high bit rates for the multimedia applications now emerging. This justifies the use of compression techniques, whose objective is to achieve a low bit rate in the digital representation of signals with the minimum perceived loss of quality.
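The scale of these rates is easy to check. The short Python sketch below computes the uncompressed rate for two-channel (stereo) CD audio and for a single channel, and compares the latter with the coded rate of about 1.5 bits/sample reported in the abstract. The helper `bit_rate` and the constant names are illustrative, not taken from the paper.

```python
# Uncompressed vs. coded bit rates for CD-quality audio (illustrative arithmetic).
SAMPLE_RATE_HZ = 44_100   # CD sampling rate per channel
PCM_BITS = 16             # linear PCM bits per sample
CODED_BITS = 1.5          # approximate coded rate reported in the abstract

def bit_rate(sample_rate_hz, bits_per_sample, channels=1):
    """Bits per second for a given sampling rate, resolution and channel count."""
    return sample_rate_hz * bits_per_sample * channels

cd_stereo = bit_rate(SAMPLE_RATE_HZ, PCM_BITS, channels=2)  # 1,411,200 bit/s
cd_mono = bit_rate(SAMPLE_RATE_HZ, PCM_BITS)                # 705,600 bit/s
coded_mono = bit_rate(SAMPLE_RATE_HZ, CODED_BITS)           # 66,150 bit/s
compression_ratio = PCM_BITS / CODED_BITS                   # about 10.7 : 1
```

At about 1.5 bits/sample, one channel needs roughly 66 kbit/s instead of 705.6 kbit/s, a compression ratio of about 10.7:1.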

A proper choice of coding technique must take into account the nature of audio signals, which are neither Gaussian nor stationary. A very important issue in audio coding is the possibility of exploiting the masking phenomenon of the inner ear. To do so, subband decompositions must be used, and each subband should be coded as an isolated signal, ensuring that the introduced quantization noise remains below the masking threshold. The minimum overall bit rate that ensures transparent coding is called the 'perceptual entropy'; it was proposed by Johnston [9] for use in audio coding instead of the classical entropy defined by Shannon.
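Perceptual entropy is usually stated in the following form, after Johnston [9]: for each critical band $i$ with $k_i$ spectral lines and masking-threshold energy $T_i$, the real and imaginary parts of every DFT coefficient are quantized with step $\sqrt{6T_i/k_i}$, and the entropy counts the $\log_2(2|\mathrm{nint}(\cdot)| + 1)$ bits needed for each resulting integer. The sketch below follows that form under stated assumptions: the band layout and thresholds are inputs that a psycho-acoustic model would supply, and the function name is illustrative.

```python
import numpy as np

def perceptual_entropy(spectrum, thresholds, bands):
    """Perceptual entropy (bits per frame) in the form usually attributed to Johnston.

    spectrum   -- complex DFT coefficients of one frame
    thresholds -- masking-threshold energy T_i of each band (must be > 0)
    bands      -- (start, stop) bin ranges of the bands, stop exclusive
    """
    spectrum = np.asarray(spectrum, dtype=complex)
    pe = 0.0
    for (start, stop), t in zip(bands, thresholds):
        k = stop - start                  # number of spectral lines in the band
        # Uniform noise step^2/12 in each of the real and imaginary parts
        # adds up to T_i / k_i per spectral line.
        step = np.sqrt(6.0 * t / k)
        x = spectrum[start:stop]
        for part in (x.real, x.imag):
            levels = np.abs(np.rint(part / step))  # nearest-integer quantizer index
            pe += float(np.sum(np.log2(2.0 * levels + 1.0)))
    return pe
```

Dividing the result by the frame length gives an estimate in bits/sample, the unit used in the abstract; raising the thresholds lowers the count, down to zero when every coefficient falls below half a quantizer step.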

Another important issue is the non-stationary nature of audio signals, which calls for local analysis. Several techniques have been proposed for this; the most recent is the wavelet transform and its generalization, the wavelet-packet decomposition. Because the wavelet transform is usually implemented as a filter bank, audio coders based on it can be regarded as subband coders.
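The filter-bank view can be made concrete with a small sketch: a discrete wavelet transform computed by repeatedly splitting the lowpass branch of a two-channel orthonormal filter bank. It assumes the 4-tap Daubechies filter (two vanishing moments), periodic extension at the frame borders, and a frame length divisible by $2^{\text{levels}}$; none of these choices is prescribed by the text, and the function names are illustrative. Each stage is written as an explicit orthogonal matrix, so synthesis is simply its transpose.

```python
import numpy as np

# 4-tap Daubechies scaling filter (two vanishing moments), chosen for illustration.
SQ3 = np.sqrt(3.0)
DB2 = np.array([1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3]) / (4.0 * np.sqrt(2.0))

def highpass(h):
    """Wavelet filter g[k] = (-1)^k h[L-1-k] paired with scaling filter h."""
    L = len(h)
    return np.array([(-1) ** k * h[L - 1 - k] for k in range(L)])

def analysis_matrix(h, n):
    """One two-channel analysis stage on a length-n frame with periodic extension.

    Row r of the top half holds h shifted by 2r (lowpass filtering plus
    downsampling by 2); the bottom half does the same with g.
    """
    g = highpass(h)
    W = np.zeros((n, n))
    for r in range(n // 2):
        for k in range(len(h)):
            W[r, (2 * r + k) % n] += h[k]
            W[n // 2 + r, (2 * r + k) % n] += g[k]
    return W

def dwt(x, h, levels):
    """Iterated filter bank: only the lowpass branch is split again."""
    approx = np.asarray(x, dtype=float)
    details = []
    for _ in range(levels):
        coeffs = analysis_matrix(h, len(approx)) @ approx
        half = len(approx) // 2
        details.append(coeffs[half:])
        approx = coeffs[:half]
    return approx, details

def idwt(approx, details, h):
    """Synthesis: each stage is orthogonal, so its inverse is its transpose."""
    for d in reversed(details):
        coeffs = np.concatenate([approx, d])
        approx = analysis_matrix(h, len(coeffs)).T @ coeffs
    return approx
```

For a 64-sample frame and three levels, `dwt` returns 8 approximation coefficients and detail bands of 32, 16 and 8 coefficients, and `idwt` recovers the frame up to rounding error. Splitting the highpass outputs as well would turn this into a wavelet-packet tree. The dense matrices keep the sketch easy to verify but cost $O(n^2)$ per stage.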

Today, international interest in audio coding is centered on the ISO/MPEG audio standards. The first, ISO/MPEG-1 [3], supports sampling rates of 32, 44.1 and 48 kHz. The latest, the ISO/MPEG-4 standard, comprises several speech and audio coders that support bit rates from 2 to 64 kbps per channel.

In parallel with the definition of the ISO/MPEG standards, several audio-coding algorithms have been proposed that use the wavelet transform to decompose the signal. The most often cited was designed by Sinha and Tewfik [12]. It is a high-complexity audio coder that provides high compression rates and high-quality reconstructed signals, and it includes a method for translating psycho-acoustic information to the wavelet domain. Several drawbacks of this method have been reported: it is computationally complex, and its simplification requires high-selectivity filters to implement the wavelet transform.

This paper focuses on the problem of translating psycho-acoustic information to the wavelet domain. We demonstrate that this can be achieved easily with filters that generate wavelets with any number of vanishing moments, provided that several constraints are imposed on the analysis and synthesis filter banks. We also describe a new perceptual coding algorithm for monophonic CD-quality audio signals that combines simplicity with high compression rates and ensures high perceptual quality of the decoded signal.

The sections of the paper are organized as follows:

Section 2 reviews multiresolution analysis and the wavelet transform, highlighting the properties that make this transform suitable for audio coding. The implementation of the discrete-time wavelet transform as an iterated filter bank is particularly important.

In Section 3, the coder and decoder structures are presented. They follow the general scheme of subband coding but include several original contributions: an improved psycho-acoustic model, a wavelet-packet decomposition of the audio signal and, most importantly, a new method for translating psycho-acoustic information from the Fourier domain to the wavelet domain.

Section 4 describes the psycho-acoustic model used. It is based on the model proposed in [9], which is also the basis of the second psycho-acoustic model of the MPEG-1 standard.

Section 5 describes the criteria for designing the filter banks used in subband coders. The filter bank that decomposes the time-domain input into subband components is crucial to the performance of an audio coding system [6], [8].

Wavelet-packet decompositions have many advantages but one serious drawback: the low frequency selectivity of the equivalent subband filters. As a result, the psycho-acoustic information estimated in the Fourier domain cannot be applied directly to the wavelet coefficients.

Because of this drawback, a wavelet-transform-based audio coder needs a translation algorithm. Section 6 describes a new algorithm and discusses the conditions that must be fulfilled for its correct use.

A double-blind test was carried out to evaluate the audio coder. The results for a wide set of audio signals are presented in Section 7.

Finally, Section 8 draws conclusions and outlines new lines of research.

Section snippets

Multiresolution analysis and wavelet transform

Multiresolution analysis is the process that allows us to express a function $f(t) \in L^2(\mathbb{R})$ as a linear combination of its projections onto a set of closed subspaces $\{V_m \mid m \in \mathbb{Z}\}$ that fulfil the following properties [1], [5], [10]:

• Nesting: all subspaces $V_m$ are nested, $\cdots \subset V_2 \subset V_1 \subset V_0 \subset V_{-1} \subset V_{-2} \subset \cdots$.
• Completeness: $\bigcap_{m \in \mathbb{Z}} V_m = \{0\}$ and $\overline{\bigcup_{m \in \mathbb{Z}} V_m} = L^2(\mathbb{R})$.
• Scaling property: $f(t) \in V_m \iff f(2t) \in V_{m-1}$.
• Base property: there is a scaling function $\varphi(t) \in V_0$ such that, for all $m \in \mathbb{Z}$, the set $\{\varphi_{mn}(t) = 2^{-m/2}\varphi(2^{-m}t - n)\}_{n \in \mathbb{Z}}$ is an orthonormal basis for $V_m$, that is, $\langle \varphi_{mn}(t), \varphi_{mn'}(t) \rangle = \delta_{nn'}$.
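Nesting and the base property together imply that $\varphi \in V_0 \subset V_{-1}$ satisfies a two-scale relation $\varphi(t) = \sqrt{2}\sum_k h[k]\,\varphi(2t - k)$, and orthonormality of the integer shifts of $\varphi$ becomes a condition on the scaling filter: $\sum_k h[k] = \sqrt{2}$ and $\sum_k h[k]\,h[k+2n] = \delta[n]$. The number of vanishing moments of the mother wavelet can likewise be read off the wavelet filter $g[k] = (-1)^k h[L-1-k]$ as the number of leading moments $\sum_k k^p g[k]$ that are zero. The sketch below checks both properties for the Haar and 4-tap Daubechies filters, chosen purely as examples; the helper names are illustrative.

```python
import numpy as np

SQ3 = np.sqrt(3.0)
FILTERS = {  # example scaling filters h[k], chosen for illustration
    "haar": np.array([1.0, 1.0]) / np.sqrt(2.0),
    "db2": np.array([1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3]) / (4.0 * np.sqrt(2.0)),
}

def is_orthonormal_scaling_filter(h, tol=1e-12):
    """Check sum_k h[k] = sqrt(2) and sum_k h[k] h[k + 2n] = delta[n]."""
    h = np.asarray(h, dtype=float)
    if abs(h.sum() - np.sqrt(2.0)) > tol:
        return False
    for shift in range(0, len(h), 2):
        corr = np.dot(h[: len(h) - shift], h[shift:])
        target = 1.0 if shift == 0 else 0.0
        if abs(corr - target) > tol:
            return False
    return True

def vanishing_moments(h, tol=1e-9):
    """Count leading moments sum_k k^p g[k] that vanish, with g[k] = (-1)^k h[L-1-k]."""
    L = len(h)
    g = np.array([(-1) ** k * h[L - 1 - k] for k in range(L)])
    k = np.arange(L, dtype=float)
    count = 0
    while count < L and abs(np.sum(k ** count * g)) < tol:
        count += 1
    return count
```

Both example filters pass the orthonormality check; `vanishing_moments` returns 1 for the Haar filter and 2 for the 4-tap filter, so the longer filter buys one more vanishing moment.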

Audio coder and decoder structure

In this section, the coder and decoder structures are described. They work with monophonic audio signals sampled at 44.1 kHz, each sample linearly coded with 16 bits. The objective is to obtain a digital representation of the original signal of minimum size that preserves its psycho-acoustic properties as far as possible. To this end, the following scheme has been implemented (see [15]):

Coder structure.

(1)	The proposed audio coder analyses the signal with a filter bank that

Masking threshold

We use a psycho-acoustic model similar to the well-known one proposed in [9], which is the basis of the second psycho-acoustic model of the ISO/MPEG-1 audio coding standard. The only improvement introduced is a new estimate of the tonality coefficient that must be assigned to each critical band.
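For reference, the sketch below shows how the tonality coefficient is obtained in the baseline model of [9], as that model is usually summarized: a spectral flatness measure $\mathrm{SFM}_{dB} = 10\log_{10}(G/A)$, with $G$ and $A$ the geometric and arithmetic means of the band's power spectrum, is mapped to $\alpha = \min(\mathrm{SFM}_{dB}/(-60\ \mathrm{dB}), 1)$, and $\alpha$ blends a tone-masking-noise offset of $(14.5 + i)$ dB with a noise-masking-tone offset of 5.5 dB for band $i$. This is the estimator the paper improves on, not the paper's own method; the function names are illustrative.

```python
import numpy as np

SFM_DB_MAX = -60.0  # flatness (dB) treated as fully tonal in the baseline model

def tonality_coefficient(band_power):
    """Tonality alpha in [0, 1] from the spectral flatness measure of one band."""
    p = np.maximum(np.asarray(band_power, dtype=float), 1e-20)  # avoid log(0)
    geometric_mean = np.exp(np.mean(np.log(p)))
    arithmetic_mean = np.mean(p)
    sfm_db = 10.0 * np.log10(geometric_mean / arithmetic_mean)  # <= 0 dB
    # Clamp to [0, 1]; the lower clamp only absorbs rounding for flat bands.
    return max(0.0, min(sfm_db / SFM_DB_MAX, 1.0))

def threshold_offset_db(alpha, band_index):
    """Offset of the masking threshold below the band energy, in dB:
    tone masking noise (14.5 + i dB) blended with noise masking tone (5.5 dB)."""
    return alpha * (14.5 + band_index) + (1.0 - alpha) * 5.5
```

A perfectly flat band gives $\alpha = 0$ (noise-like, 5.5 dB offset), while a band dominated by a single spectral line saturates at $\alpha = 1$.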

To estimate the masking threshold, we first estimate the power spectrum of the audio frame $x[n]$ using the modified periodogram with a Hann window $w[n]$:

$$S(e^{jk(2\pi/N)}) = \frac{1}{NU}\left|\sum_{n=0}^{N-1} w[n]\,x[n]\,e^{-jk(2\pi/N)n}\right|^{2}, \qquad U = \frac{1}{N}\sum_{n=0}^{N-1} w[n]^{2}$$
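A direct NumPy rendering of the modified periodogram, as a sketch: it weights the frame with `numpy.hanning`, which returns the symmetric Hann window (whether the paper uses the symmetric or the periodic variant is not stated), and divides the squared DFT magnitude by $N U$, with $U = \frac{1}{N}\sum_n w[n]^2$ the usual window-power correction. The function name is illustrative.

```python
import numpy as np

def modified_periodogram(frame):
    """S[k] = |sum_n w[n] x[n] exp(-j 2 pi k n / N)|^2 / (N U), Hann window w."""
    x = np.asarray(frame, dtype=float)
    n = len(x)
    w = np.hanning(n)              # symmetric Hann window
    u = np.sum(w ** 2) / n         # window power U
    return np.abs(np.fft.fft(w * x)) ** 2 / (n * u)
```

The normalization makes the estimate independent of the window's energy: for a constant unit frame, the spectrum averaged over all $N$ bins equals 1, the mean power of the frame. A sinusoid placed exactly on bin 64 of a 512-sample frame peaks at index 64.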

Filter bank design

The filter bank must be designed with the general objective of representing the input signal with as few bits as possible. Several design goals should be considered [8]:

(1)	The decomposition should be invertible, i.e., the filter bank should provide perfect reconstruction.
(2)	Both the analysis filter bank and its inverting process must maintain a high degree of frequency selectivity, in order to make the application of the perceptual threshold as simple as possible.
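For a two-channel orthonormal (paraunitary) filter bank with $g[k] = (-1)^k h[L-1-k]$, goal (1) is equivalent to the power-complementary condition $|H_0(e^{j\omega})|^2 + |H_0(e^{j(\omega+\pi)})|^2 = 2$ on the lowpass filter, while goal (2) can be gauged crudely by how much of the lowpass energy lies above $\pi/2$. The sketch below checks both for the Haar filter and the 4-tap Daubechies filter; the choice of filters and the leakage measure are illustrative assumptions, not the design criteria of [8].

```python
import numpy as np

SQ3 = np.sqrt(3.0)
HAAR = np.array([1.0, 1.0]) / np.sqrt(2.0)
DB2 = np.array([1 + SQ3, 3 + SQ3, 3 - SQ3, 1 - SQ3]) / (4.0 * np.sqrt(2.0))

def freq_response(h, omega):
    """H(e^{j w}) = sum_k h[k] e^{-j w k} at each frequency in omega."""
    k = np.arange(len(h))
    return np.exp(-1j * np.outer(omega, k)) @ h

def power_complementary(h, num=1024):
    """Orthonormal PR condition: |H(w)|^2 + |H(w + pi)|^2 = 2 for all w."""
    omega = np.linspace(0.0, np.pi, num)
    total = (np.abs(freq_response(h, omega)) ** 2
             + np.abs(freq_response(h, omega + np.pi)) ** 2)
    return bool(np.allclose(total, 2.0))

def stopband_fraction(h, num=4096):
    """Share of the lowpass energy above pi/2 (a crude selectivity measure)."""
    omega = np.linspace(0.0, np.pi, num)
    power = np.abs(freq_response(h, omega)) ** 2
    return power[omega > np.pi / 2].sum() / power.sum()
```

Both filters are power complementary. The Haar lowpass leaks about 18% of its energy above $\pi/2$ (exactly $1/2 - 1/\pi$), against about 13% for the 4-tap filter (exactly $1/2 - 7/(6\pi)$), so the longer filter with more vanishing moments is the more selective of the two.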

Algorithm for translating psycho-acoustic information to the wavelet domain

We were interested in developing a procedure that allows the direct translation of masking and auditory thresholds from the Fourier domain to the wavelet domain, even when low-selectivity filters are used to implement the wavelet transform or the wavelet-packet decomposition. We demonstrate that, under several constraints, audio coders based on the wavelet transform can be developed using filters that generate compactly supported wavelets with any number of vanishing moments (i.e., filters of any length).
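How low that selectivity can be is easy to check numerically. By the noble identities, the wavelet-packet node reached through analysis filters $H_{p_0}, H_{p_1}, \ldots, H_{p_{J-1}}$ is equivalent to the single filter $H_{p_0}(z)\,H_{p_1}(z^2)\cdots H_{p_{J-1}}(z^{2^{J-1}})$ followed by downsampling by $2^J$. The sketch below builds that equivalent filter with Haar filters (an illustrative choice, not the filters used in the paper) and measures how much of its energy lies inside a given band. It is restricted to the all-lowpass node, whose nominal band $[0, \pi/2^J]$ is unambiguous; other nodes do not appear in natural frequency order along the path.

```python
import numpy as np

# Orthonormal Haar analysis filters (illustrative choice).
H0 = np.array([1.0, 1.0]) / np.sqrt(2.0)   # lowpass
H1 = np.array([1.0, -1.0]) / np.sqrt(2.0)  # highpass

def upsample(h, factor):
    """Return the taps of h(z**factor): factor - 1 zeros between taps."""
    out = np.zeros((len(h) - 1) * factor + 1)
    out[::factor] = h
    return out

def equivalent_filter(path):
    """Equivalent filter of the node reached by `path` (0 = lowpass, 1 = highpass)."""
    h = np.array([1.0])
    for level, branch in enumerate(path):
        stage = H0 if branch == 0 else H1
        h = np.convolve(h, upsample(stage, 2 ** level))
    return h

def in_band_fraction(h, lo, hi, nfft=1 << 14):
    """Fraction of the filter's energy with lo <= omega <= hi (radians in [0, pi])."""
    power = np.abs(np.fft.rfft(h, nfft)) ** 2
    omega = np.linspace(0.0, np.pi, len(power))
    mask = (omega >= lo) & (omega <= hi)
    return power[mask].sum() / power.sum()
```

For the depth-3 lowpass node, the equivalent filter is an 8-tap moving average with taps $1/\sqrt{8}$, and about 22% of its energy falls outside the nominal band $[0, \pi/8]$. Leakage of this size is what prevents Fourier-domain thresholds from being applied band by band without a translation step.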

Results

To assess the quality of signals encoded with the proposed coder, we obtained subjective and objective results. Six music samples considered hard to encode were used. Special attention was paid to signals consisting of impulsive energy bursts, such as 'drums' or 'piano solo', which are extremely susceptible to 'pre-echoes'. As we shall see, the proposed coder with the new algorithm for translating psycho-acoustic information to the wavelet domain

Conclusions

In this paper, we have presented an improved masking model with a new method for calculating the tonality coefficient, which characterizes the tonal or noise-like nature of the analysed signal. This method is based on a linear predictor of the magnitude and phase spectra. We have also presented a new algorithm for translating psycho-acoustic information to the wavelet domain, which allows us to implement transparent audio coders based on the orthonormal-wavelet transform, using filters

References (15)

A. Cohen et al., Biorthogonal bases of compactly supported wavelets, Commun. Pure Appl. Math. (1992)
R. Coifman, Y. Meyer, S. Quake, M. Wickerhauser, Signal processing and compression with wave packets, Dept. Math., Yale...
I.T. Committee, Coding of moving pictures and associated audio for digital storage media up to about 1.5 Mbit/s,...
C.D. Creusere et al., Efficient audio coding using perfect reconstruction noncausal IIR filter banks, IEEE Trans. Speech Audio Process. (1996)
I. Daubechies, Orthonormal bases of compactly supported wavelets, Commun. Pure Appl. Math. (1988)
A. Gersho, Advances in speech and audio compression, Proc. IEEE (1994)
R.P. Hellman, Asymmetry of masking between noise and tone, Perception Psychophys. (1972)

There are more references available in the full text version of this article.

Cited by (15)

Auditory-motivated Gammatone wavelet transform
2014, Signal Processing
Citation excerpt:
A psychoacoustic model based on the WPT with applications to audio compression and watermarking was proposed by He and Scordilis [30]. Zurera et al. proposed an algorithm for translating psychoacoustic information to the wavelet domain, using which a WPT-based audio coder was developed [31]. Reyes et al. developed an adaptive WPT by minimizing different perceptual-entropy-based cost functions, and showed that it leads to a low bit-rate representation of audio signals without affecting perceptual quality [32].
The ability of the continuous wavelet transform (CWT) to provide good time and frequency localization has made it a popular tool in time–frequency analysis of signals. Wavelets exhibit constant-Q property, which is also possessed by the basilar membrane filters in the peripheral auditory system. The basilar membrane filters or auditory filters are often modeled by a Gammatone function, which provides a good approximation to experimentally determined responses. The filterbank derived from these filters is referred to as a Gammatone filterbank. In general, wavelet analysis can be likened to a filterbank analysis and hence the interesting link between standard wavelet analysis and Gammatone filterbank. However, the Gammatone function does not exactly qualify as a wavelet because its time average is not zero. We show how bona fide wavelets can be constructed out of Gammatone functions. We analyze properties such as admissibility, time-bandwidth product, vanishing moments, which are particularly relevant in the context of wavelets. We also show how the proposed auditory wavelets are produced as the impulse response of a linear, shift-invariant system governed by a linear differential equation with constant coefficients. We propose analog circuit implementations of the proposed CWT. We also show how the Gammatone-derived wavelets can be used for singularity detection and time–frequency analysis of transient signals.
An enhanced psychoacoustic model based on the discrete wavelet packet transform
2006, Journal of the Franklin Institute
Citation excerpt:
However, in contrast to other approaches (e.g. [14]) here the computation uses more precise critical bank approximations as well as simultaneous masking results obtained in the wavelet domain. As mentioned in Section 2, the DWPT can conveniently decompose the signal into a critical band-like partition [14,16,17]. The standard critical bands are included in Table 1.
The perception of acoustic information by humans is based on the detailed temporal and spectral analysis provided by the auditory processing of the received signal. The incorporation of this process in psychoacoustical computational models has contributed significantly both in the development of highly efficient audio compression schemes as well as in effective audio watermarking methods. In this paper, we present an approach based on the discrete wavelet packet transform, which closely mimics the multi-resolution properties of the human ear and also includes simultaneous and temporal auditory masking. Experimental results show that the proposed technique offers better masking capabilities and it reduces the signal-to-masking ratio when compared to related approaches, without introducing audible distortion. Those results have implications that are important both for audio compression by permitting further bit rate reduction, and for watermarking by providing greater signal space for information hiding.
Novel wavelet domain Wiener filtering de-noising techniques: Application to bowel sounds captured by means of abdominal surface vibrations
2006, Biomedical Signal Processing and Control
This work focuses on the design and evaluation of efficient and accurate de-noising algorithms that combine robust signal enhancement and minimum signal distortion. The proposed method introduces novel, frequency depended, parametric, Wiener filtering techniques that involve Discrete Wavelet Transform and Wavelet Packets. Implementations of various decomposition schemes, different mother wavelets and various thresholding options were tested, while perceptual criteria were also taken into account. The introduced de-noising approach has been extensively tested on human bowel sounds, captured by means of abdominal surface vibration recordings, in order to be further utilized as a diagnostic tool. Qualitative and quantitative analysis of the method's performance, when applied to various types of recorded and synthetic sounds, revealed that the new approach works excellent with favourable results.
Use of the symmetrical extension for improving a time-varying wavelet-packet-based audio coder
2003, Digital Signal Processing: A Review Journal
This paper deals with the use of the symmetrical extension as an alternative to wraparound in order to avoid the necessity of overlapping. Two important problems have been solved. First, how can we determine the quantization noise power that can be added to each subband signal in order to ensure transparent coding? Second, a new algorithm is presented for reducing sharp variations in quantization noise that appear at the border of frames when overlapping is avoided. The algorithm is based on forward and backward prediction at the border of frames and has been applied with success to an audio coder based on time-varying wavelet-packet decompositions that use symmetrical extension as a method for processing frames in isolation.
Adaptive wavelet-packet analysis for audio coding purposes
2003, Signal Processing
Citation excerpt:
To reduce this complexity, a simplification was proposed using long wavelet filters. In [10], an algorithm for translating psycho-acoustic information to the wavelet domain is presented. It can be used with filters that generate orthogonal wavelets with any compact support, but the analysis tree must be close to the critical band division.
This paper describes a wavelet-based perceptual audio coder, addressing the problem of the search for the wavelet-packet decomposition that minimizes a new perceptual cost function computed in the wavelet domain. We are interested in decompositions adapted to the nature of audio signals which take into account the characteristics of human hearing. The results of audio coding with three different decomposition criteria are presented for comparison purposes. They all give rise to adaptive wavelet-trees obtained minimizing different cost functions. These cost functions are the non-normalized Shannon entropy, the SUPER and our proposed perceptual cost function. Another important contribution is the algorithm for bit allocation, that takes into consideration the synthesis filter bank. The results confirm that the best way to achieve maximum compression rate and transparent coding is the usage of perceptual-entropy-based decompositions. Experimental results indicate that our coding scheme ensures transparent coding of one channel CD-quality audio signals at bit rates below 64 kbps for most audio signals.
Comparison of different wavelet decomposition techniques for PEAQ model to assess the quality of audio codecs
2015, 2nd International Conference on Electronics and Communication Systems, ICECS 2015
