To what degree of voice perturbation are jitter measurements valid? A novel approach with synthesized vowels and visuo-perceptual pattern recognition

https://doi.org/10.1016/j.bspc.2011.05.002

Abstract

Objective measurement of the severity of dysphonia typically requires signal processing algorithms applied to acoustic recordings. Since Lieberman (1963) introduced the concept of perturbation analysis in the area of voice, the best-known acoustic parameter in clinical practice is conventional jitter. However, jitter measurements have some critical limitations. According to a widely accepted guideline, in sustained vowels of dysphonic voices, only perturbation measures less than about 5% are reliable: this is related to period extraction methods. This limit of 5% deserves critical analysis, certainly when there are indications that some acoustic analysis programs can be applied to quite irregular voices such as substitution voices. The present experiment demonstrates that – on signals of synthesized deviant voices (sustained vowel) with moderate additive noise – different raters are able to visually identify in a very consistent way the period durations of successive cycles up to values of about 13% jitter. Furthermore, even for higher values – over 30% – the jitter % computed with the period values rated by visual perception is, for some of the raters, very comparable to the real value. This suggests that improved acoustic programs using more reliable algorithms could validly transgress the traditional limit of 5% if they demonstrate the correspondence of their computations with the true jitter values. This is now made possible by synthesizers generating artificial deviant voices that cannot be distinguished from true dysphonia, and in which the jitter put in is exactly known.

Introduction

Objective measurement of the severity of dysphonia typically requires signal processing algorithms applied to acoustic recordings. Since Lieberman [1] introduced the concept of perturbation analysis in the area of voice, the best-known acoustic parameter in clinical practice is conventional jitter. Its relevance has been demonstrated in several clinical trials, e.g. for investigating treatment efficacy [2], [3], and it has been recommended (when measured on a sustained /a/) as an essential acoustic parameter within the basic ELS-protocol for the multidimensional assessment of dysphonia [4].
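
For reference, conventional jitter is usually computed as the mean absolute difference between consecutive period durations, expressed as a percentage of the mean period. The expression below is a sketch under that assumption; the symbols T_i (duration of cycle i) and N (number of cycles) are introduced here for illustration only, and individual analysis programs may differ in detail.

```latex
\mathrm{jitter}\,(\%) = 100 \times
\frac{\dfrac{1}{N-1}\displaystyle\sum_{i=1}^{N-1}\left|T_{i+1}-T_{i}\right|}
     {\dfrac{1}{N}\displaystyle\sum_{i=1}^{N} T_{i}}
```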

However, jitter measurements have some critical limitations. For specific categories of severely deviant voices such as spasmodic dysphonia and substitution voices, the ELS-protocol – and the acoustical analysis in particular – is not suitable because of the strong aperiodicity in the signals [4]. The Dysphonia Severity Index (DSI), which includes a jitter measurement, is subject to the same limitation [5]. Van As concludes that only 30% of tracheo-esophageal voices can be reliably analysed with the Multi-Dimensional Voice Program (Kay Elemetrics, USA) [6]. Programs either refuse to quantify perturbation, reporting that the signal is (mainly) unvoiced, or return aberrant or irreproducible results.

Titze and Liang [7] conclude that waveform-matching (a method commonly used by voice analysis programs) meets reliability criteria better than peak-picking and zero-crossing methods for detecting frequency changes, but warn of a loss of accuracy for variations higher than 6%. Possible reasons for this are discussed by Roark [8]. In a summary statement of the National Center for Voice and Speech, Titze confirms that, for type 1 signals (i.e. without ‘structured’ modulations such as diplophonia, without strong subharmonics and not completely aperiodic), perturbation analysis has considerable utility and reliability, and proposes, as a practical guideline, that perturbation measures less than about 5% be considered reliable [9].
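
To make the idea of waveform-matching concrete, the following minimal sketch estimates successive cycle lengths by choosing, for each cycle, the length that minimizes the mean squared difference between that cycle and the one following it. It is a simplified least-squares variant written for illustration, not the procedure of Titze and Liang [7]; the function names, the search bounds and the test signal are assumptions, and real programs additionally interpolate to sub-sample precision.

```python
import math

def match_period(x, start, min_len, max_len):
    """Return the cycle length (in samples) for which the cycle starting at
    `start` best matches the cycle that follows it, or None if none fits."""
    best_len, best_err = None, float("inf")
    for length in range(min_len, max_len + 1):
        if start + 2 * length > len(x):
            break  # not enough signal left to compare two full cycles
        err = sum((x[start + i] - x[start + length + i]) ** 2
                  for i in range(length)) / length
        if err < best_err:
            best_len, best_err = length, err
    return best_len

def extract_periods(x, min_len, max_len):
    """Walk through the signal cycle by cycle and collect matched lengths."""
    periods, start = [], 0
    while True:
        length = match_period(x, start, min_len, max_len)
        if length is None:
            break
        periods.append(length)
        start += length
    return periods

# Hypothetical test signal: a clean 100 Hz sine sampled at 8000 Hz,
# i.e. exactly 80 samples per cycle.
fs = 8000
signal = [math.sin(2 * math.pi * 100 * n / fs) for n in range(800)]
periods = extract_periods(signal, min_len=60, max_len=100)
```

The search range must exclude multiples of the expected period (here 160 samples), otherwise a double-length cycle matches equally well; this is one way in which strongly perturbed signals can mislead such a method.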

Limitations in jitter measurements have motivated the investigation of nonlinear dynamics for quantifying the chaotic character of strongly deviant voices. For example, the “correlation dimension” is a quantitative measure that may specify the number of degrees of freedom (i.e. dimensions) needed to describe a dynamic system [10]. If the dynamics of a system can be determined to be low dimensional, then a complex determinism may exist, which is responsible for the observed signal profile. Conversely, the more complex the system, the greater the number of degrees of freedom needed to describe its dynamic state, and the higher the correlation dimension. A potential advantage of nonlinear analysis methods over traditional perturbation measures of the voice signal is that they avoid the need to identify cycle boundaries. However, here again restrictions arise, and a non-negligible percentage of voices cannot be analysed. Awan et al. [10] conclude that the strength of nonlinear dynamic methods may potentially reside in providing some insight into the theoretical rules or initial conditions that may result in different modes of normal or disordered phonation, but that their applicability as, for example, a treatment outcome measure is questionable, particularly for the more severely dysphonic samples.
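
One common way to estimate a correlation dimension is the Grassberger-Procaccia correlation sum computed on delay-embedded samples; whether the studies cited here used exactly this procedure is not stated, so the sketch below is an assumption. The helper names, the embedding parameters and the toy series are hypothetical, and a real analysis would need far more samples and a careful choice of radii.

```python
import math

def delay_embed(x, dim, tau):
    """Build delay vectors (x[i], x[i+tau], ..., x[i+(dim-1)*tau])."""
    n = len(x) - (dim - 1) * tau
    return [tuple(x[i + k * tau] for k in range(dim)) for i in range(n)]

def correlation_sum(points, r):
    """Fraction of distinct point pairs whose Euclidean distance is below r."""
    n = len(points)
    close = 0
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < r:
                close += 1
    return close / (n * (n - 1) / 2)

# Hypothetical toy series standing in for a voice signal.
series = [math.sin(0.3 * n) for n in range(200)]
points = delay_embed(series, dim=3, tau=5)

# The dimension estimate is the slope of log C(r) against log r at small r;
# only two radii are evaluated here to keep the example short.
c_small = correlation_sum(points, 0.2)
c_large = correlation_sum(points, 0.4)
slope = (math.log(c_large) - math.log(c_small)) / (math.log(0.4) - math.log(0.2))
```

Note that no cycle boundaries are identified at any point, which is the potential advantage mentioned above.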

This limit of about 5% in jitter measurements, related to period extraction methods, seriously restricts the clinical application of acoustic analysis, e.g. when attempting to gauge pre- versus post-treatment change, as valid measures of severity may not be available for patients who initially present with moderate-to-severe dysphonia. This limit therefore deserves critical analysis, particularly since there are indications that some acoustic analysis programs can be applied to quite irregular voices, such as substitution voices [11].

We therefore propose, as a first step, to investigate the ability of human visual perception to recognize patterns in perturbed speech signals. The limit of this capacity can be defined as the level of jitter up to which experts, rating separately, agree on the boundaries of successive cycles. This is the minimum level that an analysis program should be able to handle. Such an experiment requires a reliable reference, that is, a wide range of voice signals whose jitter is known exactly. Recently, Fraj et al. [12], [13], [14] have developed a generator of synthetic deviant voices (sustained /a/) producing signals that expert raters cannot distinguish from true pathological voices. This makes it possible to control the parameters of the signal, and in particular the amount of jitter introduced. Exploring the ability of human visual perception to recognize cyclic patterns in this type of signal is expected to provide relevant information about the degree of perturbation up to which computing jitter makes sense.

However, as a second step, even when different raters disagree to some extent about the durations of individual periods, it is worthwhile to consider the jitter value computed from their ratings and to compare it with the true value introduced by the synthesizer.
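
This second step can be sketched as follows, assuming jitter % is taken as the mean absolute difference between consecutive periods relative to the mean period. The period values (in samples) and the variable names are hypothetical and serve only to show how a rater-derived jitter value would be compared with the known synthesized one.

```python
def jitter_percent(periods):
    """Mean absolute consecutive period difference, in % of the mean period."""
    if len(periods) < 2:
        raise ValueError("at least two periods are required")
    diffs = [abs(b - a) for a, b in zip(periods, periods[1:])]
    mean_abs_diff = sum(diffs) / len(diffs)
    mean_period = sum(periods) / len(periods)
    return 100.0 * mean_abs_diff / mean_period

# Hypothetical data: period durations known from the synthesizer, and the
# durations one rater marked visually on the same signal.
true_periods = [100, 104, 98, 102, 96]
rated_periods = [100, 105, 97, 102, 96]

true_jitter = jitter_percent(true_periods)    # 5.0
rated_jitter = jitter_percent(rated_periods)  # 6.0
deviation = rated_jitter - true_jitter        # in percentage points
```

In this toy case the rater misplaces a single boundary by one sample, yet the computed jitter shifts by a full percentage point, which illustrates why the comparison with the true value matters.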

Section snippets

Synthesis of deviant voices

The synthesis of the disordered voices involves four stages that are, first, the generation of a sinusoidal driving function the instantaneous frequency of which is disturbed to simulate vocal frequency jitter; second, the modelling of the glottal area via a pair of polynomial distortion functions into which the (pseudo-)harmonic driving function is inserted; third, the generation of the airflow rate at the glottis, including acoustic tract-source interactions, via an algebraic model and,
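
The first of these stages, a sinusoidal driving function with a disturbed instantaneous frequency, can be illustrated by accumulating phase sample by sample while randomly perturbing the frequency. This is only a sketch of the general idea, not the synthesizer of Fraj et al.; the Gaussian perturbation, the parameter `jitter_std` and the function name are assumptions, and `jitter_std` is not calibrated to a jitter percentage.

```python
import math
import random

def driving_function(f0, fs, n_samples, jitter_std=0.0, seed=0):
    """Sine whose instantaneous frequency is f0 scaled by (1 + noise).

    jitter_std is the standard deviation of the relative frequency
    perturbation applied at each sample (hypothetical control parameter).
    """
    rng = random.Random(seed)
    phase = 0.0
    samples = []
    for _ in range(n_samples):
        samples.append(math.sin(phase))
        inst_freq = f0 * (1.0 + jitter_std * rng.gauss(0.0, 1.0))
        phase += 2.0 * math.pi * inst_freq / fs  # integrate frequency
    return samples

clean = driving_function(100, 8000, 160)
perturbed = driving_function(100, 8000, 160, jitter_std=0.05, seed=1)
```

Because the perturbation acts on the phase increment, the waveform stays continuous while its cycle lengths vary.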

Results and discussion

Fig. 3 presents the variation coefficients for the 3 raters for each of the 13 levels of jitter. For each level of jitter 39 period durations are rated. It appears that the agreement is excellent up to level 5, and that the degree of disagreement exponentially increases from level 6 to level 13. The intrarater variability (forward/backward) is 0% for level 1 and 8.4% on average for level 13.
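
As an illustration of how such variation coefficients can be obtained, the sketch below computes, for each rated period, the standard deviation across raters as a percentage of their mean, and then summarizes one jitter level by the median. The use of the sample standard deviation, the function name and the ratings are assumptions; the text shown here does not specify these details.

```python
import statistics

def variation_coefficient(values):
    """Sample standard deviation as a percentage of the mean."""
    return 100.0 * statistics.stdev(values) / statistics.mean(values)

# Hypothetical ratings: each tuple holds the duration (in samples) that
# three raters assigned to the same period.
ratings = [
    (100, 100, 100),  # full agreement
    (90, 100, 110),   # strong disagreement
    (98, 100, 102),   # mild disagreement
]

coefficients = [variation_coefficient(r) for r in ratings]
median_coefficient = statistics.median(coefficients)
```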

In Fig. 4 two series of median values of variation coefficients are compared for the different levels of

Conclusion

According to a widely accepted guideline, in sustained vowels of dysphonic voices, only perturbation measures less than about 5% are reliable: this is related to period extraction methods. The present experiment demonstrates that – on signals of high-quality synthetic deviant voices (sustained vowels) – different raters are able to identify in a very consistent way the period durations of successive cycles up to values of about 13% jitter. Furthermore, even for higher values – over 30% – the

