Keywords

1 Introduction

In human oral communication, the sounds where the vocal cords vibrate show a quasi-periodic pattern in their acoustic waveforms. The origin of this periodicity is found in the alternating opening and closing of the vocal folds at the glottis, allowing a pulse-like flow of air coming from the lungs to travel forward and create sound [1]. The determination of glottal pulse boundaries is a common problem in several speech processing tasks, either oriented for healthy (e.g. singing [2], fluent speaking [3, 4] and similar uses) or pathologic speech [5]. The methods used for determining the pulse location are specific types of Pitch Detection Algorithms (PDAs), working on a cycle-by-cycle basis [6].

When facing pathological voices, a degradation in the PDAs performance appears [7, 8], due to the higher levels of periodicity perturbations present. There are dozens of PDAs to choose from, and selecting a particular approach should be based on thorough performance comparisons [9,10,11,12]. Key in this evaluation is the selection of the measures of PDAs performance extracted.

The most commonly used measure of PDA performance is to compare the pulse’s durations variability in the detected sequence Td(n) with the variability of the reference, known sequence Tr(n). Variability of the pulse duration contour is a concept denoted as jitter in the Voice Measurement literature [13], and there are several expressions proposed to measure it [14], among which a representative ones is:

$$ \alpha = \frac{1}{N - 1}\sum\limits_{n = 1}^{N - 1} {\frac{{\left| {T\left( {n + 1} \right) - T\left( n \right)} \right|}}{{0.5 * \left( {T\left( {n + 1} \right) + T\left( n \right)} \right)}}} * 100 $$
(1)

The difference |αr − αd|, i.e. the values obtained when using Tr(n) and Td(n) on Eq. (1), is widely reported as a measure of PDA accuracy [8,9,10,11,12]. However, this practice has been heavily criticized [7, 15, 16], since equal variability of two sequences does not imply sequences equality. An alternative approach suggested since [7] is to use an inter-sequence variability measure, termed β as a successor to α in Eq. (1). The expression for β used in [7] was slightly modified to closely resemble α in [15], as:

$$ \beta = \frac{1}{N}\sum\limits_{n = 1}^{N} {\frac{{\left| {T_{d} - T_{r} \left( n \right)} \right|}}{{T_{r} \left( n \right)}}} * 100 $$
(2)

However, the internal functioning of the PDAs frequently produces Tds which cannot be related to Tr by simply iterating through n, the pulse index. A Dynamic Time Warping (DTW) of both glottal pulse lengths sequences, Tr(n) and Td(n), was proposed [15] in order to correct the misalignments described, for a better evaluation of the performance of the PDAs. The DTW by itself produced several new performance measures by counting the amount of the different types of misalignment. The DTW procedure described in [15] actually detected the different types of misalignment between both sequences, but, as will be shown in the next section, the value of β produced failed to represent the one corresponding to the aligned sequences.

In this paper we describe the problem present in the reported DTW, we introduce the required modifications to solve it, and perform some experiments showing the relevance of the correction.

2 Existing DTW Procedure: Problem and Solution Proposed

A flowchart depicting the DTW procedure proposed in is shown in the left panel of Fig. 1. The algorithm departs from the sequences of reference and detected pulse positions, Pr(n) and Pd(n), from which the sequences of pulse lengths Tr(n) and Td(n) are obtained by a difference (discrete derivative) operation. The additional measures of performance produced by the DTW algorithm are the number of significant/gross errors (GE) between both aligned sequences, the number of pulse insertions (PI) and deletions (PD) and the number of contour shifts to the left (SL) or to the right (SR) of the detected pulse boundaries Pd(n) as compared to the reference, hand-marked contour Pr(n). A copy of the detected pulse length contour is made to store the dynamically aligned contour, labeled TdAl.

Fig. 1.
figure 1

Flowchart used for calculating the measures of performance as reported in [15] (left panel) and here (right panel). Basic inputs are the reference and detected pulse boundary positions Pr(n) and Pd(n). Bounding labeled squares indicate the relevant stages on a DTW procedure. Positive outcomes in the conditional blocks have been signaled by a black circle in the corresponding corner of the rhombus. Notice that the “Alignment” block has been replaced by suppression of pointed-by (^) elements in the processing of PI, PD, SR & SL conditions.

The DTW procedure works by incrementally moving through the contours by using separate vector indexes for both, namely nr and nd. Both indexes compose a bi-dimensional grid of points (nr, nd) where the best path for alignment is searched for using some heuristic constraints.

Regarding the Global Constraints for the DTW, the alignment of length values (Ts) can only be performed if position values (Ps) are actually related in time. If for a particular position this is not met, then a pitch mark deletion (PD) or insertion (PI) occurred. Both PD and PI counts are metrics available from the DTW procedure. The conditions for the occurrence of PD and PI are:

$$ \begin{array}{*{20}c} {PD\text{? = }P_{d} \left( {n_{d} - 1} \right)\;\text{ > }\;P_{r} \left( {n_{r} + 1} \right)} \\ {PI\text{? = }P_{d} \left( {n_{d} + 1} \right)\;\text{ < }\;P_{r} \left( {n_{r} - 1} \right)} \\ \end{array} $$
(3)

If Global Constraints are not violated, Local Constraints were checked on whether the length contours require alignments, either a shift to the left (SL) or to the right (SR) in the (nr, nd) path. A first check is that the contours are not similar at their current positions in the grid (nr, nd), for which the occurrence of a Gross Error (GE) is evaluated. An auxiliary large-difference/error function between Tr(nr) and Td(nd) for arbitrary displacements a and b, respectively, was defined in [15] for determining the presence of GE,, SL and SR. This Boolean function D(a,b) checks if the difference between the contours at certain positions (displaced a and b with respect to the current positions nr and nd, respectively) exceeds a given threshold Thr:

$$ D\left( {a,b} \right)\text{?} = \left| {T_{r} \left( {n_{r} + a} \right) - T_{d} \left( {n_{d} + b} \right)} \right|\text{ > }Thr $$
(4)

The presence or absence of a GE is given by checking for D at the current coordinates of the search grid:

$$ GE\text{?} = D(0,0)\text{?} $$
(5)

And then, the need to perform an SL or SR is given by the expressions in (6), where logical complements have been denoted by bars above the respective terms:

$$ \begin{array}{*{20}c} {SL\text{? = }\left( {\overline{{D\text{?}\left( { - 1,0} \right)}} \;{\& }\;\overline{{D\left( {0,1} \right)}} } \right)\,\;{\& }\;\left( {\overline{{D\text{?}\left( {0, - 1} \right){\& }\overline{{D\left( {1,0} \right)}} }} } \right)} \\ {SR\text{? = }\left( {\overline{{D\text{?}\left( {0, - 1} \right)}} \;{\& }\;\overline{{D\left( {1,0} \right)}} } \right)\;{\& }\;\left( {\overline{{D\text{?}\left( { - 1,0} \right){\& }\overline{{D\left( {0,1} \right)}} }} } \right)} \\ \end{array} $$
(6)

The GE measure is common in short-term PDA comparisons, from where its name was taken [17, 18]. The value of Thr has been expressed in the literature either in time units [17, 19, 20] or in percentage of the average Tr [18, 21, 22]. The latter approach was used in [15], choosing a value of Thr equal to 3% of the average Tr.

After all Global and Local Constraints have been checked, and the indexes nr and nd properly modified, TdAl is correspondingly modified (if needed) to better match Tr, and the algorithm proceeds to check the contour at the next values of nr and nd. The DTW ends whenever any of the contours contour reaches its end as indexed by the current values of nr and nd. The alignment performed on TdAl consists in replacing its value in index nr with the value of Td(nd).

2.1 Modification to Make Alignment Bidirectional

The alignment performed in [15] could be considered, however, incomplete, due to this unilateral alignment: only Td is modified to correspond to Tr. In this way, there will be indexes in TdAl with no corresponding aligned values in Tr. In this paper we slightly modify the DTW procedure so that the alignment occurs bilaterally: from Tr indexes to TdAl (as the original) but also from Td indexes to an aligned version of Tr, namely TrAl. In this way, TdAl and TrAl are better suited as aligned detected and reference contours, to evaluate an inter-contour variability measure, than the previously used pair TdAl and Tr. The modified flow diagram of the DTW procedure used here is shown in the right panel of Fig. 1. There are two main differences between the alignments performed on both methods, represented in left and right panels:

  • First, the alignment of the contours occurs in this case not after all constraints checking were performed (“Alignment” section in left panel), but within the actions taken for a particular condition producing a modification of either index nr or nd (i.e. the occurrence of a PI, PD, SL or SR in right panel).

  • Second: The alignment is not performed here by assignment, but by suppression of the value without a corresponding element in the companion contour. The element to suppress in each case is pointed to by the ^ sign within the square encasing the actions corresponding to the particular condition (right panel).

In short, the element [nd − (SL + PI)] in TdAl is suppressed every time a PI or SL occurs, while the element [nr − (SR + PD)] in TrAl is suppressed every time a PD or SR occurs. These actions are in line with the classical ‘insertion’ and ‘deletion’ operations defined for string sequences in [23]. The third operation (‘change’) is still present when a GE occurs without an SL or SR condition met.

3 Experiments Performed

Both alignment procedures were evaluated by applying them to reference and detected contours corresponding to synthetic and real signals. Synthetic signals allow introducing periodicity perturbations of any magnitude desired, while real signals serve as validation, in case the synthesis procedure were to be contested.

Three well known PDAs were used, so that different performances are to be expected and tested. Among the PDAs are the ones included by default in the freely available Praat system [24], namely the Peak Picking (P-P) and the Cross Correlation (C-C) based methods. Praat is used as a reference in the evaluation of other systems and methods in [9, 12, 25,26,27,28,29], among many other studies. A third PDA evaluated is the Super-Resolution (S-R) method described in [30], which has shown very good results elsewhere [7, 15, 18]. The internal implementation of these PDAs is out of the scope of this paper, however, they are a representative sample of the best performing and more frequently used PDAs, as the previous references may prove.

3.1 Synthetic Signals

Synthetic signals were obtained according to the source filter model [31]. The standard pulse duration TE was chosen to correspond to a frequency of 150 Hz, with sampling frequency of 22050 Hz. A single glottal pulse waveform was generated according to the polynomial model type C in [32], with rising time of 0.33 TE and falling time of 0.09 TE, which reportedly produced the most natural-sounding synthesis.

A vocal tract configuration corresponding to a vowel “a” with resonances defined by the frequencies, amplitudes and bandwidths already used in [7, 15, 30, 33,34,35]. Periodicity perturbations were introduced in seven levels of perturbation, ranging from quasi-periodic to highly aperiodic signals, incorporating random jitter and shimmer (pulse amplitude variability) values per individual pulse up to a maximum % as given in Table 1. Additive noise is added such that the signal-to-noise ratio (SNR) expressed in dB is also given in the table.

Table 1. Values of the individual perturbations per level, combined in the synthetic signals.

A total of 300 contiguous pulses, corresponding to roughly two seconds of signal, are synthesized for each perturbation level, with the reference pulse boundaries Pr(n) available from the synthesis procedure. More details in the synthesis procedure can be found in the references provided.

3.2 Real Signals

The hand-marked pulse positions from the 29 pathological signals used in [15] were available for this paper as Pr(n). The acoustic waveforms are available from the Massachusetts Eye and Ear Infirmary Database [36]. The hand-marking includes 4695 pulse markers, with average F0 of 161 Hz, close to the value in the synthetic signals.

3.3 Evaluation

Both alignment procedures, as depicted in left and right panels of Fig. 1, were applied to the available reference Pr(n) and detected Pd(n) resulting from the different PDAs. In the case of the synthetic signals, 100 realizations were obtained for each level, to report averaged measures.

All methods were programmed in MatLab R2017, and a script calling Praat’s PDAs was executed within the MatLab environment. Three results are to be reported for β values: the first being value obtained for non aligned contours, denoted simply as β. The second is the value obtained from the pair of sequences TdAl and Tr, following the procedure in [15] and shown in the left panel of Fig. 1, and denoted βO. Finally, the third value is obtained from applying Eq. (2) to the sequences TdAl and TrAl, obtained according to the corrections described here and represented in the right panel of Fig. 1 and denoted as βC.

4 Results and Discussion

4.1 Synthetic Signals

As mentioned in description of the synthetic signals, 100 realizations of the two second synthesis were obtained. The pulse sequences were aligned to the reference ones, and the 100 Beta values corresponding to each level per PDA were averaged. These values are shown in Table 2.

Table 2. Values of the three alternatives for β obtained for the three PDAs on synthetic signals

A first conspicuous result is that the S-R PDA greatly outperforms both Praat based variants. In fact, in the most degraded level the P-P variant did not report an average value, since some sequences returned empty from the Praat system, and for the 5th and 6th perturbation level the variability was higher than 100%. But PDAs performance is not the main subject of analysis here, but the behavior of our proposed modifications in the value of βC.

With respect to this issue, it is noticeable the manifest tendency for reduction from the un-aligned β, to the aligned βO, to our alignment proposal βC. This tendency is of course larger for the worst PDAs, where the occurrence of misalignments is expected to be higher. But, even for the almost-unaffected S-R PDA, the values of βC are always equal or lower than βO.

4.2 Real Signals

The average values for the three variants of β obtained for the 29 signals are shown in Table 3, with rows corresponding to each PDA considered.

Table 3. Averaged values of the three alternatives for β obtained for the three PDAs on the 29 hand-marked signals

The tendency to obtain reduced inter-contour variability as we move from the non aligned variant to the original alignment described in [15] and then to the corrections of the alignment described here is manifest for the three PDAs. Here again the Praat based PDAs show the poorest performance, reinforcing the need to check for tweaking the default settings. However, the PDA performance is not our objective in this paper. More interesting results the fact that the reduction is more noticeable in the C-C PDA. It seems that previous β and βO values were more affected by insertion and deletions of pulses in C-C than by the magnitudes of the GE errors, the opposite of the P-P variant, which shows a smaller reduction towards βC.

5 Conclusions

A correction to the DTW procedure proposed in [15] has been described, which allows to perform a more realistic comparison of the PDAs performances in terms of β, the variability between the reference and detected pulse lengths contours which can be credited completely to the functioning of the PDA.

The other measures of performance produced by the DTW algorithm (GE, PI, PD, SR & SL) are unaffected by the corrections made, so any previous result obtained with the uncorrected DTW method hold for all these measures. The ability of the correction to reduce the described inflation of β was shown in both real and synthetic signals. The modification introduced is simple to implement, and the authors will make public the code in the near future.