Elsevier

Journal of Phonetics

Volume 31, Issue 1, January 2003, Pages 39-62
On the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish

https://doi.org/10.1016/S0095-4470(02)00071-2

Abstract

This study shows that increases in overall intensity and spectral emphasis are reliable acoustic correlates of focal accents in Swedish. They are both reliable in the sense that there are statistically significant differences between focally accented words and nonfocal ones for a variety of words, in any position of the phrase and for all speakers in the analyzed materials, and in the sense of their being useful for automatic detection of focal accents. Moreover, spectral emphasis turns out to be the more reliable correlate, as the influence on it of position in the phrase, word accent and vowel height was less pronounced and as it proved a better predictor of focal accents in general and for a majority of the speakers. Finally, the study has resulted in data for overall intensity and spectral emphasis that might prove important in modeling for speech synthesis.

Introduction

This study deals with the acoustic signaling of focal accent in Swedish, and in particular with the reliability of two acoustic features, overall intensity and spectral emphasis, that have been mentioned among the acoustic correlates of focal accents. ‘Focal accent’ is the term used in the Swedish intonation model for an accent signaling that a word (or some other constituent within a phrase, which may be smaller or larger) is ‘focused’ or ‘in focus’ (Bruce, 1977; Bruce & Gårding, 1978; Gårding & Bruce, 1981; Bruce, Granström, Gustafson, Horne, House, & Touati, 1997; Bruce, 1999). Overall intensity and spectral emphasis, furthermore, represent two different operationalizations of loudness. Overall intensity, as the name suggests, is the intensity (or SPL) of the whole spectrum, as opposed to spectral emphasis, which may be described as the relative intensity in the higher frequency bands. Two aspects of the reliability of these acoustic correlates will be considered. The first is an investigation of whether there are statistically significant differences between focally accented and nonfocal words in paradigmatic (or between-phrase) comparisons. The second is to explore the usefulness of these correlates for the detection of focally accented words within phrases, i.e., in syntagmatic comparisons.

It is generally agreed that the most important and reliable acoustic correlates of accents marking focus in languages such as English, Dutch and Swedish are fundamental frequency (f0) movements (e.g., Bolinger, 1958; Fry, 1958; van Katwijk, 1974; Bruce, 1977; Beckman, 1986; ’t Hart, Collier, & Cohen, 1990) and prolonged segmental durations (e.g., Cooper, Eady, & Mueller, 1985; Eefting, 1991; Fant, Kruckenberg, & Nord, 1991; Sluijter & van Heuven, 1995; Cambier-Langeveld & Turk, 1999; Turk & White, 1999; Heldner & Strangert, 2001). At the same time, some kind of loudness variation is also intuitively felt to be part of the signaling of prominence distinctions (cf. Lehiste & Peterson, 1959). Indeed, increases in loudness, as measured using several different operationalizations such as overall intensity (e.g., Fry, 1955), intensity summed over time (Beckman, 1986), spectral tilt (Sluijter, Shattuck-Hufnagel, Stevens, & van Heuven, 1995), and spectral balance (Sluijter & van Heuven, 1996), have also been shown to be reliable acoustic correlates of accents.

Thus, f0 and duration, as well as the different operationalizations of loudness are all potentially useful for automatic detection of accented words. In fact, systems for automatic classification of prosodic categories, including detection of accented words, typically use some combination of duration, f0 and overall intensity (or energy) features (e.g., House & Bruce, 1990; Campbell, 1992; Campbell, 1994; Wightman & Ostendorf, 1994; Sautermeister & Lyberg, 1996; Ostendorf & Ross, 1997; Nöth, Batliner, Kießling, Kompe, & Niemann, 2000; Shriberg, Stolcke, Hakkani-Tür, & Tür, 2000). Although less frequent, various features related to the slope of the spectrum (e.g., spectral balance, spectral emphasis or spectral tilt) have also been exploited for automatic detection of prominence distinctions (e.g., Campbell, 1995; Sluijter et al., 1995; Sluijter & van Heuven, 1996; van Kuijk & Boves, 1999).

Just as there are several terms to denote the phenomena related to the slope of the spectrum (i.e., spectral balance, spectral emphasis, and spectral tilt), there are several methods for measuring them. Furthermore, there seems to be no consensus as to which term is to be associated with which method. Therefore, it is tentatively proposed that there are two classes of measures, which will be referred to as ‘spectral tilt’ and ‘spectral emphasis’. ‘Spectral tilt’ will be used for measures explicitly representing the slope of the spectrum, while ‘spectral emphasis’ will be used for measures of the relative energy in the higher-frequency bands, or, put differently, the relative contribution of the high-frequency parts of the spectrum to the overall intensity. Although the two classes are related to each other, spectral emphasis is—as will be shown below—distinct from spectral tilt in several respects, a salient one being that an increase in spectral emphasis results in a decrease in spectral tilt.

A commonly used measure of spectral tilt is the difference (in dB) between the first harmonic (H1) and the strongest harmonic in the third formant peak (A3) with corrections (marked by asterisks) for the influence of the first formant on H1 and of the first and second formants on A3. This spectral tilt measure is thus defined as H1*−A3* (e.g., Stevens & Hanson, 1994; Sluijter et al., 1995). A related estimate of spectral tilt is the difference between the first and second harmonics (H1−H2) (Jackson, Ladefoged, Huffman, & Antoñanzas-Barroso, 1985; Titze & Sundberg, 1992; Campbell, 1995; Campbell & Beckman, 1997).

Several measures fall into the spectral emphasis category. In the influential work by Sluijter & van Heuven (1996) a measure called ‘spectral balance’ was defined as the intensity in four contiguous frequency bands: 0–0.5, 0.5–1, 1–2, 2–4 kHz. Moreover, an estimate referred to as ‘spectral tilt’ and used in recent studies by Fant and colleagues (Fant, 1997; Fant, Kruckenberg, & Liljencrants, 2000a; Fant, Kruckenberg, Liljencrants, & Hertegård, 2000c) is the difference (in dB) between signals with a high frequency pre-emphasis and a flat frequency weighting (defined as SPLH-SPL). Several authors have also measured spectral emphasis as the difference between the overall intensity and the intensity in a low-pass-filtered signal (e.g., Childers & Lee, 1991; Campbell, 1995; Traunmüller, 1997; Traunmüller & Eriksson, 2000). The latter methods differ mainly in the low-pass filter cut-off frequency.

Several spectral emphasis measures of the last-mentioned type were also used in a previous study of our own (Heldner, Strangert, & Deschamps, 1999). These measures included one calculating the difference (in dB) between the overall intensity and the intensity in a signal that was low-pass filtered at 1.5 times the f0 mean for each utterance (as was also done in Traunmüller, 1997; Traunmüller & Eriksson, 2000). The other measures were inspired by the work of Sluijter & van Heuven (1996). In these measures, too, the difference between the overall intensity and the intensity in a low-pass filtered signal was calculated, but fixed low-pass filters with cut-off frequencies at 0.5, 1 and 2 kHz were used. The rationale behind a filter cut-off frequency at 1.5 times f0 is to ‘separate’ the fundamental from the rest of the harmonics (the second harmonic being at 2 times f0) and to obtain a normalized measure of the energy in the higher frequency bands. (Strictly speaking, however, the filter has a slope of 12 dB/octave, so it only attenuates the higher harmonics; the second harmonic in particular will be included to some extent.) However, determining the low-pass filter from the f0 mean of a whole utterance does not seem altogether satisfactory. Where f0 is below the f0 mean of the whole utterance, more energy than just the fundamental will pass through the filter, resulting in a lower spectral emphasis value. Similarly, where f0 is above the mean, the result will be a higher value. To overcome this problem, we have developed a new and fully automatic technique for measuring spectral emphasis, applying a dynamic low-pass filter with a cut-off frequency following the course of the fundamental frequency. This technique will be described in more detail below (Section 2.2).
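The f0 dependence described above can be illustrated with a minimal sketch. It is not the tool used in the study: the second-order Butterworth design, the synthetic harmonic tone with amplitudes falling as 1/k, the 16 kHz sampling rate and all function names are assumptions made for illustration. Spectral emphasis is computed as the difference (in dB) between the overall intensity and the intensity of the low-pass-filtered signal. Two steady tones with the same spectral shape but different f0 are measured, once with a cut-off fixed at 1.5 times an assumed utterance mean of 125 Hz, and once with a cut-off at 1.5 times each tone's own f0 (a crude stand-in for a cut-off that follows f0).

```python
import math


def lowpass_biquad(signal, fs, cutoff_hz):
    """Second-order Butterworth low-pass (roll-off of about 12 dB/octave)."""
    w0 = 2.0 * math.pi * cutoff_hz / fs
    cos_w0, sin_w0 = math.cos(w0), math.sin(w0)
    alpha = sin_w0 / math.sqrt(2.0)  # Q = 1/sqrt(2) gives a Butterworth response
    a0 = 1.0 + alpha
    b0 = (1.0 - cos_w0) / 2.0 / a0
    b1 = (1.0 - cos_w0) / a0
    b2 = b0
    a1 = -2.0 * cos_w0 / a0
    a2 = (1.0 - alpha) / a0
    out = []
    x1 = x2 = y1 = y2 = 0.0
    for x in signal:
        y = b0 * x + b1 * x1 + b2 * x2 - a1 * y1 - a2 * y2
        out.append(y)
        x2, x1 = x1, x
        y2, y1 = y1, y
    return out


def intensity_db(signal):
    """Mean-square level in dB (arbitrary reference; it cancels in differences)."""
    power = sum(x * x for x in signal) / len(signal)
    return 10.0 * math.log10(power)


def spectral_emphasis_db(signal, fs, cutoff_hz):
    """Overall intensity minus intensity of the low-pass-filtered signal."""
    return intensity_db(signal) - intensity_db(lowpass_biquad(signal, fs, cutoff_hz))


def harmonic_tone(f0, fs=16000, duration=0.2, n_harmonics=10):
    """Synthetic steady tone: harmonics k*f0 with amplitude 1/k (illustrative)."""
    n = int(fs * duration)
    return [
        sum(math.sin(2.0 * math.pi * k * f0 * t / fs) / k
            for k in range(1, n_harmonics + 1))
        for t in range(n)
    ]


if __name__ == "__main__":
    fs = 16000
    utterance_mean_f0 = 125.0  # assumed mean of the two tones below
    for f0 in (100.0, 150.0):
        tone = harmonic_tone(f0, fs)
        fixed = spectral_emphasis_db(tone, fs, 1.5 * utterance_mean_f0)
        following = spectral_emphasis_db(tone, fs, 1.5 * f0)
        print(f"f0={f0:.0f} Hz  mean-based: {fixed:.2f} dB  f0-following: {following:.2f} dB")
```

With the mean-based cut-off, the tone with the lower f0 yields the lower value although its spectral shape is identical; with the cut-off tied to each tone's own f0, the two values nearly coincide.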

Although several acoustic features have been shown to be reliable correlates of accentuation, and thus also potentially useful for automatic detection, this investigation has been restricted to the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish. One approach to this subject is paradigmatic (or between-phrase) comparisons of focally accented and nonfocal words. If the correlates are to be considered reliable, these comparisons should establish statistically significant differences between focal and nonfocal words. Previous work in this area includes a series of studies by Fant and his associates. Fant et al. (2000a) recently summarized their own work on acoustic correlates of prominence in Swedish in general and of focal accents in particular. Regarding the correlates of interest in the present study, they reported the gain in overall intensity (or SPL) in focally accented words compared to nonfocal to be in the order of 4–6 dB. The corresponding gain in their measure of ‘spectral tilt’ (SPLH-SPL) was in the order of 2–3 dB. These results were based on five speakers’ readings of a five-word sentence occurring in six versions, one of which had a neutral reading and the rest a systematically varied focal accent distribution. Fant et al. (2000a) concluded that overall intensity and spectral tilt (i.e., SPLH-SPL) are fairly reliable correlates of focal accents in Swedish. In the present study, additional data for nonfocal and focally accented words were collected using a larger and more varied material.

It is well known that the overall intensity of the human voice increases with fundamental frequency, at least up to a mid-frequency of the speaker's f0-range (e.g., Fant et al., 2000a; Fant, Kruckenberg, & Liljencrants, 2000b). For example, an increase in fundamental frequency of six semitones is typically accompanied by an increase in overall intensity of about 6 dB, mainly due to increased voice source amplitude and a larger number of excitations per second. Conversely, a decrease in fundamental frequency is typically accompanied by decreased overall intensity. Pierrehumbert (1979) observed that the general downdrift of the fundamental frequency over the course of an intonation group (a tendency that has been observed in many languages) was accompanied by a downdrift in overall intensity of 3–4 dB. Thus, there may be an influence (at least an indirect one) of position on overall intensity and possibly also on spectral emphasis. Moreover, given the covariation of overall intensity and fundamental frequency, it also seems warranted to examine if the differences in f0 patterns between pre- and post-focal words in Swedish, that is, a compressed pitch range after the focal accent (Bruce, 1982), are reflected in the overall intensity and spectral emphasis patterns.
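For readers less used to these units, the following small sketch converts between frequency ratios and semitones and between amplitude ratios and decibels. The function names are illustrative only. A rise of six semitones is half an octave (a frequency ratio of the square root of 2), and a gain of about 6 dB corresponds to a doubling of sound pressure amplitude.

```python
import math


def semitones(f_from_hz, f_to_hz):
    """Interval between two frequencies in semitones (12 per octave)."""
    return 12.0 * math.log2(f_to_hz / f_from_hz)


def amplitude_ratio_to_db(ratio):
    """Level difference in dB for a given sound pressure amplitude ratio."""
    return 20.0 * math.log10(ratio)


def db_to_amplitude_ratio(level_db):
    """Sound pressure amplitude ratio for a given level difference in dB."""
    return 10.0 ** (level_db / 20.0)


if __name__ == "__main__":
    # Six semitones above 100 Hz, and the amplitude ratio behind a 6 dB gain.
    print(semitones(100.0, 100.0 * math.sqrt(2.0)))
    print(db_to_amplitude_ratio(6.0))
```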

For this reason, besides treating the effects of focal accents, we will also touch upon the possible influence of position on overall intensity and spectral emphasis; that is, position of the focally accented word in the phrase and position and distance of nonfocal words relative to the focally accented word. If the correlates are to be considered reliable, there should be significant differences between focal and nonfocal words in all positions in the phrase. Moreover, if positional influences do exist, they might prove important in modeling for synthesis. Therefore, the results from different positions will be presented separately both in the paradigmatic comparisons and in the detection experiment.

Another approach to studying the reliability of overall intensity and spectral emphasis as acoustic correlates is investigating to what extent focally accented words may be detected automatically on the sole basis of these correlates. Given such an approach, a high degree of correct detections will obviously have to be taken to indicate high reliability. The work on automatic detection of focal accents in Swedish using overall intensity and spectral emphasis was initiated by Heldner et al. (1999) in a study where several measures of overall intensity and spectral emphasis were evaluated. As mentioned earlier, these spectral emphasis measures were all calculated as the difference (in dB) between the overall intensity and the intensity in a low-pass-filtered signal, and differed only in the choice of low-pass filter cut-off frequency. One of the measures used a low-pass filter at 1.5 times the f0 mean for each utterance, and the others used fixed low-pass filters with cut-off frequencies at 0.5, 1 and 2 kHz, respectively. These experiments showed that overall intensity generally scored better than the different spectral emphasis measures. Moreover, the spectral emphasis measures using low-pass filters adjusted to the f0 mean of the utterance resulted in more correct detections than those using fixed cut-off frequencies. However, as noted above, none of these spectral emphasis measures seemed satisfactory, as they might have been dependent on f0 and might have favored words with higher f0 than the mean and disfavored those with lower f0 than the mean. Although this probably meant favoring focally accented words, it might also have favored words in phrase initial position and disfavored final words given a general declining trend in f0 over the course of the utterance.

A solution to this problem would be a dynamic low-pass filter with a cut-off frequency following the course of the fundamental frequency. Using a dynamic low-pass filter had not been feasible in the previous study (Heldner et al., 1999), for lack of adequate tools. Since then, however, tools using this kind of filter have been developed. In the present study, this improved technique for measuring spectral emphasis was used for revisiting automatic detection of focal accents in Swedish. In addition, we wished to test whether this new technique yields higher recognition scores than the previous method and, moreover, whether overall intensity is a better predictor than the improved spectral emphasis measure.

To summarize, then, the primary aim of this study is to assess the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish. This problem is approached from two angles. The first consists in paradigmatic comparisons of nonfocal and focally accented words, using statistical methods to assess the reliability of the correlates. Here, for the correlates to be considered reliable, the experiment must establish statistically significant differences between focal and nonfocal versions of words for all speakers, for all words and in all positions in the phrase. The second approach to the reliability of overall intensity and spectral emphasis is to investigate to what extent focally accented words may be detected automatically using these correlates. More exactly, what was being evaluated here was the usefulness of overall intensity and an improved spectral emphasis measure as predictors in an automatic focal accent detector for Swedish. If the correlates are to be considered reliable, automatic detection using these correlates should yield a fairly high degree of correct detections. A secondary aim of this research is to collect data for overall intensity and spectral emphasis to be used in modeling for speech synthesis.
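One simple way to picture the second, syntagmatic approach is sketched below. The detector actually used in the study is not described in this excerpt, so the decision rule here (pick the word with the highest value of the correlate within each phrase) is an assumption, as are the function names; the numbers are made up purely for illustration and are not results from the study.

```python
def detect_focal_word(values):
    """Index of the word with the highest correlate value in one phrase.

    Assumed rule for illustration: a within-phrase (syntagmatic) argmax.
    """
    return max(range(len(values)), key=lambda i: values[i])


def detection_rate(phrases):
    """Proportion of phrases whose detected word is the labeled focal word.

    `phrases` is a list of (values, focal_index) pairs, one per phrase.
    """
    hits = sum(1 for values, focal in phrases if detect_focal_word(values) == focal)
    return hits / len(phrases)


# Made-up per-word values (e.g., dB) for three three-word phrases,
# each paired with the index of the word labeled as focally accented.
toy_phrases = [
    ([2.1, 5.0, 1.8], 1),
    ([4.2, 2.0, 1.5], 0),
    ([3.0, 2.5, 2.8], 2),
]

if __name__ == "__main__":
    print(detection_rate(toy_phrases))
```

In the third toy phrase the initial word has the highest value although the final word is labeled focal, which mimics the kind of positional bias discussed above.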

Method

Recordings taken from three different sets of phrases were used for both the paradigmatic comparisons and for the detection experiment. However, the material was primarily designed for paradigmatic comparisons. Two of the phrase sets were recorded for a study on temporal effects of focal accents in Swedish (Heldner & Strangert, 2001). A short description of the material and the recording procedures will be provided below. Although the composition of the three sets was different, they all

Results: paradigmatic comparisons

In this section, the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents is assessed by comparing focally accented and nonfocal words from different phrases (i.e., paradigmatic comparisons). Here, to be considered reliable, statistically significant differences between focal and nonfocal words should be established for all words, in all positions in the phrase and for all speakers. In addition, it is examined whether the choice of nonfocal reference

Discussion: paradigmatic comparisons

The first part of the experiment with paradigmatic comparisons has shown that although there were differences among the words, focally accented words were characterized by statistically significant increases in overall intensity and spectral emphasis compared to nonfocal words. Moreover, these effects were found primarily in the vowels in the stressed and unstressed syllables.

Across all 40 words, the increase in overall intensity was about 3 dB both in the stressed vowels and in the unstressed

Results: the detection experiment

In the detection experiment, a different approach is taken to assess the reliability of overall intensity and spectral emphasis as acoustic correlates of focal accents in Swedish. Instead of making paradigmatic comparisons of focally accented and nonfocal words, the reliability is assessed by investigating to what extent it is possible to tell focally accented and nonfocal words apart automatically using these correlates. As noted before, they should yield a high degree of correct detections in

Discussion: the detection experiment

First of all, this experiment has shown that the new method of measuring spectral emphasis improved the detection scores by 12% compared to that used in our previous study (Heldner et al., 1999). Moreover, this new spectral emphasis measure turned out to be a better predictor of focal accents than overall intensity, a result which is not in conformity with that of our previous study. The experiment also showed that the usefulness of overall intensity and spectral emphasis as predictors of focal

General discussion and conclusions

Overall intensity is generally considered a weak prominence cue. Perceptual experiments, including the classic experiments by Fry (1955, 1958), have shown that overall intensity is relatively unimportant as a cue in the perception of stress. More recent work, however, has shown that spectral emphasis is a relevant cue for the perception of lexical stress; it is more reliable than overall intensity and close in strength to duration as a cue of lexical stress (Sluijter, van Heuven, &

Acknowledgements

The research reported here was carried out while I was a guest at the Centre for Speech Technology (CTT) at KTH in Stockholm, an opportunity for which I am extremely grateful. I would also like to thank Eva Strangert, Rolf Carlson, Hartmut Traunmüller, Anders Eriksson, Gunnar Fant, Nick Campbell and two anonymous reviewers for helpful comments and discussion, and Thierry Deschamps for technical assistance. Finally, I would like to thank Hartmut Traunmüller and Anders Eriksson again for

References (49)

  • Bruce, G. (1982). Developing the Swedish intonation model. In Working papers 22, Department of Linguistics, Lund...
  • Bruce, G. Word tone in Scandinavian languages.
  • Bruce, G., & Gårding, E. (1978). A prosodic typology for Swedish dialects. In Nordic prosody, Lund (pp....
  • Bruce, G., et al. On the analysis of prosody in interaction.
  • Campbell, N. (1992). Prosodic encoding of English speech. In Proceedings of the ICSLP 92, Department of Linguistics,...
  • Campbell, N. (1994). Combining the use of duration and f0 in an automatic analysis of dialogue prosody. In Proceedings...
  • Campbell, N. (1995). Loudness, spectral tilt, and perceived prominence in dialogues. In Proceedings of the...
  • Campbell, N., et al. Stress, prominence, and spectral tilt.
  • Childers, D. G., et al. (1991). Vocal quality factors: analysis, synthesis, and perception. Journal of the Acoustical Society of America.
  • Cooper, W. E., et al. (1985). Acoustical aspects of contrastive stress in question–answer contexts. Journal of the Acoustical Society of America.
  • Eefting, W. (1991). The effect of “information value” and “accentuation” on the duration of Dutch words, syllables, and segments. Journal of the Acoustical Society of America.
  • Fant, G. (1960). Acoustic theory of speech production.
  • Fant, G., et al. Acoustic-phonetic analysis of prominence in Swedish.
  • Fant, G., et al. (2000). The source-filter frame of prominence. Phonetica.