Elsevier

Speech Communication

Volume 48, Issues 3–4, March–April 2006, Pages 437-462

Extraction of pragmatic and semantic salience from spontaneous spoken English

https://doi.org/10.1016/j.specom.2005.07.007

Abstract

This paper computationalizes two linguistic concepts, contrast and focus, for the extraction of pragmatic and semantic salience from spontaneous speech. Contrast and focus have been widely investigated in modern linguistics, as categories that link intonation and information/discourse structure. This paper demonstrates the automatic tagging of contrast and focus for the purpose of robust spontaneous speech understanding in a tutorial dialogue system. In particular, we propose two new transcription tasks, and demonstrate automatic replication of human labels in both tasks. First, we define focus kernel to represent those words that contain novel information neither presupposed by the interlocutor nor contained in the preceding words of the utterance. We propose detecting the focus kernel based on a word dissimilarity measure, part-of-speech tagging, and prosodic measurements including duration, pitch, energy, and our proposed spectral balance cepstral coefficients. In order to measure the word dissimilarity, we test a linear combination of ontological and statistical dissimilarity measures previously published in the computational linguistics literature. Second, we propose identifying symmetric contrast, which consists of a set of words that are parallel or symmetric in linguistic structure but distinct or contrastive in meaning. The symmetric contrast identification is performed in a way similar to the focus kernel detection. The effectiveness of the proposed extraction of symmetric contrast and focus kernel has been tested on a Wizard-of-Oz corpus collected in the tutoring dialogue scenario. The corpus consists of 630 non-single word/phrase utterances, containing approximately 5700 words and 48 minutes of speech. The tests used speech waveforms together with manual orthographic transcriptions, and yielded an accuracy of 83.8% for focus kernel detection and 92.8% for symmetric contrast detection. Our tests also demonstrated that the spectral balance cepstral coefficients, the semantic dissimilarity measure, and part-of-speech played important roles in the symmetric contrast and focus kernel detections.

Introduction

Words are tools; in real speech, every word is deployed for the purpose of achieving a human goal. The fields of computational semantics and pragmatics study quantifiable goal variables—variables that encode quantifiable aspects of the goals served by a word in context—and their semantic and contextual correlates. This paper describes the computation of two semantic and pragmatic goal variables, focus and contrast, from spontaneous speech.

The paper is organized as follows. The remainder of Section 1 explains why we are interested in annotating focus and contrast, defines the aspects of focus and contrast that are under study with examples from an intelligent tutoring system (ITS) corpus, and then puts forward the objectives of our study in this paper. Section 2 provides some background in support of our work and describes related work in modern linguistics and computational linguistics. Section 3 describes the ITS corpus in detail, with particular attention paid to annotations and corpus statistics of the proposed focus and contrast variables. Sections 4 and 5 describe the algorithms implemented for the purpose of detecting the proposed focus and contrast variables: Section 4 describes prosodic analysis, and Section 5 describes the measurement of word semantic similarities. Section 6 describes system integration and results of experimental evaluation using the ITS corpus. Section 7 discusses and concludes our work.

The motivation of this study is to achieve robust spontaneous spoken language understanding (SSLU) in an intelligent tutoring dialogue system. The system is intended to provide a computer-based environment for education in math and physics, using the Lego construction set, for children of primary and early middle school age (9–12 years old). Owing to the characteristics of the dialogue scenario described in Section 3.1, the children's spontaneous utterances are often dysfluent, ungrammatical, and even incoherent. Our design for robust speech understanding under these circumstances involves two steps: (1) Classification of each utterance into one of a list of 30 tutoring events. Similar to call types in an automatic call center or call router (Gorin et al., 2002, Chu-Carroll and Carpenter, 1999), the tutoring events broadly summarize the content of utterances in the tutoring dialogue scenario. For example, the tutoring event AskForPlayInstruction means that the user asks for instruction on how to play with the Legos; SpinSpeed means that the user is talking about the spinning speed of the Lego gears; and ExplainAction means that the user explains what is being done with the Legos. (2) Extraction of detailed information when the tutoring event alone does not give the computer enough information to produce a proper response. For example, when the tutoring event ArithmeticComputation is detected, the computer sometimes needs to know the type of arithmetic computation; if it is division, the computer also needs to know the dividend and the divisor in order to respond properly. Extracting such detailed information requires syntactic/semantic structure parsing or named entity recognition (Zhang, 2004).
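As a rough illustration of the two-step design described above, the sketch below pairs a tutoring event label with optional detail slots and reports which details are still missing. The event names come from the text; the class name, the slot names (`operation`, `dividend`, `divisor`), and the `missing_details` helper are hypothetical and are not taken from the system described here.

```python
from dataclasses import dataclass, field


@dataclass
class UnderstandingResult:
    """Hypothetical container: step (1) fills the event, step (2) fills the slots."""
    tutoring_event: str
    slots: dict = field(default_factory=dict)


def missing_details(result):
    """Return the slot names still needed before a response can be chosen."""
    if result.tutoring_event != "ArithmeticComputation":
        return []  # assumption: other events need no extra detail in this sketch
    if "operation" not in result.slots:
        return ["operation"]
    if result.slots["operation"] == "division":
        # Division needs both operands, as in the example in the text.
        return [s for s in ("dividend", "divisor") if s not in result.slots]
    return []


partial = UnderstandingResult("ArithmeticComputation",
                              {"operation": "division", "dividend": 120})
print(missing_details(UnderstandingResult("SpinSpeed")))
print(missing_details(partial))
```

Under these assumptions, the first call reports nothing missing, while the second reports that the divisor has not yet been extracted.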

To analyze the content meaning of an utterance, we are interested in extracting a small set of words, from the utterance, that encode pragmatically and semantically salient information. We investigate the computerization of two linguistic concepts, focus and contrast, that are assumed to be useful for content summarization and structure parsing of spoken messages. Both of the concepts have reasonably clear published definitions. We wish to adapt the published definitions as necessary in order to define a corpus transcription experiment, and to train and test algorithms that automatically detect these two categories of salience based on cues measured in the speech waveform and in its orthographic transcription.

The information structure of a sentence can be partitioned into presupposition and focus: presupposition is what the interlocutor assumes to be true when the sentence is elicited in a conversation, and focus is the non-presupposed part of the sentence (Chomsky, 1971, Zubizarreta, 1998). For example (T represents tutor and U represents user. Focus is marked by [ ]F),

  • (1)

    T: What are you exploring there?

    U: [Seeing if the small gears move the big gears.]F

  • (2)

    T: How many times does the small gear spin until they line up again?

    U: I think it goes around [one and a half]F times.

By definition focus is indicative of pragmatically and semantically new information not presupposed by the interlocutor. If focus can be reliably detected, it should be possible to use the distinction between focus and presupposition to detect new information embedded in an utterance. Speakers will often signal focus of a sentence by the use of pitch accent (we use pitch accent to mean prosodic prominence marked by F0 extrusion; the same word is usually also marked by the other acoustic correlates of prominence, including duration, energy, and spectral balance). Pitch accent marks the constituents within an utterance as highlighted or unexpected; it has been argued that constituents outside focus are expected, and hence tend to be unaccented (Kadmon, 2001, Zubizarreta, 1998, Hedberg and Sosa, 2001). For example (pitch accented words are marked with subscript a),

  • (3)

    T: Which gear are you counting?

    U: I am counting the small_a gear.

  • (4)

    T: Which gear do you think is the strongest?

    U: Probably the large_a gear.

However, the phonological manifestation is not straightforward: pitch accent can only approximate the location of focus in a sentence. For example, in the sentence They turn in the opposite_a direction, the accented word ‘opposite’ is the focus in answer to the question What can you tell me about the directions they turn? but not in answer to the question What else do you notice? The latter question requires the whole sentence ‘they turn in the opposite direction’ to be interpreted as focus. Such ambiguity in the pragmatic interpretation of a single accent is traditionally known as the focus projection phenomenon: focus expressed by a single accent can project to a larger linguistic constituent than just the word bearing the pitch accent.

Since focus is a syntactic constituent, the boundaries of focus need to be determined to identify focus. A sentence may have multiple foci, and the size of a focus may vary from a single word to a phrase or even a sentence. It is difficult to automatically extract syntactic constituents containing novel information without making use of a complete parse tree for the sentence in question. Even with a parse tree available, automatically selecting the right constituents would be difficult; for example, in the following exchange,

  • (5)

    T: Which gear are you counting?

    U: I am counting the [small]F gear [in my hand]F,

it would be difficult for an automatic algorithm to determine that focus consists of a single word and a prepositional phrase; it would be nearly impossible without access to a correct parse of the sentence. It is even harder to extract focus from spontaneous speech, since spontaneous speech often has loose grammatical structure, dysfluency, and inconsistency between linguistic segments and acoustic segments. For example (‘…’ represents silence):

  • (6)

    T: What happens when you spin the left gears?

    U: Ahmm … When you [after it goes around once the other one goes around … the same … the same I mean it goes around … you know you only have to spin it around once … and that makes sense basically because they are the same size.]F

To robustly understand spontaneous speech, we propose labeling individual words containing new information neither presupposed by the interlocutor nor contained in the preceding part of the utterance. Such a word is usually a content word because of the information content requirement. We hypothesize that words matching this definition will typically be the semantically salient part of focus. Therefore, we call each of these words a focus kernel. In the following examples, focus kernels are marked with bold:

  • (7)

    T: What happens to the different gears as you spin the one at the end?

    U: They move with the single gear that I’m spinning.

  • (8)

    T: Oh, are you having fun?

    U: Yeah, it’s kind of interesting.

  • (9)

    T: How many times would it take for the reds to come back on top?

    U: It would take three times to have the red be back on top.
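The focus kernel definition above can be approximated, very crudely, by a lexical-overlap baseline: keep each content word of the user's turn that appears neither in the tutor's turn nor earlier in the same utterance. The sketch below is only that baseline, not the detector evaluated in this paper, which relies on a word dissimilarity measure, part-of-speech tagging, and prosody. The function-word list, the plural-stripping `normalize` helper, and all function names are assumptions made for illustration.

```python
import re

# Hypothetical, hand-picked function-word list standing in for a POS tagger.
FUNCTION_WORDS = {
    "a", "an", "the", "i", "it", "they", "you", "that", "to", "of", "on",
    "in", "with", "be", "is", "are", "am", "have", "has", "would", "and", "as",
}


def normalize(word):
    """Lowercase and crudely strip a plural 's' (an assumption, not a real stemmer)."""
    word = word.lower()
    if len(word) > 3 and word.endswith("s"):
        word = word[:-1]
    return word


def tokenize(text):
    """Split text into normalized alphabetic tokens."""
    return [normalize(w) for w in re.findall(r"[A-Za-z']+", text)]


def focus_kernel_candidates(tutor_turn, user_turn):
    """Content words that are new relative to the tutor turn and the preceding words."""
    seen = set(tokenize(tutor_turn))  # stands in for the presupposed material
    candidates = []
    for word in tokenize(user_turn):
        if word not in FUNCTION_WORDS and word not in seen:
            candidates.append(word)
        seen.add(word)  # later repetitions no longer count as new
    return candidates


tutor = "How many times would it take for the reds to come back on top?"
user = "It would take three times to have the red be back on top."
print(focus_kernel_candidates(tutor, user))
```

On exchange (9) this baseline keeps only `three`. Exact string matching cannot recognize that a synonym or a closely related word carries little new information, which is the gap a graded dissimilarity measure is meant to fill.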

Contrast is a concept having multiple senses: (1) In logic, two propositions are defined to be contrastive if it is impossible for them to be true simultaneously. For example, in the sentence Bach was an organ mechanic; Mozart knew little about organs, the two propositions are not contrastive, whereas they become contrastive when ‘Mozart’ is replaced by ‘Bach’ at the beginning of the second sentence (Bosch and van der Sandt, 1999). (2) The discourse relation called contrast is induced by ‘but,’ and constitutes a pair (or pairs) of contrasted alternatives, which can be predicates (e.g., John cleaned up the room, but he didn’t wash the dishes), individual words (e.g., John cleaned up the room, but Bill didn’t), or propositions (e.g., It is raining, but we go out for a walk) (Umbach, 2004). (3) Some linguists use contrast to denote the mutually exclusive disjunction between the words contributing to a fact and other alternatives made available by context (Vallduví and Vilkuna, 1998). It has been argued that focus in general establishes a contrast since novel information usually conveys contrast between a fact and the potential alternatives (Bolinger, 1961, Kruijff-Korbayova and Steedman, 2003). For example, in the sentence Last night they had a party, there is a contrast between the focus ‘party’ and any other alternative activities of the group. (4) Symmetric contrast consists of a set of words that are parallel or symmetric in linguistic structure but mutually exclusive in meaning; the stress on one word is motivated by its distinction from the others, e.g., ‘American’ and ‘Canadian’ in An American farmer was talking to a Canadian farmer (Rooth, 1992, Umbach, 2004).

In this study, we seek to make use of the knowledge about contrast from the pragmatics and prosody literature, for the purpose of detecting pairs of symmetrically contrasted words that are assumed to be useful for spontaneous speech understanding. Symmetric contrast can occur within a sentence, e.g. (contrasted words are marked with bold),

  • (10)

    U: The large gear has five times as many teeth as the small ones.

  • (11)

    U: How about small and big and medium?

Topics and/or foci of conjunct phrases or coordinated sentences (by ‘and’, ‘but’, etc.) can also constitute symmetric contrast, e.g.

  • (12)

    T: Where are the gears?

    U: The red gear is on the bottom and the yellow gear is on the top.

  • (13)

    T: How are the gears spinning?

    U: The two outside ones spin in the same direction and the middle one spins in the opposite direction.

The words participating in a symmetric contrast satisfy semantic parallelism, which has two implications: (a) the conjunct alternatives have to be semantically independent of each other in the sense that neither of them subsumes the other; and (b) there has to be a “common integrator,” i.e., a concept subsuming both conjunct alternatives (Umbach, 2004).
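The two conditions of semantic parallelism can be illustrated on a toy taxonomy: (a) neither word is an ancestor of the other, and (b) the two words share some ancestor that acts as the common integrator. The sketch below assumes a hand-made child-to-parent table; the table contents and function names are hypothetical, and a real system would need a lexical ontology and a graded measure rather than this yes/no test.

```python
# Hypothetical child -> parent links; not taken from any real ontology.
TOY_HYPERNYMS = {
    "red": "color",
    "yellow": "color",
    "crimson": "red",
    "small": "size",
    "big": "size",
    "medium": "size",
}


def ancestors(concept):
    """Return the chain of ancestors of a concept, nearest first."""
    chain = []
    while concept in TOY_HYPERNYMS:
        concept = TOY_HYPERNYMS[concept]
        chain.append(concept)
    return chain


def common_integrator(a, b):
    """Return the nearest concept subsuming both a and b, or None."""
    ancestors_of_b = set(ancestors(b))
    for concept in ancestors(a):
        if concept in ancestors_of_b:
            return concept
    return None


def semantically_parallel(a, b):
    """Check conditions (a) and (b) on the toy taxonomy."""
    if a == b or a in ancestors(b) or b in ancestors(a):
        return False  # condition (a) fails: one alternative subsumes the other
    return common_integrator(a, b) is not None  # condition (b)


print(semantically_parallel("red", "yellow"), common_integrator("red", "yellow"))
print(semantically_parallel("red", "crimson"))
print(semantically_parallel("red", "small"))
```

In this toy setting `red` and `yellow` are parallel under the integrator `color`, `red` and `crimson` are not because one subsumes the other, and `red` and `small` are not because the table gives them no shared ancestor.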

As its primary technical goal, the study intends to test whether the proposed word tags, i.e., focus kernel and symmetric contrast, can be reliably annotated in a spontaneous speech corpus using both manual and automatic annotation. As part of this evaluation, this study tests the relationship of focus kernel and symmetric contrast with the following prosodic and pragmatic variables: (1) prosodic prominence—experiments described in this paper test the reliability of prosodic prominence in the automatic identification of focus kernel and symmetric contrast; (2) novelty and semantic parallelism, the semantic attributes of focus kernel and symmetric contrast. Information theoretic measures of novelty and semantic parallelism are implemented, based on algorithms proposed in the computational linguistics literature. Implemented algorithms are tested for the purpose of automatically identifying focus kernel and symmetric contrast; and (3) part-of-speech. In addition, this study discusses the usefulness of focus kernel and symmetric contrast to spontaneous speech understanding.

Section snippets

Background and related work

Focus and contrast in modern linguistics are used to “account for the correlation between certain prosodic patterns and certain pragmatic and semantic effects” (Kadmon, 2001). Sections 2.1 (Focus) and 2.2 (Contrast) describe related work on contrast and focus published in the linguistics and computational linguistics literature. Section 2.3 describes previous work on the word dissimilarity measure (given a pair of words, how much novel information a word contains with respect to the other word) in

Tutoring dialogue scenario

The intelligent tutoring system helps students learn basic math and physics concepts by playing with Lego gears, with the objective of helping students develop a physical understanding of abstract concepts. For example, one question about the relationship between gear size and spinning speed is Line up a 24-tooth gear and a 40-tooth gear. If the 24-tooth gear spins 5 times, then how many times must the 40-tooth gear spin for them to line up again? Why? Children can answer this question by

Prosodic analysis

The literature in both prosody and pragmatics reports that pitch accent is a correlate of contrast and focus. Therefore, pitch accent detection is a reasonable first step in the automatic classification of focus kernel and symmetric contrast. Because of the manpower required to label pitch accent manually in the ITS corpus, we try to use an automatic system to label pitch accent. To date, automatic pitch accent detection has concentrated on Radio Speech, in which half of all words may be pitch accented (Kim et

Word semantics analysis

We use a word dissimilarity measure to model the degree of novelty that a word has in comparison with other words. We use $N_i$ to denote the novelty of word $w_i$ given the dialogue context. According to the definition of focus kernel, we compute $N_i$ as the minimum of the dissimilarity between $w_i$ and the words in a set $S$, where $S$ consists of the words appearing in the interlocutor’s presupposition and the words preceding $w_i$ in the utterance, i.e.,

$$N_i = \min_{w_j \in S} \mathrm{dis}(w_i, w_j).$$
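The minimum-dissimilarity definition of novelty above can be sketched in a few lines. The abstract mentions a linear combination of ontological and statistical dissimilarity measures; the sketch below assumes a simple weighted sum with a hypothetical weight `alpha`, and uses made-up lookup tables in place of real measures. The value returned for an empty context is also an assumption, since the text does not say how that case is handled.

```python
def combined_dissimilarity(w1, w2, ontological, statistical, alpha=0.5):
    """Weighted sum of two dissimilarity measures (alpha is a hypothetical weight)."""
    return alpha * ontological(w1, w2) + (1.0 - alpha) * statistical(w1, w2)


def novelty(word, context_words, dis):
    """Novelty of a word: its minimum dissimilarity to any word in the context set S."""
    if not context_words:
        return 1.0  # assumption: a word with no context counts as maximally novel
    return min(dis(word, other) for other in context_words)


def table_measure(table):
    """Build a toy symmetric dissimilarity function from a lookup table."""
    def measure(w1, w2):
        if w1 == w2:
            return 0.0
        return table.get((w1, w2), table.get((w2, w1), 1.0))
    return measure


# Made-up scores in [0, 1]; not values from the paper.
toy_ontological = table_measure({("three", "time"): 0.75, ("large", "big"): 0.25})
toy_statistical = table_measure({("three", "time"): 0.25, ("large", "big"): 0.25})


def toy_dis(w1, w2):
    return combined_dissimilarity(w1, w2, toy_ontological, toy_statistical)


context = ["time", "red", "big"]  # plays the role of the set S
print(novelty("three", context, toy_dis))
print(novelty("large", context, toy_dis))
print(novelty("red", context, toy_dis))
```

Because the minimum is taken over the whole set, a single close neighbor in the context is enough to make a word count as given: `red` scores zero here because it already appears in the context.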

Semantic parallelism is the semantic

System evaluation

The corpus for focus kernel classification consisted of 630 multi-word, multi-phrase utterances, containing approximately 5700 words and 48 min of speech. In the experiments of extracting focus kernel and symmetric contrast, training and test data included different utterances from the same set of talkers, so the experiments were multi-speaker speaker-dependent.
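A multi-speaker, speaker-dependent setup of this kind can be sketched as a per-talker split: each talker's utterances are divided between training and test, so the two sets share talkers but never share an utterance. The test fraction, the random seed, and the function name below are assumptions; the text does not specify how the utterances were partitioned.

```python
import random
from collections import defaultdict


def speaker_dependent_split(utterances, test_fraction=0.2, seed=0):
    """Split (speaker_id, utterance_id) pairs per speaker into train and test lists."""
    by_speaker = defaultdict(list)
    for speaker, utterance in utterances:
        by_speaker[speaker].append(utterance)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for speaker in sorted(by_speaker):
        items = list(by_speaker[speaker])
        rng.shuffle(items)
        n_test = max(1, round(len(items) * test_fraction))
        n_test = min(n_test, len(items) - 1)  # keep at least one item for training
        test.extend((speaker, u) for u in items[:n_test])
        train.extend((speaker, u) for u in items[n_test:])
    return train, test


# Toy corpus: two talkers with 10 and 5 utterances.
corpus = [("s1", i) for i in range(10)] + [("s2", i) for i in range(5)]
train_set, test_set = speaker_dependent_split(corpus)
print(len(train_set), len(test_set))
```

A speaker-independent evaluation would instead hold out whole talkers, which this sketch deliberately does not do.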

Discussion and conclusions

This paper has computationalized two linguistic concepts, contrast and focus, that were assumed to be useful for robust understanding of spontaneous spoken messages in a dialogue system. Standard and reasonable linguistic definitions of focus are difficult to implement computationally, because the scope of focus is dependent on the syntactic structure of the utterance and highly variable. In order to create a computationally feasible focus detector, this paper has defined focus kernel to be the

Acknowledgements

We would like to thank Richard W. Sproat for helpful discussions and providing us with the GigaWords text corpus. We would also like to thank Carla Umbach and Chungmin Lee for their comments and suggestions. This work is supported by NSF grant number 0085980. Statements in this paper reflect the opinions and conclusions of the authors, and are not endorsed by the NSF.

References (57)

  • I. Dagan et al.

    Contextual word similarity and estimation from sparse data

    Computer Speech Language

    (1995)
  • S.-S. Kim

    Time-delay recurrent neural network for temporal correlations and prediction

    Neurocomputing

    (1998)
  • Beckman, M.E., Ayers, G.M., 1994. Guidelines for ToBI Labeling. Available from:...
  • D. Bolinger

    Contrastive accent and contrastive stress

    Language

    (1961)
  • D. Bolinger

    Forms of English

    (1965)
  • P. Bosch et al.

    Focus: Linguistic, Cognitive, and Computational Perspectives

    (1999)
  • J. Chu-Carroll et al.

    Vector-based natural language call routing

    Comput. Linguistics

    (1999)
  • N. Chomsky

    Deep structure, surface structure and semantic interpretation

  • D. Dahl

    Topic and Focus: a Study in Russian and General Transformational Grammar

    (1969)
  • I. Daubechies

    The wavelet transform, time-frequency localization and signal analysis

    IEEE Trans. Information Theory

    (1990)
  • P. Edmonds et al.

    Near-synonymy and lexical choice

    Comput. Linguistics

    (2002)
  • J. Firbas

    On defining the theme in functional sentence analysis

    Travaux Linguistiques de Prague

    (1964)
  • J. Firbas

    Non-thematic subjects in contemporary English

    Travaux Linguistiques de Prague

    (1966)
  • Flammia, G., 1998. Discourse segmentation of spoken dialogue: an empirical approach. Ph.D. Thesis....
  • A.L. Gorin et al.

    Natural spoken dialog

    IEEE Computer Magazine

    (2002)
  • J.K. Gundel et al.

    Topic and focus

  • Gussenhoven, C., 2002. Intonation and interpretation: phonetics and phonology. In: Bel, B., Marlien, I. (Eds.), Speech...
  • M. Halliday

    Notes on transitivity and theme in English. Part II

    J. Linguistics

    (1967)
  • Hedberg, N., Sosa, J.M., 2001. The prosody of topic and focus in spontaneous English dialogue. LSA Topic and Focus...
  • Heldner, M., Strangert, E., Deschamps, T., 1999. A focus detector using overall intensity and high frequency emphasis....
  • Higgins, D., 2004. Which statistics reflect semantics? Rethinking synonymy and word similarity. Internat. Conf. on...
  • R. Jackendoff

    Semantic Interpretation in Generative Grammar

    (1972)
  • Jiang, J.J., Conrath, D.W., 1997. Semantic similarity based on corpus statistics and lexical taxonomy. Proc. Internat....
  • N. Kadmon

    Formal Pragmatics

    (2001)
  • M. Kay

    Syntactic processing and functional sentence perspective

  • S.-S. Kim et al.

    Automatic recognition of pitch movements using multi-layer perceptron and time-delay recurrent neural network

    IEEE Signal Process. Lett.

    (2003)
  • Krifka, M., 1999. Additive particles under stress. Proc. of SALT...
  • I. Kruijff-Korbayova et al.

    Discourse and information structure

    J. Logic Language Inf.

    (2003)