Elsevier

Speech Communication

Volume 65, November–December 2014, Pages 109-118

Feasibility of augmenting text with visual prosodic cues to enhance oral reading

https://doi.org/10.1016/j.specom.2014.07.002

Highlights

  • We developed novel reading software to augment text with visual prosodic cues.

  • We assessed the feasibility of the software on a group of beginning readers.

  • Results indicated that visual prosodic cues were readily learned and implemented.

Abstract

Reading fluency has traditionally focused on speed and accuracy, yet recent reports suggest that expressive oral reading is an important component that has been largely overlooked. The current study assessed the impact of augmenting text with visual prosodic cues to improve expressive reading in beginning readers. Customized reading software was developed to present text augmented with prosodic cues to convey changes in pitch, duration and/or intensity. Prosodic modulation was derived from the recordings of a fluent adult model and rendered as a set of visual cues that could be presented in isolation or in combination. To establish baseline measures, eight children aged 7–8 first read a five-chapter story in standard text format. In the subsequent three sessions, participants were trained to use each augmented text cue with the guidance of an auditory model. They also had the opportunity to practice reading aloud in each cue condition. At the post-training session, participants re-recorded the baseline story with each chapter read in one of the different cue conditions (standard, pitch, duration, intensity and combination). Post-training and baseline recordings were acoustically analyzed to assess changes in reading expressivity. Despite large individual differences in how each participant implemented the prosodic cues, as a group, there were notable improvements in marking pitch accents and elongating word duration to convey linguistic contrasts. In fact, even after only three training sessions, participants appeared to have generalized implementation of pitch and word duration cues when reading standard text at post-training. In contrast, while participants manipulated pause duration when provided with explicit visual cues, they did not transfer these cues to standard text at post-training. These findings suggest that beginning readers could benefit from explicit visual prosodic cues and that even limited exposure may be sufficient to learn and generalize skills. Further discussion focuses on the implications of this work for struggling readers and second language learners.

Introduction

Fluent oral reading is a hallmark of skilled reading and involves a number of complex skills including rapid or semi-automatic word decoding and the extraction of syntactic and semantic information to facilitate the translation of text into speech (Adams, 1990). Although much of the literature defines reading fluency in terms of rate and accuracy (Daane et al., 2005), evidence suggests expression and ease of reading are also critical to oral reading fluency (Allington, 1983, Dowhower, 1991, Eason et al., 2013, Jenkins et al., 2003).

Reading with expression requires modulation of prosody – the rhythm and melody of speech. Prosody is used to signal linguistic contrasts and express emotions and attitudes (Lehiste, 1970, Shattuck-Hufnagel and Turk, 1996). Speakers manipulate the fundamental frequency (F0) of their voice (perceived as pitch), together with changes in duration, vocal intensity (perceived as loudness), and voice quality to convey linguistic and affective goals (Xu, 1999, Xu, 2011). Prosody is also important for reading comprehension. Listeners use prosodic cues to segment oral language into meaningful syntactic units, or phrases within an utterance (Cutler et al., 1997, Shattuck-Hufnagel and Turk, 1996), which supports working memory and comprehension. In fact, even newborns and infants utilize prosody to attune to the rhythmic regularities of their native language (Morgan and Demuth, 1995).

Given the importance of prosody in spoken fluency and comprehension, we hypothesized that providing explicit visual cues to the underlying prosody would improve reading fluency in beginning readers. Although beginning readers can modulate conversational prosody, many children struggle to apply this skill when reading aloud, resulting in expressionless and labored speech even when they are proficient decoders. The lack of sufficient cues in written text may contribute to this apparent dichotomy between the presence of prosodic modulation in conversation and its absence during reading. Readers must draw inferences about appropriate prosody from context, punctuation and grammar (Carlson, 2009, Miller and Schwanenflugel, 2006, Schreiber, 1987).

Furthermore, reading with expression is made even more challenging by the fact that prosodic control is developing simultaneously with reading acquisition (Cruttenden, 1985, Crystal, 1978, Local, 1980, Snow, 1994, Snow, 1998). The developmental trajectory of prosodic control begins with the modulation of cries (Gilbert and Robb, 1996, Lind and Wermke, 2002, Protopapas and Eimas, 1997, Wermke et al., 2002) and continues throughout childhood and even into adolescence (Cruttenden, 1985, Crystal, 1986, Local, 1980, Snow, 1994, Snow, 1998, Tingley and Allen, 1975, Wells et al., 2004). Young children often use different acoustic cues, or combinations of cues, than older children to signal prosodic contrasts. For example, Patel and Grigos (2006) showed that while seven- and eleven-year-olds marked yes/no questions with increased phrase-final F0, four-year-olds tended to rely on duration cues. Rising contours may be more motorically demanding (Snow, 1998), which could explain why young children manipulate duration instead. Additionally, Grigos and Patel (2007) noted that, despite being able to mark prosodic contrasts, 7-year-olds exhibited greater kinematic and acoustic variability than 11-year-olds, suggesting continued motor development.

Previous attempts to address the need to supplement written text with prosodic information have been limited to manipulations of spacing, punctuation, font and case. Some researchers have recommended formatting text to display intra-sentence phrasal boundaries to facilitate chunking of text into meaningful units (Cromer, 1970, Levasseur et al., 2006, O’Shea and Sindelar, 1983). Others have suggested manipulating punctuation (e.g. My friend? My friend! My friend.) and font case (e.g. I like SOME of my relatives versus I like some of MY relatives) to practice modulating intonation (Blevins, 2001). These approaches apply a set of grammatical rules to convey prosodic variation. There is, however, considerable prosodic variation in natural speech that cannot be captured by simple mappings.

Our approach aimed to augment written text with visual prosodic cues that are derived from fluent adult recordings. The goal is to provide beginning readers with the scaffolding to read aloud expressively as they continue to master control of prosody. Toward this end, we developed a software program called ReadN’Karaoke that provides multimodal (visual and auditory) cues to three different components of prosody: pitch, duration, and vocal loudness. The first version (ReadN’Karaoke 1.0) directly manipulated text based on a fluent adult reader’s F0, duration, and intensity variation. A user study with typically developing children showed significant increases in F0 and duration modulation when reading with the cues (Patel and McNab, 2011). Although these results were promising, participants also reported that manipulated words were sometimes hard to read; specifically, word boundaries were often difficult to distinguish on pitch-manipulated tokens (Patel and McNab, 2011). To address these concerns, a new visualization scheme was designed. Rather than manipulating text, ReadN’Karaoke 2.0 augments written text with overlaid cues to pitch, duration and intensity variation (Patel and Furr, 2011). Fig. 1 displays the manipulated text from ReadN’Karaoke 1.0 alongside the augmented text from ReadN’Karaoke 2.0. A preliminary usability study with two typically developing children indicated that augmented text was easier to read than manipulated text and that it resulted in similar gains in prosodic reading (Patel and Furr, 2011).
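To make the idea of deriving cues from a model recording concrete, the following is a minimal sketch in Python. It assumes per-word F0, duration and intensity values have already been measured from a fluent reader, and it scales each one to a 0–1 range that a renderer could map to a visual property. The `WordProsody` type, the `to_cues` function, the min-max scaling and the sample values are all hypothetical illustrations; the text above does not specify how ReadN’Karaoke computes its renderings.

```python
from dataclasses import dataclass


@dataclass
class WordProsody:
    """Per-word measurements from a model recording (hypothetical layout)."""
    word: str
    f0_hz: float         # mean fundamental frequency of the word
    duration_s: float    # word duration in seconds
    intensity_db: float  # mean vocal intensity


def min_max(values):
    """Scale values to 0..1; return 0.5 for every item if all values are equal."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def to_cues(words):
    """Turn raw measurements into relative cue strengths for one sentence.

    Assumption: cues are relative within a sentence, so each feature is
    normalized across the words of that sentence only.
    """
    pitch = min_max([w.f0_hz for w in words])
    length = min_max([w.duration_s for w in words])
    loud = min_max([w.intensity_db for w in words])
    return [
        {"word": w.word, "pitch": p, "duration": d, "intensity": i}
        for w, p, d, i in zip(words, pitch, length, loud)
    ]


# Made-up measurements for a yes/no question with a rising final contour.
model = [
    WordProsody("Is", 200.0, 0.20, 60.0),
    WordProsody("it", 220.0, 0.15, 62.0),
    WordProsody("raining", 300.0, 0.55, 66.0),
]
cues = to_cues(model)
```

In this sketch the final word receives the highest pitch, duration and intensity values, which is the kind of contrast a pitch or duration cue would need to show for a question. Presenting cues in isolation or in combination would then amount to choosing which of the three keys the renderer uses.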

The current study aimed to assess ReadN’Karaoke 2.0 with a group of beginning readers using a five-chapter standardized story. In our earlier work (Patel and McNab, 2011), we measured overall changes in F0, duration and intensity while reading stories and did not control for the type of prosodic contours children were asked to read. While the results indicated increases in overall prosodic range, the impact of visual prosodic cues on the ability to mark specific linguistic contrasts remained unclear. In the current study, we address this issue by examining acoustic changes for a subset of linguistic contrasts. Motivated by Miller and Schwanenflugel (2006), we designed a novel, five-chapter story that included at least three instances of six prosodic targets in each chapter. Four of these prosodic targets were identified by Dowhower (1991) as indicative of adult prosodic reading ability because they tend to involve modulation of either pitch (declarative sentence, yes/no question) or duration (phrase final lengthening, noun-adjective list). We also included sentences with contrastive stress and exclamations because they may be marked by a combination of acoustic cues (Cooper and Sorensen, 1981, Eady and Cooper, 1986, Fry, 1955). See Fig. 2 for a schematic of the typical prosodic contours for each of the six target linguistic contrasts.

The present study sought to determine whether reading aloud using ReadN’Karaoke 2.0 would improve reading fluency and expression in typically developing children. We had two main hypotheses:

  • (1) Participants will read with more expression, as measured by changes in F0, duration and intensity, when provided with visual cues versus without cues.

  • (2) Visual prosodic cues will improve reading comprehension scores in comparison to reading without cues.

Section snippets

Participants

Eight typically developing children (3M, 5F; M = 8.06 years, SD = 0.36, range = 7;7–8;4) who were all native speakers of American English were recruited to participate. Reading levels of each child were determined using the Developmental Reading Assessment-2 (DRA-2; Beaver and Carter, 2006). Participants who demonstrated “instructional” or “independent” proficiency on levels comparable to 2nd grade reading level were eligible for the study, and children were excluded if their reading level was outside

Word duration

Differences in mean word duration from baseline to post-training for target words in PFL and CS were included in this analysis. For these sentence types, the target word tends to be elongated (Bolinger, 1989, Cooper et al., 1985, Fry, 1955). The main effect of testing time was significant, F(1, 15.26) = 12.53, p = 0.003, as was the interaction between testing time and cue condition, F(8, 137.31) = 2.76, p = 0.007. Fig. 4 provides mean word duration difference scores (post-training minus baseline) for each
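As an illustration of the difference scores mentioned above, the sketch below subtracts the baseline mean word duration from the post-training mean for each cue condition. The function name, the dictionary layout and the sample durations are assumptions made for this example; they are not the study's data or analysis code, and the sketch does not reproduce the statistical model behind the reported F tests.

```python
from statistics import mean


def duration_difference_scores(baseline, post_training):
    """Return post-training minus baseline mean word duration per condition.

    Both arguments map a cue-condition label to a list of target-word
    durations in seconds (hypothetical data layout). Conditions missing
    from either session are skipped.
    """
    scores = {}
    for condition, durations in baseline.items():
        if condition in post_training:
            scores[condition] = mean(post_training[condition]) - mean(durations)
    return scores


# Made-up durations (seconds) for illustration only; not data from the study.
baseline = {"standard": [0.30, 0.34], "duration": [0.31, 0.33]}
post_training = {"standard": [0.36, 0.40], "duration": [0.45, 0.47]}

scores = duration_difference_scores(baseline, post_training)
```

A positive score means target words were longer at post-training than at baseline, which is the direction expected if readers elongate words to mark the contrast.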

Discussion

This study aimed to investigate the effectiveness of providing visual prosodic cues to improve oral reading expressivity in beginning readers. In contrast to previous work, changes in expressiveness were quantified through acoustic measures of prosody (pitch, loudness, duration), rather than through qualitative rating scales (Allington, 1983, Eason et al., 2013) or measures of speed and accuracy (e.g. Neddenriep et al., 2011, Therrien, 2004, Walker et al., 2005).

Six linguistic prosodic

Acknowledgements

There are a number of individuals who have made significant contributions to this work. We are indebted to Isabel Meirelles for her collaboration on designing the visual renderings used here, to Sheelah Sweeny for her guidance and work on developing the stories and comprehension questions, and to William Furr for his dedication to implementing a robust and user-friendly software system. We also thank our participants and their families for their time and commitment to this multi-week study. Last

References (48)

  • Blevins, W., 2001. Building fluency: Lessons and strategies for reading success.
  • Boersma, P., Weenink, D., 2014. Praat: Doing Phonetics by Computer. <http://www.praat.org> 5.3.62 (retrieved...
  • Bolinger, D., 1989. Intonation and Its Uses: Melody in Grammar and Discourse.
  • Carlson, K., 2009. How prosody influences sentence comprehension. Lang. Linguist. Compass.
  • Christensen, R., 2002. Plane Answers to Complex Questions: The Theory of Linear Models.
  • Cooper, W.E., et al., 1981. Fundamental Frequency in Sentence Production.
  • Cooper, W.E., et al., 1985. Acoustical aspects of contrastive stress in question answer contexts. J. Acoust. Soc. Am.
  • Cromer, W., 1970. The difference model: a new explanation for some reading difficulties. J. Educ. Psychol.
  • Cruttenden, A., 1985. Intonation comprehension in ten-year-olds. J. Child Lang.
  • Crystal, D. The analysis of intonation in young children.
  • Crystal, D. Prosodic development.
  • Cutler, A., et al., 1997. Prosody in the comprehension of spoken language: a literature review. Lang. Speech.
  • Daane, M.C., Campbell, J.R., Grigg, W.S., Goodman, M.J., Oranje, A., 2005. Fourth-Grade Students Reading Aloud: NAEP...
  • Dowhower, S.L., 1991. Speaking of prosody: fluency’s unattended bedfellow. Theory Pract.