DOI: 10.1145/3571884.3597134

Gist and Verbatim: Understanding Speech to Inform New Interfaces for Verbal Text Composition

Published: 19 July 2023

Abstract

Recent work on speech-to-text applications has found speech to be an efficient modality for text input. However, the spontaneity of speech makes direct transcriptions of spoken compositions effortful to edit. While previous work in the Human-Computer Interaction (HCI) domain has focused on improving error correction, there is little theoretical grounding for understanding speech as an input modality. This work synthesizes relevant theories and findings from the Cognitive Science literature for the HCI audience to reference. Motivated by literature indicating rapid memory decay for produced speech and a preference for gist abstraction in memory traces, we conducted an experiment observing users’ immediate recall of their verbal compositions. Based on these theories and findings, we introduce new interaction concepts and workflows that adapt to the characteristics of speech input.


Cited By

  • (2024) Rambler: Supporting Writing With Speech via LLM-Assisted Gist Manipulation. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–19. https://doi.org/10.1145/3613904.3642217
  • (2024) Leveraging Prompt-Based Large Language Models: Predicting Pandemic Health Decisions and Outcomes Through Social Media Language. In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1–20. https://doi.org/10.1145/3613904.3642117

        Published In

        CUI '23: Proceedings of the 5th International Conference on Conversational User Interfaces
        July 2023
        504 pages
        ISBN: 9798400700149
        DOI: 10.1145/3571884

        Publisher

        Association for Computing Machinery

        New York, NY, United States


        Author Tags

        1. STT
        2. dictation
        3. speech
        4. speech-to-text
        5. text composition
        6. text editing
        7. text entry

        Qualifiers

        • Research-article
        • Research
        • Refereed limited

        Funding Sources

        • Hong Kong Research Grants Council

        Conference

        CUI '23: ACM conference on Conversational User Interfaces
        July 19 - 21, 2023
        Eindhoven, Netherlands

        Acceptance Rates

        Overall Acceptance Rate 34 of 100 submissions, 34%
