
Spoken Corpora Data, Automatic Speech Recognition, and Bias Against African American Language: The case of Habitual 'Be'

Published: 01 March 2021

Abstract

Recent work has revealed that major automatic speech recognition (ASR) systems, such as those from Apple, Amazon, Google, IBM, and Microsoft, perform much more poorly for Black U.S. speakers than for white U.S. speakers. Researchers postulate that this may be a result of biased datasets, which are largely racially homogeneous. However, while the study of ASR performance with regard to the intersection of racial identity and language use is slowly gaining traction within AI, machine learning, and algorithmic bias research, little to nothing has been done to examine the data in the spoken corpora commonly used to train and evaluate ASR systems in order to determine whether they are actually biased. This study seeks to begin addressing this gap by investigating spoken corpora used for ASR training and evaluation for a grammatical feature of what the field of linguistics terms African American Language (AAL), a systematic, rule-governed, and legitimate linguistic variety spoken by many (but not all) African Americans in the U.S. This grammatical feature, habitual 'be', is an uninflected form of 'be' that encodes habituality, as in "I be in my office by 7:30am", paraphrasable in Standardized American English as "I am usually in my office by 7:30". This study applies established corpus linguistics methods to the transcribed data of four major spoken corpora -- Switchboard, Fisher, TIMIT, and LibriSpeech -- to measure the frequency, distribution, and usage of habitual 'be' within each corpus as compared to a reference corpus of spoken AAL, the Corpus of Regional African American Language (CORAAL). The results show that habitual 'be' appears far less frequently, is dispersed across far fewer transcribed texts, and is surrounded by a much less diverse set of word types and parts of speech in the four ASR corpora than in CORAAL. This work provides foundational evidence that spoken corpora used in the training and evaluation of widely used ASR systems are, in fact, biased against AAL and likely contribute to poorer ASR performance for Black users.
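
The abstract describes a corpus-linguistic comparison: how often habitual 'be' occurs, how widely it is dispersed across transcripts, and how varied its surrounding context is. As a rough illustration of that kind of analysis (not the authors' actual method), the sketch below uses a naive regular-expression heuristic -- a subject pronoun immediately followed by bare 'be' -- to tally candidate tokens, document dispersion, and the variety of following words across a directory of plain-text transcripts. The directory layout, the heuristic, and every name in it are illustrative assumptions; reliably identifying habitual 'be' requires careful linguistic annotation.

    # Minimal, hypothetical sketch of a frequency/dispersion count for habitual 'be'.
    # The regex heuristic and file layout are assumptions for illustration only,
    # not the method used in the paper.
    import re
    from collections import Counter
    from pathlib import Path

    # Rough proxy: a personal pronoun immediately followed by uninflected "be".
    HABITUAL_BE = re.compile(r"\b(i|you|we|they|he|she|it)\s+be\s+(\w+)", re.IGNORECASE)

    def analyze_corpus(transcript_dir: str) -> dict:
        """Count candidate habitual-'be' tokens, how many transcripts contain them,
        and how varied the words immediately following them are."""
        total_hits = 0
        files_with_hits = 0
        total_files = 0
        following_words = Counter()

        for path in Path(transcript_dir).glob("*.txt"):  # assumed plain-text transcripts
            total_files += 1
            text = path.read_text(encoding="utf-8", errors="ignore")
            hits = HABITUAL_BE.findall(text)
            if hits:
                files_with_hits += 1
                total_hits += len(hits)
                following_words.update(word.lower() for _, word in hits)

        return {
            "tokens": total_hits,
            "dispersion": f"{files_with_hits}/{total_files} transcripts",
            "distinct_following_words": len(following_words),
            "most_common_context": following_words.most_common(10),
        }

    if __name__ == "__main__":
        # e.g. compare a CORAAL-style directory against an ASR training corpus directory
        print(analyze_corpus("transcripts/coraal"))
        print(analyze_corpus("transcripts/switchboard"))

A comparison of the returned counts across corpora would mirror, in spirit, the frequency and dispersion contrasts the abstract reports between CORAAL and the four ASR corpora.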




      Published In

      FAccT '21: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency
      March 2021
      899 pages
      ISBN:9781450383097
      DOI:10.1145/3442188
      Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 01 March 2021


      Author Tags

      1. African American Language
      2. automatic speech recognition
      3. datasets
      4. linguistic bias
      5. racial bias
      6. spoken corpora

      Qualifiers

      • Abstract
      • Research
      • Refereed limited

      Conference

      FAccT '21

      Cited By

      • (2023) Augmented Datasheets for Speech Datasets and Ethical Decision-Making. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, 881-904. DOI: 10.1145/3593013.3594049. Online publication date: 12-Jun-2023.
      • (2023) Considerations for Ethical Speech Recognition Datasets. Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 1287-1288. DOI: 10.1145/3539597.3575793. Online publication date: 27-Feb-2023.
      • (2022) Language variation and algorithmic bias: understanding algorithmic bias in British English automatic speech recognition. Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, 521-534. DOI: 10.1145/3531146.3533117. Online publication date: 21-Jun-2022.
      • (2022) BiasHacker: Voice Command Disruption by Exploiting Speaker Biases in Automatic Speech Recognition. Proceedings of the 15th ACM Conference on Security and Privacy in Wireless and Mobile Networks, 119-124. DOI: 10.1145/3507657.3528558. Online publication date: 16-May-2022.
      • (2022) Don't Speak Too Fast: The Impact of Data Bias on Self-Supervised Speech Models. ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 3258-3262. DOI: 10.1109/ICASSP43922.2022.9747897. Online publication date: 23-May-2022.
      • (2022) Bias in Automatic Speech Recognition: The Case of African American Language. Applied Linguistics 44(4), 613-630. DOI: 10.1093/applin/amac066. Online publication date: 14-Dec-2022.
      • (2021) Artificial intelligence language models and the false fantasy of participatory language policies. Working papers in Applied Linguistics and Linguistics at York, 4-15. DOI: 10.25071/2564-2855.51. Online publication date: 13-Sep-2021.
