
Dirty data labeled dirt cheap: epistemic injustice in machine learning systems

  • Original Paper
  • Published in Ethics and Information Technology

Abstract

Artificial intelligence (AI) and machine learning (ML) systems increasingly purport to deliver knowledge about people and the world. Unfortunately, they also seem to frequently present results that repeat or magnify biased treatment of racial and other vulnerable minorities. This paper proposes that at least some of the problems with AI’s treatment of minorities can be captured by the concept of epistemic injustice. To substantiate this claim, I argue that (1) pretrial detention and physiognomic AI systems commit testimonial injustice because their target variables reflect inaccurate and unjust proxies for what they claim to measure; (2) classification systems, such as facial recognition, commit hermeneutic injustice because their classification taxonomies, almost no matter how they are derived, reflect and perpetuate racial and other stereotypes; and (3) epistemic injustice better explains what is going wrong in these types of situations than does the more common focus on procedural (un)fairness.


Notes

  1. John Symons and Ramón Alvarado (2022), for example, identify injustice in the opacity of ML systems. Giorgia Pozzi (2023a, 2023b) looks at epistemic injustice in the context of automated opioid misuse prediction systems in healthcare. Epistemic injustice has also been identified in social media algorithms (Stewart, Cichocki, and McLeod, 2022) and natural-language processing systems (De Proost and Pozzi, 2023; Laacke, 2023).

  2. For some prominent examples, see, e.g., (Barocas & Selbst, 2016; Benjamin, 2019; Buolamwini & Gebru, 2018; Eubanks, 2017; Friedman & Nissenbaum, 1996; Noble, 2018). I do not claim that this is a comprehensive list.

  3. For facial recognition, see (Buolamwini & Gebru, 2018); on policing, see the discussion and references below.

  4. On this concept, see, e.g., (Buchman, Ho, & Goldberg, 2017; Carel & Kidd, 2017; Collins, 2017; Fricker, 2007; Kidd, Medina, & Pohlhaus, 2017; Medina, 2018; Wardrope, 2015). I draw primarily from Fricker and Medina, though clearly other treatments are both possible and needed.

  5. Black defendants misidentified by facial recognition systems have faced similar obstacles. See, e.g., Alicia Solow-Niederman’s account of the month Robert Williams spent in detention after being misidentified by facial recognition (2023, pp. 131–133). Solow-Niederman takes Williams’ case as exemplary of the “grey holes” that algorithmic systems can introduce: people are nominally protected from those systems, but their protections are practically unusable.

  6. The example is analogous to Fabian Beigang’s example of a disease-modeling algorithm that correctly identifies 95% of people who have a given disease, and does so robustly across gender. However, because disease prevalence is much higher among men, the positive predictive value of the algorithm—“the probability of actually having the disease given that one receives a positive test result”—is much higher for men than women, who will much more frequently be erroneously told they have the disease. Beigang notes that “this is not due to bias in the testing device, but just to the prevalence of the disease, which differs across genders” (2023, p. 175).
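To make the arithmetic concrete, the following is a minimal sketch of how positive predictive value diverges across groups with different prevalence. The 95% sensitivity figure is from Beigang's example; the specificity and the two prevalence figures are illustrative assumptions, not his numbers.

```python
# Hypothetical illustration of Beigang's point: the same test, applied to
# groups with different disease prevalence, yields different positive
# predictive values (PPV) even though sensitivity and specificity are equal.

def positive_predictive_value(sensitivity, specificity, prevalence):
    """P(disease | positive test), via Bayes' theorem."""
    true_positives = sensitivity * prevalence
    false_positives = (1 - specificity) * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

SENSITIVITY = 0.95   # test catches 95% of actual cases (as in Beigang's example)
SPECIFICITY = 0.95   # assumed; not specified in the note
PREV_MEN = 0.20      # assumed higher prevalence among men
PREV_WOMEN = 0.02    # assumed lower prevalence among women

print(f"PPV (men):   {positive_predictive_value(SENSITIVITY, SPECIFICITY, PREV_MEN):.2f}")
print(f"PPV (women): {positive_predictive_value(SENSITIVITY, SPECIFICITY, PREV_WOMEN):.2f}")
# With these assumed numbers, a positive result indicates disease roughly 83%
# of the time for men but only roughly 28% of the time for women: most positive
# results for women are false alarms, even though the test itself is "unbiased."
```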

  7. For detailed summaries and relevant citations, see, e.g., (Green, 2021; Mayson, 2019).

  8. For the impossibility theorem, see especially (Beigang, 2023).

  9. Eva proposes base-rate tracking.

  10. These epistemic problems also present a limitation to base-rate tracking: it only works if the base rate of the phenomenon in question is knowable. This is a particular problem for things like crime rates.

  11. Cf. (Pozzi, 2023a), outlining the stigmatizing effects of poor proxy variable selection in opioid misuse risk assessment systems.

  12. The New Inquiry’s “Heatmap of White Collar Crime” makes the point vividly. See: https://whitecollar.thenewinquiry.com/

  13. The problem is not just pretrial detention. For example, the PredPol system seems to disproportionately target Black and Latino neighborhoods (Sankin, Mehrotra, Mattu, & Gilbertson, 2021). On predictive policing see also (Selbst, 2017).

  14. For a thorough dismantling of this paper that connects it to physiognomic systems, see (Agüera y Arcas, Mitchell, & Todorov, 2017). In an apologia response piece, the authors claim that their only intent is academic research, and that they are shocked that it was taken otherwise, though they admit that “taking a court conviction at its face value, i.e., as the ‘ground truth’ for machine learning, was indeed a serious oversight on our part.” They also emphatically insist that critics commit a base rate fallacy: China has a low crime rate, so someone flagged by the algorithm “is found to have a probability of only 4.39% to break the law, despite being tested positive by a method of unbelievably high accuracy” (Wu and Zhang, 2017, pp. 1, 2). It seems to me that this underscores the epistemic injustice of taking a 4% chance that someone will commit a crime over evidence that they might provide through testimony of virtually any sort. Of course, someone who tests positive is predicted to be much more likely to commit a crime than someone who tests negative, and this jump in relative risk is all that a carceral system needs to claim that it ought to expand surveillance, harassment, etc., for a positive case.
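The base-rate arithmetic here can be sketched as follows. Only the roughly 4% posterior figure comes from Wu and Zhang's reply; the base rate, sensitivity, and false-positive rate below are assumptions chosen purely for illustration.

```python
# Rough sketch of the base-rate arithmetic behind this note. The base rate and
# error rates below are assumed for illustration; only the ~4% posterior figure
# is reported by Wu and Zhang.

BASE_RATE = 0.0025          # assumed: 0.25% of the population commits a crime
SENSITIVITY = 0.90          # assumed: classifier flags 90% of actual offenders
FALSE_POSITIVE_RATE = 0.05  # assumed: classifier flags 5% of non-offenders

def posterior(flagged: bool) -> float:
    """P(offender | classifier output), via Bayes' theorem."""
    p_result_offender = SENSITIVITY if flagged else 1 - SENSITIVITY
    p_result_innocent = FALSE_POSITIVE_RATE if flagged else 1 - FALSE_POSITIVE_RATE
    joint_offender = p_result_offender * BASE_RATE
    joint_innocent = p_result_innocent * (1 - BASE_RATE)
    return joint_offender / (joint_offender + joint_innocent)

p_pos, p_neg = posterior(True), posterior(False)
print(f"P(offender | flagged):     {p_pos:.2%}")  # a few percent at most
print(f"P(offender | not flagged): {p_neg:.2%}")  # a small fraction of a percent
print(f"Relative risk:             {p_pos / p_neg:.0f}x")
# The absolute probability of offending stays tiny for a flagged person, but the
# relative risk compared to an unflagged person is large, which is all a carceral
# system needs to justify expanded surveillance of positive cases.
```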

  15. See also Ruha Benjamin’s (2019) discussion of the differential hypervisibility of Black celebrities versus communities. For the hyper-surveillance of the homeless, day laborers, undocumented migrants, and those with felony convictions, see (Gilman & Green, 2018).

  16. This could “be due to a biological difference such as a difference in facial brightness. It could also be due to a group preference for makeup, or perhaps related to how the photograph is taken. Some people might use a mobile phone to take photographs for dating profiles and others might have them taken in a professionally lit photographic studio. The types of post-processing applied to photographs might vary between groups. It is also possible that photographs from different types of mobile phones or those that are uploaded to different dating websites are processed with different image compression algorithms and that there are artifacts resulting from these methods that are easily detectable by ML models” (Leuner, 2019, p. 51).

  17. For this point in the context of pretrial detention programs, see (Mayson, 2019). The scalability of facial recognition systems is part of what is behind calls for their abolition, as for example in (Selinger & Hartzog, 2019).

  18. (Citron, 2008, pp. 1271–1272). For some recent work, see (Araujo, Helberger, Kruikemeier, & de Vreese, 2020). The interaction between automated systems and human decisionmakers is complex and an area of ongoing research. See (Gerdon, Bach, Kern, and Kreuter, 2022, pp. 7–9).

  19. Interestingly for the argument being developed here, those who felt that their harassment didn’t readily fit HeartMob’s classificatory system tended to report feeling unsupported (Blackwell et al., 2017).

  20. For example, facial recognition systems tend to under-recognize dark-skinned women because they rely on training data that overrepresents white men (Buolamwini & Gebru, 2018). Natural language processing datasets underrepresent non-Western languages (Bender, Gebru, McMillan-Major, & Mitchell, 2021). Datasets of household objects perform poorly on objects from low- and middle-income countries (LMICs) because they rely on Flickr and English (DeVries, Misra, Wang, & Maaten, 2019). The English Colossal Clean Crawled Corpus contains surprising amounts of military text (from .mil domains), patents, and machine-generated translations, especially of non-English patents. The implications of that are unclear, but it should be apparent that most people do not speak in the idiom of patent applications. Initial efforts to curate datasets can introduce further such problems. For example, the cleaned version of the English Colossal Clean Crawled Corpus disproportionately blocks out mentions of sexual orientation, as well as texts that appear to be African-American English or Latinx English (Birhane et al., 2021b).

  21. Other analyses show that this is typical (Scheuerman, Paul, & Brubaker, 2019; Scheuerman et al., 2020). For a comprehensive study of the ways that gender binarism is built into governmental database systems, see (Waldman, 2023).

  22. For example, the documentation of the computer vision datasets analyzed by Scheuerman, Denton, and Hanna (2021) emphasized scale and comprehensiveness in the service of higher accuracy. (Bender et al., 2021) note a similar trajectory for NLP datasets, as does (Birhane et al., 2021b) for multimodal datasets.

  23. This state of affairs does not seem to trouble most of those who work on the datasets. As Scheuerman, Denton and Hanna summarize their comprehensive look at the documentation of several datasets, “valuing efficiency was at the cost of care, valuing slow and thoughtful decision-making and data processes, considering more ethical ways to collect data and treat annotators, and seeking fairer compensation—or even reporting compensation—for data labor” and “in general, there was little to no discussion about ethics when conducting work with annotators or with human subjects as data instances” (Scheuerman et al., 2021, p. 25).

  24. This is particularly troubling given the prevalence of nonconsensual pornography online; not only are victims harmed when that material is disseminated, they are then forced to further their own sexualization by serving as data for the classification of people who look like them. One recent study shows that 1 in 12 people, mostly women, have been victims of nonconsensual pornography at least once (Ruvalcaba & Eaton, 2020).

  25. For Mars Clickworkers, see (Benkler, 2006). On the failures of scaling, see (Birhane et al., 2021b) and the references therein.

  26. For similar sentiments, see, e.g., (Birhane, 2021; Green, 2020; Green & Viljoen, 2020; Kalluri, 2020; Keyes, Hutson, & Durbin, 2019; Lin & Cameron Chen, 2022). See also the literature review in (Weinberg, 2022).

  27. For a thorough discussion of the ways algorithmic systems embed values, and how those generate various forms of bias, see (Fazelpour & Danks, 2021). Beigang (2023) argues that the contradiction between predictive parity and equalized odds fairness can be resolved by accounting for different prevalence rates in the predicted populations. For example, to know whether an algorithmic system discriminated against women in disease prediction, it would be necessary to know the prevalence of the disease in male and female subpopulations. This strategy effectively builds the importance of context into the assessment of algorithmic fairness, though it does not resolve problems with knowing base rates, with whether the target variable is a good proxy for the underlying social phenomenon, or with whether reliance on (for example) carceral data is justified.
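A toy confusion-matrix sketch of the tension Beigang addresses, using invented counts for two groups whose base rates differ: the classifier satisfies equalized odds but not predictive parity. All numbers are assumptions for illustration only.

```python
# Toy illustration of the impossibility result referenced in this note: with
# different base rates, a classifier can satisfy equalized odds (equal true-
# and false-positive rates across groups) while violating predictive parity
# (equal precision/PPV across groups). All counts below are invented.

from dataclasses import dataclass

@dataclass
class GroupOutcomes:
    tp: int  # predicted positive, actually positive
    fp: int  # predicted positive, actually negative
    fn: int  # predicted negative, actually positive
    tn: int  # predicted negative, actually negative

    @property
    def tpr(self): return self.tp / (self.tp + self.fn)  # true positive rate
    @property
    def fpr(self): return self.fp / (self.fp + self.tn)  # false positive rate
    @property
    def ppv(self): return self.tp / (self.tp + self.fp)  # precision

group_a = GroupOutcomes(tp=400, fp=100, fn=100, tn=400)  # base rate 50%
group_b = GroupOutcomes(tp=80,  fp=180, fn=20,  tn=720)  # base rate 10%

for name, g in [("A", group_a), ("B", group_b)]:
    print(f"Group {name}: TPR={g.tpr:.2f}  FPR={g.fpr:.2f}  PPV={g.ppv:.2f}")
# Both groups share TPR = 0.80 and FPR = 0.20, so equalized odds holds, yet PPV
# is 0.80 for group A and only about 0.31 for group B, so predictive parity
# fails, purely because the base rates differ, as Beigang emphasizes.
```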

  28. Poorly defined and controversial terms also risk sliding into the space of essentially contested concepts (Mulligan, Koopman, & Doty, 2016; Mulligan, Kroll, Kohli, & Wong, 2019).

  29. For a general critique of ideal theory, see (Mills, 2005).

  30. For some initial work from within the computing community on the possible social roles of computing, see (Abebe et al., 2020). See also Barabas (2022) on the importance of developing capacities to refuse datafication.


Author information

Correspondence to Gordon Hull.



About this article


Cite this article

Hull, G. Dirty data labeled dirt cheap: epistemic injustice in machine learning systems. Ethics Inf Technol 25, 38 (2023). https://doi.org/10.1007/s10676-023-09712-y

