
Color Shadows (Part I): Exploratory Usability Evaluation of Activation Maps in Radiological Machine Learning

  • Conference paper
  • In: Machine Learning and Knowledge Extraction (CD-MAKE 2022)

Abstract

Although deep learning-based AI systems for diagnostic imaging tasks have shown virtually superhuman accuracy, their use in medical settings has been questioned due to their "black-box", non-interpretable nature. To address this shortcoming, several methods have been proposed to make AI eXplainable (XAI), including Pixel Attribution Methods (PAMs); however, it is still unclear whether these methods actually succeed in "opening" the black box and improving diagnosis, particularly in tasks where pathological conditions are difficult to detect. In this study, we focus on the detection of thoraco-lumbar fractures from X-rays, with the goal of assessing the impact of PAMs on diagnostic decision making by addressing two separate research questions: first, whether activation maps (as an instance of PAM) were perceived as useful in the aforementioned task; and, second, whether the maps were also capable of reducing the diagnostic error rate. We show that, even though the activation maps were not considered significantly useful by physicians, the image readers found high value in the maps along other perceptual dimensions (i.e., pertinency and coherence) and, most importantly, their accuracy significantly improved when given XAI support in a pilot study involving 7 doctors in the interpretation of a small, but carefully chosen, set of images.

Notes

  1.

    It is worth noting that activation maps are different from saliency maps, although the two terms are often used interchangeably. In fact, the two approaches rely on different methods to compute heatmaps: saliency maps are usually generated by back-propagation with respect to the input of the network [35], while activation maps are derived from the feature maps computed at a specific layer of the network [42]; see the sketch after these notes.

  2.

    More precisely, both specialists had to agree that the images were at least of level 3 ("sufficient image quality: moderate limitations for clinical use but no substantial loss of information") on the absolute Visual Grading Analysis (VGA) scale [28], a 5-value ordinal scale ranging from 1 ("excellent image quality: no limitations for clinical use") to 5 ("poor image quality: image not usable, loss of information, image must be repeated").

  3.

    Balance in model accuracy was also guaranteed at the class level: in each group (fractured vs. non-fractured), 2 images were associated with a misdiagnosis by the model, while 4 were correctly classified.

  4.

    https://www.limesurvey.org/.

  5.

    https://juxtapose.knightlab.com/.

  6.

    Clarity would be a fourth relevant dimension of heatmaps in radiological or XAI settings, as it relates to the accurate presentation of anatomical or pathological structures. However, this dimension was not assessed by the sample of readers involved, as the images had already been selected to be of optimal clarity. Moreover, the correlation between clarity and utility was conjectured to be too obvious to warrant investigation.

  7.

    It is noteworthy that the white-box paradox can also mislead doctors when the advice is wrong, in that it can convince them of the opposite, as reported in [7].

  8.

    It should be noted that "the implicit assumption [...] that the specific (diagnostic) message of the X-ray images resided inside them from the beginning, and that it is obscured either by technological or epistemological problems [is contestable as too naive]". Conversely, it has been argued [33] that "the specific content of the images was shaped by the activities of X-ray workers within the context of medical developments of the time" when X-ray imaging was introduced into medical practice at the beginning of the 20th century. Nowadays, we could be witnessing the same phenomenon, in which radiologists, specialists, data scientists and ML developers could, if willing, participatorily co-develop a machine semiotics for specific diagnostic tasks.
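
To make the distinction drawn in Note 1 concrete, here is a minimal sketch of the two kinds of heatmap. It assumes an off-the-shelf torchvision ResNet-18 with random weights purely to show the mechanics; the paper's actual fracture-detection model and preprocessing are not reproduced here.

```python
# Minimal sketch of the two heatmap techniques contrasted in Note 1.
# Hypothetical setup: a torchvision ResNet-18 with random weights stands in
# for the actual fracture-detection model used in the study.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None).eval()
x = torch.rand(1, 3, 224, 224, requires_grad=True)  # stand-in for an X-ray

# Saliency map [35]: back-propagate the top class score to the input pixels.
score = model(x).max(dim=1).values
score.backward()
saliency = x.grad.abs().max(dim=1).values  # (1, 224, 224) input-space heatmap

# Activation map (CAM) [42]: reweight the feature maps of the last
# convolutional block by the classifier weights of the predicted class.
feats = {}
model.layer4.register_forward_hook(lambda mod, inp, out: feats.update(a=out))
with torch.no_grad():
    logits = model(x)
    cls = logits.argmax(dim=1)                    # predicted class index
    w = model.fc.weight[cls]                      # (1, C) class weights
    cam = torch.einsum("nc,nchw->nhw", w, feats["a"])
    cam = F.relu(cam)                             # keep positive evidence only
    cam = F.interpolate(cam[None], size=x.shape[-2:], mode="bilinear")[0]
```

The contrast is visible in the code itself: the saliency map differentiates a class score with respect to the input, whereas the activation map never touches input gradients and is assembled from intermediate feature maps, which is why the two families of methods can highlight different regions of the same image.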

References

  1. Aggarwal, R., et al.: Diagnostic accuracy of deep learning in medical imaging: a systematic review and meta-analysis. NPJ Digit. Med. 4(1), 1–23 (2021)

  2. Alqaraawi, A., Schuessler, M., Weiß, P., Costanza, E., Berthouze, N.: Evaluating saliency map explanations for convolutional neural networks: a user study. In: Proceedings of the 25th International Conference on Intelligent User Interfaces, pp. 275–285 (2020)

  3. Arun, N., et al.: Assessing the trustworthiness of saliency maps for localizing abnormalities in medical imaging. Radiol. Artif. Intell. 3(6), e200267 (2021)

  4. Ayhan, M.S., et al.: Clinical validation of saliency maps for understanding deep neural networks in ophthalmology. Med. Image Anal. 77, 102364 (2022)

  5. Badgeley, M.A., et al.: Deep learning predicts hip fracture using confounding patient and healthcare variables. NPJ Digit. Med. 2(1), 1–10 (2019)

  6. Balki, I., et al.: Sample-size determination methodologies for machine learning in medical imaging research: a systematic review. Can. Assoc. Radiol. J. 70(4), 344–353 (2019)

  7. Bansal, G., et al.: Does the whole exceed its parts? The effect of AI explanations on complementary team performance. In: Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1–16 (2021)

  8. Becherer, N., Pecarina, J., Nykl, S., Hopkinson, K.: Improving optimization of convolutional neural networks through parameter fine-tuning. Neural Comput. Appl. 31(8), 3469–3479 (2017). https://doi.org/10.1007/s00521-017-3285-0

  9. Brynjolfsson, E., Mitchell, T.: What can machine learning do? Workforce implications. Science 358(6370), 1530–1534 (2017)

  10. Cabitza, F., Campagner, A., Del Zotti, F., Ravizza, A., Sternini, F.: All you need is higher accuracy? On the quest for minimum acceptable accuracy for medical artificial intelligence. In: e-Health 2020, Proceedings of the 12th International Conference on e-Health, pp. 159–166 (2020)

  11. Cabitza, F.: Biases affecting human decision making in AI-supported second opinion settings. In: Torra, V., Narukawa, Y., Pasi, G., Viviani, M. (eds.) MDAI 2019. LNCS (LNAI), vol. 11676, pp. 283–294. Springer, Cham (2019). https://doi.org/10.1007/978-3-030-26773-5_25

  12. Cabitza, F., Campagner, A., Cavosi, V.: Assessing the impact of medical AI: a survey of physicians’ perceptions. In: 2021 5th International Conference on Medical and Health Informatics, pp. 225–231 (2021)

  13. Cabitza, F., Campagner, A., Simone, C.: The need to move away from agential-AI: empirical investigations, useful concepts and open issues. Int. J. Hum Comput Stud. 155, 102696 (2021)

  14. Chinn, S.: A simple method for converting an odds ratio to effect size for use in meta-analysis. Stat. Med. 19(22), 3127–3131 (2000)

  15. Chlap, P., Min, H., Vandenberg, N., Dowling, J., Holloway, L., Haworth, A.: A review of medical image data augmentation techniques for deep learning applications. J. Med. Imaging Radiat. Oncol. 65(5), 545–563 (2021)

  16. Croskerry, P., Cosby, K., Graber, M.L., Singh, H.: Diagnosis: Interpreting the Shadows. CRC Press, Boca Raton (2017)

  17. Delmas, P.D., et al.: Underdiagnosis of vertebral fractures is a worldwide problem: the IMPACT study. J. Bone Miner. Res. 20(4), 557–563 (2005)

  18. Ghassemi, M., Oakden-Rayner, L., Beam, A.L.: The false hope of current approaches to explainable artificial intelligence in health care. Lancet Digit. Health 3(11), e745–e750 (2021)

  19. Han, T., et al.: Advancing diagnostic performance and clinical usability of neural networks via adversarial training and dual batch normalization. Nat. Commun. 12(1), 1–11 (2021)

  20. Handelman, G.S., et al.: Peering into the black box of artificial intelligence: evaluation metrics of machine learning methods. Am. J. Roentgenol. 212(1), 38–43 (2019)

  21. Holzinger, A., Saranti, A., Molnar, C., Biecek, P., Samek, W.: Explainable AI methods - a brief overview. In: Holzinger, A., Goebel, R., Fong, R., Moon, T., Müller, K.R., Samek, W. (eds.) xxAI 2020. LNCS, vol. 13200, pp. 13–38. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-04083-2_2

  22. Holzinger, A.T., Muller, H.: Toward human-AI interfaces to support explainability and causability in medical AI. Computer 54(10), 78–86 (2021)

  23. Hwang, E.J., et al.: Development and validation of a deep learning-based automated detection algorithm for major thoracic diseases on chest radiographs. JAMA Netw. Open 2(3), e191095 (2019)

  24. Jha, S., Topol, E.J.: Adapting to artificial intelligence: radiologists and pathologists as information specialists. JAMA 316(22), 2353–2354 (2016)

  25. Ke, A., Ellsworth, W., Banerjee, O., Ng, A.Y., Rajpurkar, P.: CheXtransfer: performance and parameter efficiency of ImageNet models for chest X-Ray interpretation. CoRR abs/2101.06871 (2021)

  26. Liu, X., et al.: A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit. Health 1(6), e271–e297 (2019)

  27. Lohoff, L., Rühr, A.: Introducing (machine) learning ability as antecedent of trust in intelligent systems. In: ECIS 2021 Research Papers, vol. 23 (2021)

  28. Ludewig, E., Richter, A., Frame, M.: Diagnostic imaging-evaluating image quality using visual grading characteristic (VGC) analysis. Vet. Res. Commun. 34(5), 473–479 (2010). https://doi.org/10.1007/s11259-010-9413-2

  29. Lyell, D., Coiera, E.: Automation bias and verification complexity: a systematic review. J. Am. Med. Inform. Assoc. 24(2), 423–431 (2017)

  30. Nandi, A., Pal, A.K.: Detailing image interpretability methods. In: Nandi, A., Pal, A.K. (eds.) Interpreting Machine Learning Models, pp. 271–293. Springer, Cham (2022). https://doi.org/10.1007/978-1-4842-7802-4_12

  31. Neves, I., et al.: Interpretable heartbeat classification using local model-agnostic explanations on ECGs. Comput. Biol. Med. 133, 104393 (2021)

  32. Olczak, J., et al.: Artificial intelligence for analyzing orthopedic trauma radiographs. Acta Orthop. 88(6), 581–586 (2017)

  33. Pasveer, B.: Knowledge of shadows: the introduction of X-ray images in medicine. Sociol. Health Illn. 11(4), 360–381 (1989)

  34. Ribeiro, M.T., Singh, S., Guestrin, C.: Model-agnostic interpretability of machine learning. arXiv preprint arXiv:1606.05386 (2016)

  35. Simonyan, K., Vedaldi, A., Zisserman, A.: Deep inside convolutional networks: visualising image classification models and saliency maps. In: Workshop at International Conference on Learning Representations (2014)

  36. Spinks, G., Moens, M.F.: Justifying diagnosis decisions by deep neural networks. J. Biomed. Inform. 96, 103248 (2019)

  37. Taylor, R.: Interpretation of the correlation coefficient: a basic review. J. Diagn. Med. Sonogr. 6(1), 35–39 (1990)

  38. Tiulpin, A., Thevenot, J., Rahtu, E., Lehenkari, P., Saarakkala, S.: Automatic knee osteoarthritis diagnosis from plain radiographs: a deep learning-based approach. Sci. Rep. 8(1), 1–10 (2018)

  39. Tschandl, P., et al.: Human-computer collaboration for skin cancer recognition. Nat. Med. 26(8), 1229–1234 (2020)

  40. Xie, S., Girshick, R., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1492–1500 (2017)

  41. Yang, S., Yin, B., Cao, W., Feng, C., Fan, G., He, S.: Diagnostic accuracy of deep learning in orthopaedic fractures: a systematic review and meta-analysis. Clin. Radiol. 75(9), 713-e17 (2020)

  42. Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., Torralba, A.: Learning deep features for discriminative localization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2016

Author information

Corresponding author: Federico Cabitza.



Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper

Cite this paper

Cabitza, F., Campagner, A., Famiglini, L., Gallazzi, E., La Maida, G.A. (2022). Color Shadows (Part I): Exploratory Usability Evaluation of Activation Maps in Radiological Machine Learning. In: Holzinger, A., Kieseberg, P., Tjoa, A.M., Weippl, E. (eds) Machine Learning and Knowledge Extraction. CD-MAKE 2022. Lecture Notes in Computer Science, vol 13480. Springer, Cham. https://doi.org/10.1007/978-3-031-14463-9_3

  • DOI: https://doi.org/10.1007/978-3-031-14463-9_3

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-14462-2

  • Online ISBN: 978-3-031-14463-9

  • eBook Packages: Computer Science; Computer Science (R0)
