Process of Medical Dataset Construction for Machine Learning - Multifield Study and Guidelines

  • Conference paper
New Trends in Database and Information Systems (ADBIS 2021)

Abstract

The acquisition of high-quality data and annotations is essential for training efficient machine learning algorithms, yet it is an expensive and time-consuming process. Although data processing and the training and testing of machine learning models are well studied in the literature, the actual procedures for obtaining data and annotations in collaboration with physicians are in most cases based on the researchers' personal intuition and suppositions.

This article investigates practical aspects of medical data acquisition and annotation, as well as methods of collaboration between IT and medical teams, with the aim of building datasets that meet the desired quality, quantity, and time requirements. Drawing on five projects that the authors undertook in diverse medical fields, during which the dataset construction procedure was iteratively optimized, the article develops and describes a set of guidelines and good practices to follow when building new medical datasets.

https://cvlab.eti.pg.gda.pl/.

Notes

  1. https://developer.nvidia.com/blog/federated-learning-clara/.

  2. https://cvlab.eti.pg.gda.pl/projects/ers.

  3. http://www.itksnap.org/pmwiki/pmwiki.php.

  4. https://www.osirix-viewer.com/.

  5. https://download.slicer.org.

  6. https://www.medmetric.ai/.

Author information

Corresponding author

Correspondence to Tomasz Dziubich.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Cychnerski, J., Dziubich, T. (2021). Process of Medical Dataset Construction for Machine Learning - Multifield Study and Guidelines. In: Bellatreche, L., et al. New Trends in Database and Information Systems. ADBIS 2021. Communications in Computer and Information Science, vol 1450. Springer, Cham. https://doi.org/10.1007/978-3-030-85082-1_20

  • DOI: https://doi.org/10.1007/978-3-030-85082-1_20

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-85081-4

  • Online ISBN: 978-3-030-85082-1

  • eBook Packages: Computer Science, Computer Science (R0)
