Abstract
Acquiring high-quality data and annotations is essential for training efficient machine learning algorithms, yet it is an expensive and time-consuming process. Although data processing and the training and testing of machine learning models are well studied in the literature, the actual procedures for obtaining data and annotations in collaboration with physicians are in most cases based on the researchers' personal intuition and assumptions.
This article investigates practical aspects of medical data acquisition and annotation, as well as methods of collaboration between IT and medical teams for building datasets that meet the desired quality, quantity, and time requirements. Based on five projects undertaken by the authors in diverse medical fields, in which the dataset construction procedure was iteratively optimized, a set of guidelines and good practices for building new medical datasets was developed.
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Cychnerski, J., Dziubich, T. (2021). Process of Medical Dataset Construction for Machine Learning - Multifield Study and Guidelines. In: Bellatreche, L., et al. New Trends in Database and Information Systems. ADBIS 2021. Communications in Computer and Information Science, vol 1450. Springer, Cham. https://doi.org/10.1007/978-3-030-85082-1_20
DOI: https://doi.org/10.1007/978-3-030-85082-1_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-85081-4
Online ISBN: 978-3-030-85082-1
eBook Packages: Computer Science, Computer Science (R0)