Abstract
High-performance deep learning methods typically rely on large annotated training datasets, which are difficult to obtain in many clinical applications due to the high cost of medical image labeling. Existing data assessment methods commonly require labels to be known in advance, which makes them unsuitable for our goal of ‘knowing which data to label.’ To this end, we formulate and propose a novel and efficient data assessment strategy, the EXponentiAl Marginal sINgular valuE (\(\textsf{EXAMINE}\)) score, to rank the quality of unlabeled medical image data based on their useful latent representations extracted via Self-supervised Learning (SSL) networks. Motivated by theoretical implications of the SSL embedding space, we leverage a Masked Autoencoder [8] for feature extraction. Furthermore, we evaluate the quality of each data point based on the marginal change in the largest singular value of the feature matrix when that point is excluded from the dataset. We conduct extensive experiments on a pathology dataset. Our results demonstrate the effectiveness and efficiency of the proposed method for selecting the most valuable data to label.
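To make the scoring procedure concrete, below is a minimal sketch of the leave-one-out singular-value computation described above, assuming the SSL (e.g. MAE) embeddings have already been extracted into a feature matrix. The exact exponential weighting and any normalization used in the published \(\textsf{EXAMINE}\) score may differ, and the helper names (`examine_scores`, `mae_encoder`) are hypothetical.

```python
import numpy as np

def examine_scores(Z: np.ndarray) -> np.ndarray:
    """Score each sample by the marginal change of the largest singular
    value of the feature matrix when that sample is left out.

    Z: (N, d) matrix of SSL (e.g. MAE) embeddings, one row per image.
    Returns an (N,) array of scores; larger means more valuable to label
    (the exponential form below is an assumed instantiation of EXAMINE).
    """
    # Largest singular value of the full embedding matrix.
    sigma_full = np.linalg.svd(Z, compute_uv=False)[0]

    scores = np.empty(len(Z))
    for i in range(len(Z)):
        Z_wo_i = np.delete(Z, i, axis=0)  # leave sample i out
        sigma_wo_i = np.linalg.svd(Z_wo_i, compute_uv=False)[0]
        # Removing a row can never increase the top singular value,
        # so the margin below is always nonnegative.
        scores[i] = np.exp(sigma_full - sigma_wo_i)
    return scores

# Usage sketch: extract features with a pretrained MAE encoder
# (hypothetical call), then send the top-k ranked images for labeling.
# Z = mae_encoder(unlabeled_images)
# top_k = np.argsort(-examine_scores(Z))[:k]
```

Because each leave-one-out SVD operates on the fixed, low-dimensional embedding matrix rather than retraining a model, the whole ranking needs only N + 1 small SVDs, which is what makes such a score cheaper to compute than LOO or Data Shapley valuation with a trained utility model (cf. Notes 1 and 3).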
Notes
- 1.
In practice, there are approximation methods for calculating the Shapley value, but they still require on the order of \(\mathcal{O}(T\,\textrm{poly}(N))\) computations [11].
- 2.
\(\lambda _S > \lambda _{S\backslash \{i\}}\) is guaranteed by the properties of singular values; see the short derivation after these notes.
- 3.
The running times of LOO and Data Shapley can increase significantly if a deep neural network is used as the utility model.
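For completeness, the monotonicity used in Note 2 follows directly from the variational characterization of the largest singular value. A minimal derivation, with \(z_j\) denoting the row of the feature matrix \(Z_S\) corresponding to sample \(j\), is sketched below.

```latex
% Removing a row cannot increase the top singular value of the feature matrix:
\[
  \lambda_S
  \;=\; \max_{\lVert v \rVert_2 = 1} \lVert Z_S v \rVert_2
  \;=\; \max_{\lVert v \rVert_2 = 1} \Big( \sum_{j \in S} \langle z_j, v \rangle^2 \Big)^{1/2}
  \;\ge\; \max_{\lVert v \rVert_2 = 1} \Big( \sum_{j \in S \setminus \{i\}} \langle z_j, v \rangle^2 \Big)^{1/2}
  \;=\; \lambda_{S \setminus \{i\}},
\]
% with strict inequality whenever the left-out embedding z_i is not
% orthogonal to the maximizing direction of Z_{S \setminus \{i\}}.
```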
References
1. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International Conference on Machine Learning, pp. 1597–1607. PMLR (2020)
2. Chen, Y., Wei, C., Kumar, A., Ma, T.: Self-training avoids using spurious features under domain shift. Adv. Neural Inf. Process. Syst. 33, 21061–21071 (2020)
3. Dosovitskiy, A., et al.: An image is worth 16x16 words: transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
4. Fadahunsi, K.P., et al.: Protocol for a systematic review and qualitative synthesis of information quality frameworks in eHealth. BMJ Open 9(3), e024722 (2019)
5. Fadahunsi, K.P., et al.: Information quality frameworks for digital health technologies: systematic review. J. Med. Internet Res. 23(5), e23479 (2021)
6. Ghorbani, A., Zou, J.: Data Shapley: equitable valuation of data for machine learning. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 2242–2251. PMLR (2019)
7. Grill, J.B., et al.: Bootstrap your own latent - a new approach to self-supervised learning. Adv. Neural Inf. Process. Syst. 33, 21271–21284 (2020)
8. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009 (2022)
9. Hinton, G.E., Salakhutdinov, R.R.: Reducing the dimensionality of data with neural networks. Science 313(5786), 504–507 (2006)
10. Jia, R., et al.: Towards efficient data valuation based on the Shapley value. In: The 22nd International Conference on Artificial Intelligence and Statistics, pp. 1167–1176. PMLR (2019)
11. Jia, R., Sun, X., Xu, J., Zhang, C., Li, B., Song, D.: An empirical and comparative analysis of data valuation with scalable algorithms (2019)
12. Kenton, J.D.M.W.C., Toutanova, L.K.: BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of NAACL-HLT, pp. 4171–4186 (2019)
13. Lee, J.D., Lei, Q., Saunshi, N., Zhuo, J.: Predicting what you already know helps: provable self-supervised learning. Adv. Neural Inf. Process. Syst. 34 (2021)
14. Loshchilov, I., Hutter, F.: SGDR: stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
15. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)
16. Redman, T.C.: Data Driven: Profiting from Your Most Important Business Asset. Harvard Business Press (2008)
17. Tosh, C., Krishnamurthy, A., Hsu, D.: Contrastive learning, multi-view redundancy, and linear models. In: Algorithmic Learning Theory, pp. 1179–1206. PMLR (2021)
18. Veeling, B.S., Linmans, J., Winkens, J., Cohen, T., Welling, M.: Rotation equivariant CNNs for digital pathology. In: Frangi, A.F., Schnabel, J.A., Davatzikos, C., Alberola-López, C., Fichtinger, G. (eds.) MICCAI 2018. LNCS, vol. 11071, pp. 210–218. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-00934-2_24
19. Zhang, R., Isola, P., Efros, A.A.: Colorful image colorization. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9907, pp. 649–666. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46487-9_40
Acknowledgement
This work is supported in part by the Natural Sciences and Engineering Research Council of Canada (NSERC) and an NVIDIA Hardware Award.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Huang, CY., Lei, Q., Li, X. (2022). Efficient Medical Image Assessment via Self-supervised Learning. In: Nguyen, H.V., Huang, S.X., Xue, Y. (eds) Data Augmentation, Labelling, and Imperfections. DALI 2022. Lecture Notes in Computer Science, vol 13567. Springer, Cham. https://doi.org/10.1007/978-3-031-17027-0_11
DOI: https://doi.org/10.1007/978-3-031-17027-0_11
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-17026-3
Online ISBN: 978-3-031-17027-0
eBook Packages: Computer Science, Computer Science (R0)