Abstract
Annotations from domain experts are important for medical applications where an objective ground truth is difficult to define, e.g., rehabilitation for some chronic diseases, and prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper use of such annotations may hinder the development of reliable models. On one hand, forcing the use of a single ground truth generated from multiple annotations discards information useful for modeling. On the other hand, feeding the model all the annotations without proper regularization introduces noise, given the existing disagreements. To address these issues, we propose a novel Learning to Agree (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams: one stream fits the multiple annotators, while the other learns agreement information between annotators. In particular, the agreement learning stream provides regularization information to the classifier stream, tuning its decisions to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones; experiments on two medical datasets show that it achieves better agreement levels with annotators.
References
Aung, M.S., et al.: The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset. IEEE Trans. Affect. Comput. 7(4), 435–451 (2015)
Charpentier, B., Zügner, D., Günnemann, S.: Posterior network: uncertainty estimation without OOD samples via density-based pseudo-counts. arXiv preprint arXiv:2006.09239 (2020)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20(1), 37–46 (1960)
Cohen, J.: Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213 (1968)
Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Fan, C., et al.: Multi-horizon time series forecasting with temporal attention learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2527–2535 (2019)
Felipe, S., Singh, A., Bradley, C., Williams, A.C., Bianchi-Berthouze, N.: Roles for personal informatics in chronic pain. In: 2015 9th International Conference on Pervasive Computing Technologies for Healthcare, pp. 161–168. IEEE (2015)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: modeling individual labelers improves classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Hao, L., Naiman, D.Q.: Quantile Regression. Sage (2007)
Healey, J.: Recording affect in the field: towards methods and metrics for improving ground truth labels. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 107–116. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_14
Hu, N., Englebienne, G., Lou, Z., Kröse, B.: Learning to recognize human activities using soft labels. IEEE Trans. Pattern Anal. Mach. Intell. 39(10), 1973–1984 (2016)
Hu, P., Sclaroff, S., Saenko, K.: Uncertainty-aware learning for zero-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 33, 21713–21724 (2020)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Ji, W., et al.: Learning calibrated medical image segmentation via multi-rater agreement modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12341–12351 (2021)
Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)
Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In: British Machine Vision Conference (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kleinsmith, A., Bianchi-Berthouze, N., Steed, A.: Automatic recognition of non-acted affective postures. IEEE Trans. Syst. Man Cybern. Part B 41(4), 1027–1038 (2011)
Koenker, R., Hallock, K.F.: Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001)
de La Torre, J., Puig, D., Valls, A.: Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recogn. Lett. 105, 144–154 (2018)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474 (2016)
Lampert, T.A., Stumpf, A., Gançarski, P.: An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Trans. Image Process. 25(6), 2557–2572 (2016)
Leibig, C., Allken, V., Ayhan, M.S., Berens, P., Wahl, S.: Leveraging uncertainty information from deep neural networks for disease detection. Sci. Rep. 7(1), 1–14 (2017)
Li, X., Wang, W., Hu, X., Li, J., Tang, J., Yang, J.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641 (2021)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Long, C., Hua, G.: Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2839–2847 (2015)
Long, C., Hua, G., Kapoor, A.: Active visual recognition with expertise estimation in crowdsourcing. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3000–3007 (2013)
Lovchinsky, I., et al.: Discrepancy ratio: evaluating model performance when even experts disagree on the truth. In: International Conference on Learning Representations (2019)
Ma, L., Stückler, J., Kerl, C., Cremers, D.: Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 598–605. IEEE (2017)
Meng, H., Kleinsmith, A., Bianchi-Berthouze, N.: Multi-score learning for affect recognition: the case of body postures. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 225–234. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_26
Postels, J., Ferroni, F., Coskun, H., Navab, N., Tombari, F.: Sampling-free epistemic uncertainty estimation using approximated variance propagation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2931–2940 (2019)
Rajpurkar, P., et al.: MURA: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957 (2017)
Shen, Y., Zhang, Z., Sabuncu, M.R., Sun, L.: Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 707–716 (2021)
Singh, A., et al.: Go-with-the-flow: tracking, analysis and sonification of movement and breathing to build confidence in activity despite chronic pain. Hum.-Comput. Interact. 31(3–4), 335–383 (2016)
Surowiecki, J.: The Wisdom of Crowds. Anchor (2005)
Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D.C., Silberman, N.: Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253 (2019)
Wang, C., Gao, Y., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Leveraging activity recognition to enable protective behavior detection in continuous data. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 5(2) (2021)
Wang, C., Olugbade, T.A., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Chronic pain protective behavior detection with deep learning. ACM Trans. Comput. Healthc. 2(3), 1–24 (2021)
Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23(7), 903–921 (2004)
Yan, Y., et al.: Modeling annotator expertise: learning when everybody knows a bit of something. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939. JMLR Workshop and Conference Proceedings (2010)
Yan, Y., Rosales, R., Fung, G., Subramanian, R., Dy, J.: Learning from multiple annotators with varying expertise. Mach. Learn. 95(3), 291–327 (2014)
Zhang, L., et al.: Disentangling human error from the ground truth in segmentation of medical images. arXiv preprint arXiv:2007.15963 (2020)
A Appendix
A.1 Datasets
Two medical datasets are used, comprising body movement sequences and bone X-rays, respectively. Refer to the EmoPain [1] and MURA [34] datasets for more details.
A.2 Implementation Details
For experiments on the EmoPain dataset, the state-of-the-art HAR-PBD network [39] is adopted as the backbone, and Leave-One-Subject-Out validation is conducted across the participants with chronic pain (CP). The average performance across all folds is reported. The training data is augmented by adding Gaussian noise and cropping, as in [40]. The number of bins used in the general agreement distribution is set to 10, i.e., the respective softmax layer has 11 nodes. The \(\lambda \) used in the regularization function is set to 3.0. For experiments on the MURA dataset, the DenseNet-169 network [15] pretrained on the ImageNet dataset [6] is used as the backbone. The original validation set is used as the testing set, where the first view (image) of each of the 7 upper-extremity types of a subject is used. All images are resized to \(224\times 224\), and images in the training set are further augmented with random lateral inversions and rotations of up to 30\(^\circ \). The number of bins is set to 5, and \(\lambda \) is again set to 3.0. The number of bins (namely, n in the distribution) and \(\lambda \) were chosen by a grid search over \(n\in \{5,10,15,20,25,30\}\) and \(\lambda \in \{1.0,1.5,2.0,2.5,3.0,3.5\}\).
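To illustrate how a continuous agreement level can serve as a target for an \((n+1)\)-node softmax layer, the sketch below splits the probability mass of \(\alpha \) between its two neighbouring anchors. This two-anchor soft assignment is one common choice for such discrete distribution representations (cf. [26]); it is an assumption here, not necessarily the paper's exact construction.

```python
def agreement_to_distribution(alpha, n):
    """Represent a continuous agreement level alpha in [0, 1] as a
    discrete distribution over n + 1 evenly spaced anchors (hence a
    softmax layer with n + 1 nodes when n = 10, as in the paper).

    NOTE: this two-anchor assignment is an illustrative assumption;
    the paper defines its own target construction in the main text.
    """
    anchors = [k / n for k in range(n + 1)]
    dist = [0.0] * (n + 1)
    # locate the bin [anchors[k], anchors[k + 1]] containing alpha
    k = min(int(alpha * n), n - 1)
    left, right = anchors[k], anchors[k + 1]
    # split the probability mass linearly between the two neighbours,
    # so that the distribution's expectation recovers alpha exactly
    dist[k] = (right - alpha) / (right - left)
    dist[k + 1] = (alpha - left) / (right - left)
    return dist
```

A useful property of this representation is that the expectation over the anchors reproduces the original agreement level, so no information is lost by discretizing.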
For all experiments, the classifier stream is implemented as a fully-connected layer with a softmax activation and two output nodes for the binary classification task. Adam [19] is used as the optimizer with a learning rate of 1e-4, which is reduced by a factor of 10 if the performance does not improve for 10 epochs. The number of epochs is set to 50. The logarithmic loss is adopted by default, as written in Eqs. 5 and 6, while the WKL loss (Eq. 8) is used for comparison where mentioned. For the agreement learning stream, the AR loss is used for its distributional variant, while the RMSE is used for its linear regression variant. We implement our method with the TensorFlow deep learning library on a PC with an RTX 3080 GPU and 32 GB of memory.
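The learning-rate schedule described above (reduce by a factor of 10 when performance plateaus for 10 epochs) matches the behaviour of Keras' `ReduceLROnPlateau` callback; the class below is a simplified pure-Python sketch of that logic, not the actual training code.

```python
class PlateauLRSchedule:
    """Minimal sketch of the schedule used above: the learning rate
    starts at 1e-4 and is multiplied by `factor` whenever the
    monitored validation loss has not improved for `patience` epochs.
    Keras' ReduceLROnPlateau callback provides the same behaviour."""

    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate to use."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

For example, with the defaults, one improving epoch followed by ten stagnant epochs drops the rate from 1e-4 to 1e-5.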
A.3 Agreement Computation
For a binary task, the agreement level \(\alpha _i\) between annotators on a sample \(x_i\) is computed as the weighted average of the binary labels \(y_i^j\in \{0,1\}\) they provided:

$$\alpha _i = \frac{\sum _{j=1}^{\grave{J}} w_i^j\,y_i^j}{\sum _{j=1}^{\grave{J}} w_i^j},$$

where \(\grave{J}\) is the number of annotators that have labelled the sample \(x_i\). In this way, \(\alpha _i\in [0,1]\) stands for the agreement of annotators toward the positive class of the current binary task. We assume each sample was labelled by at least one annotator. \(w_i^j\) is the weight of the annotation provided by the j-th annotator, which could reflect the different levels of expertise of annotators. The weight can be set manually given prior knowledge about an annotator, or treated as a learnable parameter for the model to estimate. In this work, we treat annotators equally by setting \(w_i^j\) to 1, and leave the discussion of other settings to future work.
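The computation above is a weighted fraction of positive votes; a minimal sketch follows, with the function name chosen for illustration.

```python
def agreement_level(labels, weights=None):
    """Agreement of annotators toward the positive class of a binary
    task: the weighted average of their binary labels.

    labels:  binary labels {0, 1} from the annotators of one sample
             (at least one annotator per sample is assumed).
    weights: optional per-annotator weights; defaults to all 1,
             i.e. equal expertise, as in the paper.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)
```

For instance, two positive votes out of three equally weighted annotators give an agreement level of 2/3, while up-weighting a trusted annotator shifts the level toward that annotator's label.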
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, C. et al. (2023). Learn2Agree: Fitting with Multiple Annotators Without Objective Ground Truth. In: Chen, H., Luo, L. (eds) Trustworthy Machine Learning for Healthcare. TML4H 2023. Lecture Notes in Computer Science, vol 13932. Springer, Cham. https://doi.org/10.1007/978-3-031-39539-0_13
Print ISBN: 978-3-031-39538-3
Online ISBN: 978-3-031-39539-0