Learn2Agree: Fitting with Multiple Annotators Without Objective Ground Truth

  • Conference paper
Trustworthy Machine Learning for Healthcare (TML4H 2023)

Abstract

Annotation by domain experts is important for medical applications where an objective ground truth is difficult to define, e.g., rehabilitation for chronic diseases and the prescreening of musculoskeletal abnormalities without further medical examination. However, improper use of such annotations may hinder the development of reliable models. On the one hand, forcing the use of a single ground truth generated from multiple annotations is less informative for modeling. On the other hand, feeding the model all annotations without proper regularization is noisy given the existing disagreements. To address these issues, we propose a novel Learning to Agree (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without an objective ground truth. The framework has two streams: one fits the multiple annotators, and the other learns agreement information between annotators. In particular, the agreement learning stream provides regularization information to the classifier stream, tuning its decisions to be better in line with the agreement between annotators. The proposed method can easily be added to existing backbones, and experiments on two medical datasets show better agreement levels with annotators.


References

  1. Aung, M.S., et al.: The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset. IEEE Trans. Affect. Comput. 7(4), 435–451 (2015)

  2. Charpentier, B., Zügner, D., Günnemann, S.: Posterior network: uncertainty estimation without OOD samples via density-based pseudo-counts. arXiv preprint arXiv:2006.09239 (2020)

  3. Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20(1), 37–46 (1960)

  4. Cohen, J.: Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213 (1968)

  5. Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)

  6. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)

  7. Fan, C., et al.: Multi-horizon time series forecasting with temporal attention learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2527–2535 (2019)

  8. Felipe, S., Singh, A., Bradley, C., Williams, A.C., Bianchi-Berthouze, N.: Roles for personal informatics in chronic pain. In: 2015 9th International Conference on Pervasive Computing Technologies for Healthcare, pp. 161–168. IEEE (2015)

  9. Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)

  10. Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: modeling individual labelers improves classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)

  11. Hao, L., Naiman, D.Q.: Quantile Regression. Sage (2007)

  12. Healey, J.: Recording affect in the field: towards methods and metrics for improving ground truth labels. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 107–116. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_14

  13. Hu, N., Englebienne, G., Lou, Z., Kröse, B.: Learning to recognize human activities using soft labels. IEEE Trans. Pattern Anal. Mach. Intell. 39(10), 1973–1984 (2016)

  14. Hu, P., Sclaroff, S., Saenko, K.: Uncertainty-aware learning for zero-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 33, 21713–21724 (2020)

  15. Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)

  16. Ji, W., et al.: Learning calibrated medical image segmentation via multi-rater agreement modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12341–12351 (2021)

  17. Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)

  18. Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In: British Machine Vision Conference (2017)

  19. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)

  20. Kleinsmith, A., Bianchi-Berthouze, N., Steed, A.: Automatic recognition of non-acted affective postures. IEEE Trans. Syst. Man Cybern. Part B 41(4), 1027–1038 (2011)

  21. Koenker, R., Hallock, K.F.: Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001)

  22. de La Torre, J., Puig, D., Valls, A.: Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recogn. Lett. 105, 144–154 (2018)

  23. Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474 (2016)

  24. Lampert, T.A., Stumpf, A., Gançarski, P.: An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Trans. Image Process. 25(6), 2557–2572 (2016)

  25. Leibig, C., Allken, V., Ayhan, M.S., Berens, P., Wahl, S.: Leveraging uncertainty information from deep neural networks for disease detection. Sci. Rep. 7(1), 1–14 (2017)

  26. Li, X., Wang, W., Hu, X., Li, J., Tang, J., Yang, J.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641 (2021)

  27. Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)

  28. Long, C., Hua, G.: Multi-class multi-annotator active learning with robust Gaussian process for visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2839–2847 (2015)

  29. Long, C., Hua, G., Kapoor, A.: Active visual recognition with expertise estimation in crowdsourcing. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3000–3007 (2013)

  30. Lovchinsky, I., et al.: Discrepancy ratio: evaluating model performance when even experts disagree on the truth. In: International Conference on Learning Representations (2019)

  31. Ma, L., Stückler, J., Kerl, C., Cremers, D.: Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 598–605. IEEE (2017)

  32. Meng, H., Kleinsmith, A., Bianchi-Berthouze, N.: Multi-score learning for affect recognition: the case of body postures. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 225–234. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_26

  33. Postels, J., Ferroni, F., Coskun, H., Navab, N., Tombari, F.: Sampling-free epistemic uncertainty estimation using approximated variance propagation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2931–2940 (2019)

  34. Rajpurkar, P., et al.: MURA: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957 (2017)

  35. Shen, Y., Zhang, Z., Sabuncu, M.R., Sun, L.: Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 707–716 (2021)

  36. Singh, A., et al.: Go-with-the-flow: tracking, analysis and sonification of movement and breathing to build confidence in activity despite chronic pain. Hum.-Comput. Interact. 31(3–4), 335–383 (2016)

  37. Surowiecki, J.: The Wisdom of Crowds. Anchor (2005)

  38. Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D.C., Silberman, N.: Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253 (2019)

  39. Wang, C., Gao, Y., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Leveraging activity recognition to enable protective behavior detection in continuous data. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 5(2) (2021)

  40. Wang, C., Olugbade, T.A., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Chronic pain protective behavior detection with deep learning. ACM Trans. Comput. Healthc. 2(3), 1–24 (2021)

  41. Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23(7), 903–921 (2004)

  42. Yan, Y., et al.: Modeling annotator expertise: Learning when everybody knows a bit of something. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939. JMLR Workshop and Conference Proceedings (2010)

  43. Yan, Y., Rosales, R., Fung, G., Subramanian, R., Dy, J.: Learning from multiple annotators with varying expertise. Mach. Learn. 95(3), 291–327 (2014)

  44. Zhang, L., et al.: Disentangling human error from the ground truth in segmentation of medical images. arXiv preprint arXiv:2007.15963 (2020)

Author information

Corresponding author

Correspondence to Chongyang Wang.


A Appendix

A.1 Datasets

Two medical datasets are used, comprising body movement sequences and bone X-rays, respectively. Please refer to the EmoPain [1] and MURA [34] datasets for more details.

A.2 Implementation Details

For experiments on the EmoPain dataset, the state-of-the-art HAR-PBD network [39] is adopted as the backbone, and Leave-One-Subject-Out validation is conducted across the participants with chronic pain (CP). The average performance across all folds is reported. The training data is augmented by adding Gaussian noise and by cropping, following [40]. The number of bins used in the general agreement distribution is set to 10, i.e., the respective softmax layer has 11 nodes, and the \(\lambda \) used in the regularization function is set to 3.0. For experiments on the MURA dataset, the DenseNet-169 network [15] pretrained on the ImageNet dataset [6] is used as the backbone. The original validation set is used as the test set, where the first view (image) of each of the 7 upper extremity types of a subject is used. All images are resized to \(224\times 224\), and images in the training set are further augmented with random lateral inversions and rotations of up to 30\(^\circ \). The number of bins is set to 5, and \(\lambda \) is set to 3.0. The number of bins (namely, n in the distribution) and \(\lambda \) were chosen via a grid search over \(n\in \{5,10,15,20,25,30\}\) and \(\lambda \in \{1.0,1.5,2.0,2.5,3.0,3.5\}\), as sketched below.
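The following is a minimal Python sketch of such a grid search. The helpers build_learn2agree_model and cross_validate are hypothetical placeholders standing in for the actual model constructor and validation loop (e.g., Leave-One-Subject-Out on EmoPain or the official MURA split); this is not the paper's released code.

```python
import itertools

# Search ranges reported in the text above.
N_BINS_GRID = [5, 10, 15, 20, 25, 30]
LAMBDA_GRID = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5]

def grid_search(build_learn2agree_model, cross_validate):
    """Return the (n_bins, lambda) pair with the best validation score.

    build_learn2agree_model(n_bins, reg_lambda) and cross_validate(model_fn)
    are hypothetical stand-ins for the model constructor and the evaluation
    loop used in the paper.
    """
    best_score, best_config = float("-inf"), None
    for n_bins, lam in itertools.product(N_BINS_GRID, LAMBDA_GRID):
        # Build and evaluate a fresh model for this (n, lambda) setting.
        score = cross_validate(
            lambda: build_learn2agree_model(n_bins=n_bins, reg_lambda=lam)
        )
        if score > best_score:
            best_score, best_config = score, (n_bins, lam)
    return best_config, best_score
```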

For all experiments, the classifier stream is implemented with a fully-connected layer that has two output nodes and a softmax activation for the binary classification task. Adam [19] is used as the optimizer with a learning rate of 1e-4, which is reduced to 1/10 of its value if the performance does not improve for 10 epochs. The number of epochs is set to 50. The logarithmic loss is adopted by default, as written in Eqs. 5 and 6, while the WKL loss (Eq. 8) is used for comparison where mentioned. For the agreement learning stream, the AR loss is used for its distributional variant, while the RMSE is used for its linear regression variant. We implement our method with the TensorFlow deep learning library on a PC with an RTX 3080 GPU and 32 GB of memory.
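Below is a minimal TensorFlow/Keras sketch of this classifier stream and optimizer setup, assuming a generic backbone model (e.g., HAR-PBD or DenseNet-169) that exposes .input and .output. The agreement learning stream and its AR/RMSE losses are omitted, so this only illustrates the training configuration, not the full Learn2Agree model.

```python
import tensorflow as tf

def build_classifier_stream(backbone: tf.keras.Model) -> tf.keras.Model:
    # Fully-connected layer with two output nodes and a softmax activation
    # for the binary classification task.
    outputs = tf.keras.layers.Dense(2, activation="softmax")(backbone.output)
    model = tf.keras.Model(inputs=backbone.input, outputs=outputs)
    model.compile(
        optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
        loss="categorical_crossentropy",  # logarithmic loss used by default
        metrics=["accuracy"],
    )
    return model

# Reduce the learning rate to 1/10 of its value if validation performance
# does not improve for 10 epochs; training runs for 50 epochs in total.
lr_schedule = tf.keras.callbacks.ReduceLROnPlateau(
    monitor="val_loss", factor=0.1, patience=10
)
# model.fit(train_ds, validation_data=val_ds, epochs=50, callbacks=[lr_schedule])
```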

A.3 Agreement Computation

For a binary task, the agreement level \(\alpha _i\) between annotators is computed as follows.

$$\begin{aligned} \alpha _i=\frac{1}{\grave{J}}\sum _{j=1}^{\grave{J}}w_i^jr_i^j, \end{aligned}$$
(10)

where \(\grave{J}\) is the number of annotators that have labelled the sample \(x_i\). In this way, \(\alpha _i\in [0,1]\) represents the agreement of annotators toward the positive class of the current binary task. In this work, we assume each sample was labelled by at least one annotator. \(w_i^j\) is the weight of the annotation provided by the j-th annotator, which could be used to reflect the different levels of expertise of annotators. The weight can be set manually given prior knowledge about the annotator, or treated as a learnable parameter estimated by the model. In this work, we treat annotators equally by setting \(w_i^j\) to 1, and leave the discussion of other settings to future work.
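For concreteness, here is a short sketch of Eq. 10 for the binary case, assuming each annotator label \(r_i^j\in \{0,1\}\) and unit weights \(w_i^j=1\) as used in this work; the function name and signature are illustrative only.

```python
import numpy as np

def agreement_level(labels, weights=None):
    """Compute alpha_i as in Eq. 10: the (weighted) mean of the binary labels
    r_i^j given by the annotators of sample x_i, so alpha_i in [0, 1] is the
    agreement toward the positive class."""
    labels = np.asarray(labels, dtype=float)   # r_i^j for the J annotators of x_i
    if weights is None:
        weights = np.ones_like(labels)         # treat annotators equally (w_i^j = 1)
    return float(np.sum(weights * labels) / len(labels))

# Example: three annotators label a sample as 1, 1, 0 -> agreement ~0.67.
print(agreement_level([1, 1, 0]))
```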

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wang, C. et al. (2023). Learn2Agree: Fitting with Multiple Annotators Without Objective Ground Truth. In: Chen, H., Luo, L. (eds) Trustworthy Machine Learning for Healthcare. TML4H 2023. Lecture Notes in Computer Science, vol 13932. Springer, Cham. https://doi.org/10.1007/978-3-031-39539-0_13

  • DOI: https://doi.org/10.1007/978-3-031-39539-0_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-39538-3

  • Online ISBN: 978-3-031-39539-0

  • eBook Packages: Computer Science (R0)
