Abstract
Annotations from domain experts are important for medical applications where an objective ground truth is difficult to define, e.g., rehabilitation for some chronic diseases, and prescreening of some musculoskeletal abnormalities without further medical examinations. However, improper use of such annotations may hinder the development of reliable models. On one hand, forcing the use of a single ground truth generated from multiple annotations discards information useful for modeling. On the other hand, feeding the model all the annotations without proper regularization introduces noise, given the existing disagreements. To address these issues, we propose a novel Learning to Agree (Learn2Agree) framework to tackle the challenge of learning from multiple annotators without objective ground truth. The framework has two streams: one stream fits the multiple annotators, while the other learns agreement information between annotators. In particular, the agreement learning stream provides regularization information to the classifier stream, tuning its decisions to be better in line with the agreement between annotators. The proposed method can be easily added to existing backbones; experiments on two medical datasets show that it achieves better agreement levels with annotators.
References
Aung, M.S., et al.: The automatic detection of chronic pain-related expression: requirements, challenges and the multimodal emopain dataset. IEEE Trans. Affect. Comput. 7(4), 435–451 (2015)
Charpentier, B., Zügner, D., Günnemann, S.: Posterior network: uncertainty estimation without OOD samples via density-based pseudo-counts. arXiv preprint arXiv:2006.09239 (2020)
Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Measur. 20(1), 37–46 (1960)
Cohen, J.: Weighted kappa: nominal scale agreement provision for scaled disagreement or partial credit. Psychol. Bull. 70(4), 213 (1968)
Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9268–9277 (2019)
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: a large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
Fan, C., et al.: Multi-horizon time series forecasting with temporal attention learning. In: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2527–2535 (2019)
Felipe, S., Singh, A., Bradley, C., Williams, A.C., Bianchi-Berthouze, N.: Roles for personal informatics in chronic pain. In: 2015 9th International Conference on Pervasive Computing Technologies for Healthcare, pp. 161–168. IEEE (2015)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
Guan, M., Gulshan, V., Dai, A., Hinton, G.: Who said what: modeling individual labelers improves classification. In: Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32 (2018)
Hao, L., Naiman, D.Q.: Quantile Regression. Sage (2007)
Healey, J.: Recording affect in the field: towards methods and metrics for improving ground truth labels. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 107–116. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_14
Hu, N., Englebienne, G., Lou, Z., Kröse, B.: Learning to recognize human activities using soft labels. IEEE Trans. Pattern Anal. Mach. Intell. 39(10), 1973–1984 (2016)
Hu, P., Sclaroff, S., Saenko, K.: Uncertainty-aware learning for zero-shot semantic segmentation. Adv. Neural Inf. Process. Syst. 33, 21713–21724 (2020)
Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q.: Densely connected convolutional networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4700–4708 (2017)
Ji, W., et al.: Learning calibrated medical image segmentation via multi-rater agreement modeling. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 12341–12351 (2021)
Karimi, D., Dou, H., Warfield, S.K., Gholipour, A.: Deep learning with noisy labels: exploring techniques and remedies in medical image analysis. Med. Image Anal. 65, 101759 (2020)
Kendall, A., Badrinarayanan, V., Cipolla, R.: Bayesian SegNet: model uncertainty in deep convolutional encoder-decoder architectures for scene understanding. In: British Machine Vision Conference (2017)
Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
Kleinsmith, A., Bianchi-Berthouze, N., Steed, A.: Automatic recognition of non-acted affective postures. IEEE Trans. Syst. Man Cybern. Part B 41(4), 1027–1038 (2011)
Koenker, R., Hallock, K.F.: Quantile regression. J. Econ. Perspect. 15(4), 143–156 (2001)
de La Torre, J., Puig, D., Valls, A.: Weighted kappa loss function for multi-class classification of ordinal data in deep learning. Pattern Recogn. Lett. 105, 144–154 (2018)
Lakshminarayanan, B., Pritzel, A., Blundell, C.: Simple and scalable predictive uncertainty estimation using deep ensembles. arXiv preprint arXiv:1612.01474 (2016)
Lampert, T.A., Stumpf, A., Gançarski, P.: An empirical study into annotator agreement, ground truth estimation, and algorithm evaluation. IEEE Trans. Image Process. 25(6), 2557–2572 (2016)
Leibig, C., Allken, V., Ayhan, M.S., Berens, P., Wahl, S.: Leveraging uncertainty information from deep neural networks for disease detection. Sci. Rep. 7(1), 1–14 (2017)
Li, X., Wang, W., Hu, X., Li, J., Tang, J., Yang, J.: Generalized focal loss V2: learning reliable localization quality estimation for dense object detection. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11632–11641 (2021)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2980–2988 (2017)
Long, C., Hua, G.: Multi-class multi-annotator active learning with robust gaussian process for visual recognition. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 2839–2847 (2015)
Long, C., Hua, G., Kapoor, A.: Active visual recognition with expertise estimation in crowdsourcing. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 3000–3007 (2013)
Lovchinsky, I., et al.: Discrepancy ratio: evaluating model performance when even experts disagree on the truth. In: International Conference on Learning Representations (2019)
Ma, L., Stückler, J., Kerl, C., Cremers, D.: Multi-view deep learning for consistent semantic mapping with RGB-D cameras. In: 2017 IEEE/RSJ International Conference on Intelligent Robots and Systems, pp. 598–605. IEEE (2017)
Meng, H., Kleinsmith, A., Bianchi-Berthouze, N.: Multi-score learning for affect recognition: the case of body postures. In: D’Mello, S., Graesser, A., Schuller, B., Martin, J.-C. (eds.) ACII 2011. LNCS, vol. 6974, pp. 225–234. Springer, Heidelberg (2011). https://doi.org/10.1007/978-3-642-24600-5_26
Postels, J., Ferroni, F., Coskun, H., Navab, N., Tombari, F.: Sampling-free epistemic uncertainty estimation using approximated variance propagation. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 2931–2940 (2019)
Rajpurkar, P., et al.: MURA: large dataset for abnormality detection in musculoskeletal radiographs. arXiv preprint arXiv:1712.06957 (2017)
Shen, Y., Zhang, Z., Sabuncu, M.R., Sun, L.: Real-time uncertainty estimation in computer vision via uncertainty-aware distribution distillation. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 707–716 (2021)
Singh, A., et al.: Go-with-the-flow: tracking, analysis and sonification of movement and breathing to build confidence in activity despite chronic pain. Hum.-Comput. Interact. 31(3–4), 335–383 (2016)
Surowiecki, J.: The Wisdom of Crowds. Anchor (2005)
Tanno, R., Saeedi, A., Sankaranarayanan, S., Alexander, D.C., Silberman, N.: Learning from noisy labels by regularized estimation of annotator confusion. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11244–11253 (2019)
Wang, C., Gao, Y., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Leveraging activity recognition to enable protective behavior detection in continuous data. Proc. ACM Interact. Mob. Wearable Ubiquit. Technol. 5(2) (2021)
Wang, C., Olugbade, T.A., Mathur, A., Williams, A.C.D.C., Lane, N.D., Bianchi-Berthouze, N.: Chronic pain protective behavior detection with deep learning. ACM Trans. Comput. Healthc. 2(3), 1–24 (2021)
Warfield, S.K., Zou, K.H., Wells, W.M.: Simultaneous truth and performance level estimation (STAPLE): an algorithm for the validation of image segmentation. IEEE Trans. Med. Imaging 23(7), 903–921 (2004)
Yan, Y., et al.: Modeling annotator expertise: learning when everybody knows a bit of something. In: Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 932–939. JMLR Workshop and Conference Proceedings (2010)
Yan, Y., Rosales, R., Fung, G., Subramanian, R., Dy, J.: Learning from multiple annotators with varying expertise. Mach. Learn. 95(3), 291–327 (2014)
Zhang, L., et al.: Disentangling human error from the ground truth in segmentation of medical images. arXiv preprint arXiv:2007.15963 (2020)
A Appendix
A.1 Datasets
Two medical datasets are used, comprising body movement sequences and bone X-rays, respectively. Refer to the EmoPain [1] and MURA [34] datasets for more details.
A.2 Implementation Details
For experiments on the EmoPain dataset, the state-of-the-art HAR-PBD network [39] is adopted as the backbone, and Leave-One-Subject-Out validation is conducted across the participants with chronic pain (CP). The average performance across all folds is reported. The training data is augmented by adding Gaussian noise and cropping, as in [40]. The number of bins used in the general agreement distribution is set to 10, i.e., the respective softmax layer has 11 nodes. The \(\lambda \) used in the regularization function is set to 3.0. For experiments on the MURA dataset, the DenseNet-169 network [15] pretrained on the ImageNet dataset [6] is used as the backbone. The original validation set is used as the testing set, where the first view (image) of each of the 7 upper-extremity types of a subject is used. All images are resized to \(224\times 224\), and images in the training set are further augmented with random lateral inversions and rotations of up to 30\(^\circ \). The number of bins is set to 5, and \(\lambda \) is again set to 3.0. The number of bins (namely, n in the distribution) and \(\lambda \) were chosen by a grid search over \(n\in \{5,10,15,20,25,30\}\) and \(\lambda \in \{1.0,1.5,2.0,2.5,3.0,3.5\}\).
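To illustrate how a continuous agreement level can serve as a target for an \((n+1)\)-node softmax layer, the sketch below splits the probability mass of \(\alpha \) between its two neighbouring anchors. This two-anchor soft assignment is one common choice for such discrete distribution representations (cf. [26]); it is an assumption here, not necessarily the paper's exact construction.

```python
def agreement_to_distribution(alpha, n):
    """Represent a continuous agreement level alpha in [0, 1] as a
    discrete distribution over n + 1 evenly spaced anchors (hence a
    softmax layer with n + 1 nodes when n = 10, as in the paper).

    NOTE: this two-anchor assignment is an illustrative assumption;
    the paper defines its own target construction in the main text.
    """
    anchors = [k / n for k in range(n + 1)]
    dist = [0.0] * (n + 1)
    # locate the bin [anchors[k], anchors[k + 1]] containing alpha
    k = min(int(alpha * n), n - 1)
    left, right = anchors[k], anchors[k + 1]
    # split the probability mass linearly between the two neighbours,
    # so that the distribution's expectation recovers alpha exactly
    dist[k] = (right - alpha) / (right - left)
    dist[k + 1] = (alpha - left) / (right - left)
    return dist
```

A useful property of this representation is that the expectation over the anchors reproduces the original agreement level, so no information is lost by discretizing.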
For all experiments, the classifier stream is implemented as a fully-connected layer with a softmax activation and two output nodes for the binary classification task. Adam [19] is used as the optimizer with a learning rate of 1e-4, which is reduced by a factor of 10 if the performance does not improve for 10 epochs. The number of epochs is set to 50. The logarithmic loss is adopted by default, as written in Eqs. 5 and 6, while the WKL loss (Eq. 8) is used for comparison where mentioned. For the agreement learning stream, the AR loss is used for its distributional variant, while the RMSE is used for its linear regression variant. We implement our method with the TensorFlow deep learning library on a PC with an RTX 3080 GPU and 32 GB of memory.
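The learning-rate schedule described above (reduce by a factor of 10 when performance plateaus for 10 epochs) matches the behaviour of Keras' `ReduceLROnPlateau` callback; the class below is a simplified pure-Python sketch of that logic, not the actual training code.

```python
class PlateauLRSchedule:
    """Minimal sketch of the schedule used above: the learning rate
    starts at 1e-4 and is multiplied by `factor` whenever the
    monitored validation loss has not improved for `patience` epochs.
    Keras' ReduceLROnPlateau callback provides the same behaviour."""

    def __init__(self, lr=1e-4, factor=0.1, patience=10):
        self.lr, self.factor, self.patience = lr, factor, patience
        self.best = float("inf")  # best validation loss seen so far
        self.wait = 0             # epochs since the last improvement

    def step(self, val_loss):
        """Call once per epoch; returns the learning rate to use."""
        if val_loss < self.best:
            self.best = val_loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:
                self.lr *= self.factor
                self.wait = 0
        return self.lr
```

For example, with the defaults, one improving epoch followed by ten stagnant epochs drops the rate from 1e-4 to 1e-5.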
A.3 Agreement Computation
For a binary task, the agreement level \(\alpha _i\) between annotators on a sample \(x_i\) is computed as the weighted average of the binary labels \(y_i^j\in \{0,1\}\) they provided:

$$\alpha _i = \frac{\sum _{j=1}^{\grave{J}} w_i^j\,y_i^j}{\sum _{j=1}^{\grave{J}} w_i^j},$$

where \(\grave{J}\) is the number of annotators that have labelled the sample \(x_i\). In this way, \(\alpha _i\in [0,1]\) stands for the agreement of annotators toward the positive class of the current binary task. We assume each sample was labelled by at least one annotator. \(w_i^j\) is the weight of the annotation provided by the j-th annotator, which could reflect the different levels of expertise of annotators. The weight can be set manually given prior knowledge about an annotator, or treated as a learnable parameter for the model to estimate. In this work, we treat annotators equally by setting \(w_i^j\) to 1, and leave the discussion of other settings to future work.
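The computation above is a weighted fraction of positive votes; a minimal sketch follows, with the function name chosen for illustration.

```python
def agreement_level(labels, weights=None):
    """Agreement of annotators toward the positive class of a binary
    task: the weighted average of their binary labels.

    labels:  binary labels {0, 1} from the annotators of one sample
             (at least one annotator per sample is assumed).
    weights: optional per-annotator weights; defaults to all 1,
             i.e. equal expertise, as in the paper.
    """
    if weights is None:
        weights = [1.0] * len(labels)
    return sum(w * y for w, y in zip(weights, labels)) / sum(weights)
```

For instance, two positive votes out of three equally weighted annotators give an agreement level of 2/3, while up-weighting a trusted annotator shifts the level toward that annotator's label.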
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Wang, C. et al. (2023). Learn2Agree: Fitting with Multiple Annotators Without Objective Ground Truth. In: Chen, H., Luo, L. (eds) Trustworthy Machine Learning for Healthcare. TML4H 2023. Lecture Notes in Computer Science, vol 13932. Springer, Cham. https://doi.org/10.1007/978-3-031-39539-0_13
Print ISBN: 978-3-031-39538-3
Online ISBN: 978-3-031-39539-0