Abstract
Annotation quality measurement is crucial when building supervised datasets, whether for general-purpose research or for domain applications. Inter-rater agreement is one of the most important indicators of annotation quality. Traditional inter-rater agreement measures, however, do not extend to the multi-label scenario. To adapt to multi-label annotations, recent research has developed a bootstrapping method that measures the level of agreement between two raters. In this paper we propose MLA, a fine-grained multi-label agreement measure that aims to uncover subtle differences in inter-rater agreement across annotations when multiple raters are involved. We demonstrate its compatibility with traditional measures both mathematically and experimentally. The experimental results show that MLA interprets agreement more accurately and more consistently with intuition. In addition, we provide a toolset that generates multi-label annotations mimicking different annotators and calculates various agreement coefficients for several scenarios.
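The MLA formula itself is defined in the body of the paper, not in this abstract. As a rough illustration of the kind of computation a multi-label agreement measure performs, below is a minimal Python sketch that averages a Jaccard-style set overlap over all rater pairs for a single item; the overlap function and all names are assumptions made for illustration, not the authors' MLA definition.

import itertools

def jaccard_agreement(a, b):
    # Set-overlap (Jaccard) agreement between two multi-label annotations,
    # each given as a set of labels.
    if not a and not b:
        return 1.0  # both raters assigned no labels: treat as full agreement
    return len(a & b) / len(a | b)

def mean_pairwise_agreement(annotations):
    # Average observed agreement over all rater pairs for one item;
    # `annotations` is a list of label sets, one per rater.
    pairs = list(itertools.combinations(annotations, 2))
    return sum(jaccard_agreement(a, b) for a, b in pairs) / len(pairs)

# Three hypothetical raters labelling one item from the label set {A, B, C}.
raters = [{"A", "B"}, {"A"}, {"A", "B", "C"}]
print(mean_pairwise_agreement(raters))  # 0.5

A chance-corrected coefficient such as kappa would additionally subtract the agreement expected by chance; the sketch above shows only the observed-agreement part.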
Acknowledgements
We thank all the anonymous reviewers and chairs for their meaningful suggestions.
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Li, S. et al. (2023). Annotation Quality Measurement in Multi-Label Annotations. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_3
DOI: https://doi.org/10.1007/978-3-031-44696-2_3
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2
eBook Packages: Computer Science; Computer Science (R0)