Abstract
Datasets that are subjectively labeled by several experts are becoming more common in tasks such as biological text annotation, where class definitions are necessarily somewhat subjective. Standard classification and regression models are not suited to multiple labels per instance, so a pre-processing step (normally assigning the majority class) is typically performed first. We propose Bayesian models for classification and ordinal regression that naturally incorporate multiple expert opinions in defining predictive distributions. The models make use of Gaussian process priors, resulting in great flexibility and particular suitability to text-based problems, where the number of covariates can be far greater than the number of data instances. We show that using all labels rather than just the majority improves performance on a recent biological dataset.
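As a rough illustration of the pre-processing step mentioned above, the following sketch contrasts collapsing each item's expert labels to the majority class with keeping the full set of votes as per-class proportions. It is not the Bayesian model proposed in the article; the function names, the tie-breaking rule, and the toy annotations are all assumptions made for this example.

```python
from collections import Counter


def majority_label(labels):
    """Collapse one item's expert labels to the most frequent label."""
    counts = Counter(labels)
    best = max(counts.values())
    # Assumption: ties are broken by taking the smallest label in sort order.
    return min(label for label, c in counts.items() if c == best)


def label_proportions(labels, classes):
    """Keep all expert votes as the fraction of experts choosing each class."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]


# Hypothetical annotations: three experts label each of three sentences.
annotations = [
    ["pos", "pos", "pos"],  # full agreement
    ["pos", "neg", "pos"],  # one dissenting expert
    ["neg", "pos", "neg"],  # one dissenting expert
]
classes = ["neg", "pos"]

majority = [majority_label(a) for a in annotations]
proportions = [label_proportions(a, classes) for a in annotations]
```

After majority voting, the first two items carry the same label even though the experts agreed fully on only one of them; the proportions retain that difference, which is the kind of information a model using all labels can draw on.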
Cite this article
Rogers, S., Girolami, M. & Polajnar, T. Semi-parametric analysis of multi-rater data. Stat Comput 20, 317–334 (2010). https://doi.org/10.1007/s11222-009-9125-z
DOI: https://doi.org/10.1007/s11222-009-9125-z