Abstract
As semi-supervised classification draws more attention, many practical semi-supervised learning methods have been proposed. However, one important issue has been ignored by the current literature: how to estimate the exact number of labelled samples needed, given many unlabelled samples. Such an estimation method is important because labelled examples are rare and expensive to obtain, and it is also crucial for exploring the relative value of labelled and unlabelled samples for a specific model. Assuming a latent Gaussian distribution over the domain, we describe a method to estimate the number of labels required in a dataset for a semi-supervised linear discriminant classifier (Transductive LDA) to reach a desired accuracy. Our technique extends naturally to two harder problems: learning from Gaussian distributions with different covariances (QDA) and learning with multiple classes (MDA). The method is evaluated on two datasets, a toy dataset and a real-world wine dataset. The results of this research can be applied in areas such as text mining, information retrieval, and bioinformatics.
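The quantity underlying the abstract, the Bayes risk of a two-class Gaussian problem with a shared covariance, and the question of how many labels a semi-supervised LDA needs before its error approaches that risk, can be illustrated with a small simulation. The sketch below is not the paper's estimation method; the model parameters (mu0, mu1, Sigma), the tolerance eps, and the simple EM routine semi_supervised_lda are all illustrative assumptions.

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

# Ground-truth two-class Gaussian model with a shared covariance (assumed for illustration).
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])

def bayes_risk(mu0, mu1, Sigma):
    # Bayes error for equal-prior Gaussians with a common covariance: Phi(-Delta/2).
    delta = np.sqrt((mu1 - mu0) @ np.linalg.solve(Sigma, mu1 - mu0))
    return norm.cdf(-delta / 2.0)

def sample(n):
    y = rng.integers(0, 2, n)
    X = np.where(y[:, None] == 0,
                 rng.multivariate_normal(mu0, Sigma, n),
                 rng.multivariate_normal(mu1, Sigma, n))
    return X, y

def semi_supervised_lda(X_lab, y_lab, X_unlab, n_iter=30):
    # Toy semi-supervised LDA: EM over the unlabelled points with a pooled covariance.
    means = np.array([X_lab[y_lab == k].mean(axis=0) for k in (0, 1)])
    cov = np.cov(X_lab.T) + 1e-6 * np.eye(X_lab.shape[1])
    X_all = np.vstack([X_lab, X_unlab])
    resp_lab = np.eye(2)[y_lab]                      # labelled points keep hard labels
    for _ in range(n_iter):
        inv = np.linalg.inv(cov)
        # E-step: class responsibilities for the unlabelled points (equal priors).
        d = np.stack([np.einsum('ij,jk,ik->i', X_unlab - m, inv, X_unlab - m)
                      for m in means], axis=1)
        logits = -0.5 * d
        logits -= logits.max(axis=1, keepdims=True)
        resp_unlab = np.exp(logits)
        resp_unlab /= resp_unlab.sum(axis=1, keepdims=True)
        resp = np.vstack([resp_lab, resp_unlab])
        # M-step: re-estimate the class means and the shared covariance.
        means = (resp.T @ X_all) / resp.sum(axis=0)[:, None]
        diffs = X_all[:, None, :] - means[None, :, :]
        cov = np.einsum('nk,nki,nkj->ij', resp, diffs, diffs) / len(X_all)
    return means, cov

def error_rate(means, cov, X, y):
    # LDA rule with equal priors: assign to the class with smaller Mahalanobis distance.
    inv = np.linalg.inv(cov)
    d = np.stack([np.einsum('ij,jk,ik->i', X - m, inv, X - m) for m in means], axis=1)
    return np.mean(d.argmin(axis=1) != y)

risk = bayes_risk(mu0, mu1, Sigma)
X_unlab, _ = sample(500)
X_test, y_test = sample(5000)
eps = 0.02                                            # tolerated excess over the Bayes risk (assumed)
for n_lab in (4, 8, 16, 32, 64, 128):
    X_lab, y_lab = sample(n_lab)
    if len(set(y_lab)) < 2:
        continue                                      # need at least one label per class
    means, cov = semi_supervised_lda(X_lab, y_lab, X_unlab)
    err = error_rate(means, cov, X_test, y_test)
    print(f"n_lab={n_lab:4d}  error={err:.3f}  Bayes risk={risk:.3f}")
    if err <= risk + eps:
        print(f"-> roughly {n_lab} labels suffice for the desired accuracy")
        break

Under equal priors and a common covariance, the Bayes risk has the closed form Phi(-Delta/2), where Delta is the Mahalanobis distance between the class means; the loop simply reports the smallest labelled sample size whose test error falls within the chosen tolerance of that risk.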
Keywords
- Linear Discriminant Analysis
- Unlabelled Data
- Quadratic Discriminant Analysis
- Classification Error Rate
- Multiple Discriminant Analysis
Copyright information
© 2004 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Liu, H., Yuan, X., Tang, Q., Kustra, R. (2004). An Efficient Method to Estimate Labelled Sample Size for Transductive LDA(QDA/MDA) Based on Bayes Risk. In: Boulicaut, JF., Esposito, F., Giannotti, F., Pedreschi, D. (eds) Machine Learning: ECML 2004. Lecture Notes in Computer Science, vol. 3201. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-30115-8_27
DOI: https://doi.org/10.1007/978-3-540-30115-8_27
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-23105-9
Online ISBN: 978-3-540-30115-8
eBook Packages: Springer Book Archive