Abstract
We have found one reason why AdaBoost tends not to perform well on gene expression data, and we identify simple modifications that improve its ability to find accurate class prediction rules. These modifications appear to be needed especially when there is a strong association between expression profiles and class designations. Cross-validation analysis of six microarray datasets with differing characteristics suggests that, suitably modified, boosting provides competitive classification accuracy in general.
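For concreteness, the following is a minimal sketch of the base procedure under discussion: AdaBoost with one-gene decision stumps on an expression matrix. It is not the paper's exact modification; the function names and the smoothing constant eps are illustrative assumptions. The smoothed weight formula is one commonly used guard against the infinite vote AdaBoost would otherwise assign to a stump with zero weighted training error, a situation that arises readily when expression profiles are strongly associated with class labels.

```python
# A sketch of AdaBoost with one-gene decision stumps -- illustrative only,
# not the authors' exact procedure. X: (n_samples, n_genes) expression
# matrix; y: labels in {-1, +1}.
import numpy as np

def best_stump(X, y, w):
    """Exhaustively pick the (gene, threshold, sign) minimizing weighted error."""
    n, d = X.shape
    best = (np.inf, 0, 0.0, 1)            # (error, gene, threshold, sign)
    for j in range(d):
        for t in np.unique(X[:, j]):
            for s in (1, -1):
                pred = np.where(X[:, j] > t, s, -s)
                err = w[pred != y].sum()
                if err < best[0]:
                    best = (err, j, t, s)
    return best

def adaboost_stumps(X, y, rounds=100, eps=1e-6):
    n = len(y)
    w = np.full(n, 1.0 / n)               # uniform initial example weights
    ensemble = []
    for _ in range(rounds):
        err, j, t, s = best_stump(X, y, w)
        # Smoothed alpha: stays finite even when err == 0, so a single
        # perfectly separating gene cannot dominate the vote.
        alpha = 0.5 * np.log((1 - err + eps) / (err + eps))
        ensemble.append((alpha, j, t, s))
        pred = np.where(X[:, j] > t, s, -s)
        w *= np.exp(-alpha * y * pred)    # upweight misclassified samples
        w /= w.sum()
    return ensemble

def predict(ensemble, X):
    score = sum(a * np.where(X[:, j] > t, s, -s) for a, j, t, s in ensemble)
    return np.sign(score)
```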
Sometimes the goal in a microarray analysis is to find a class prediction rule that is not only accurate, but that depends on the expression levels of only a few genes. Because boosting seeks out genes that provide complementary evidence about the correct classification of a tissue sample, it appears especially well suited to such gene-efficient class prediction. This is particularly true when there is a strong association between expression profiles and class designations, as is often the case, for example, when comparing tumor and normal samples.
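One way to see why stump-based boosting lends itself to gene-efficient prediction: each stump consults a single gene, so the voting rule depends on exactly the distinct genes appearing in the ensemble, and that set can be enumerated directly. Continuing the hypothetical sketch above:

```python
def genes_used(ensemble):
    """Distinct genes the voting rule actually consults."""
    return sorted({j for _, j, _, _ in ensemble})
```

By construction, len(genes_used(ensemble)) is at most the number of boosting rounds, and it is typically smaller when later rounds revisit informative genes.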
Cite this article
Long, P.M., Vega, V.B. Boosting and Microarray Data. Machine Learning 52, 31–44 (2003). https://doi.org/10.1023/A:1023937123600