Abstract
This paper describes and experimentally analyses a new dimension reduction method for microarray data. Microarrays, which allow simultaneous measurement of the expression level of thousands of genes in a given situation (tissue, cell or time), produce data that pose particular machine-learning problems. The disproportion between the number of attributes (tens of thousands) and the number of examples (hundreds) makes dimension reduction necessary. While gene/class mutual information is often used to filter genes, we propose an approach that takes gene-pair/class information into account. We propose a gene selection heuristic based on this principle, together with an automatic feature-construction procedure that forces the learning algorithms to make use of these gene pairs. We report significant improvements in accuracy on several public microarray databases.
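To illustrate why gene-pair/class information can reveal structure that single-gene filtering misses, here is a minimal sketch. It is not the heuristic from the paper, whose details the abstract does not give. It assumes expression levels have already been discretized, and the function names (`mutual_information`, `pair_mutual_information`, `rank_gene_pairs`, `construct_pair_feature`) and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y), in bits, of two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def construct_pair_feature(g1, g2):
    """Assumed construction: merge two discretized genes into one joint attribute."""
    return list(zip(g1, g2))


def pair_mutual_information(g1, g2, labels):
    """I((G1, G2); C): mutual information between the joint attribute and the class."""
    return mutual_information(construct_pair_feature(g1, g2), labels)


def rank_gene_pairs(genes, labels):
    """Score every gene pair by pair/class mutual information, best first."""
    scores = {
        (a, b): pair_mutual_information(genes[a], genes[b], labels)
        for a, b in combinations(sorted(genes), 2)
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data (invented): the class is the XOR of g1 and g2; g3 is unrelated.
labels = [0, 1, 1, 0, 0, 1, 1, 0]
genes = {
    "g1": [0, 0, 1, 1, 0, 0, 1, 1],
    "g2": [0, 1, 0, 1, 0, 1, 0, 1],
    "g3": [0, 1, 0, 0, 1, 1, 0, 1],
}

single_scores = {name: mutual_information(values, labels) for name, values in genes.items()}
pair_ranking = rank_gene_pairs(genes, labels)
best_pair, best_score = pair_ranking[0]
```

In this toy setting every gene taken alone carries no information about the class, so a single-gene filter would discard all three, whereas the pair (g1, g2) determines the class completely. An exhaustive scan over pairs as in `rank_gene_pairs` is quadratic in the number of genes, so with tens of thousands of genes some pruning heuristic would be needed in practice.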
Copyright information
© 2005 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Hanczar, B. (2005). Combining Feature Selection and Feature Construction to Improve Concept Learning for High Dimensional Data. In: Zucker, JD., Saitta, L. (eds) Abstraction, Reformulation and Approximation. SARA 2005. Lecture Notes in Computer Science, vol 3607. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11527862_19
DOI: https://doi.org/10.1007/11527862_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-27872-6
Online ISBN: 978-3-540-31882-8
eBook Packages: Computer Science, Computer Science (R0)