Feature Selection for Genomic Signal Processing: Unsupervised, Supervised, and Self-Supervised Scenarios

Kung, S. Y.; Luo, Yuhui; Mak, Man-Wai

doi:10.1007/s11265-008-0273-8

Published: 09 October 2008

Volume 61, pages 3–20 (2010)

Journal of Signal Processing Systems

S. Y. Kung¹,², Yuhui Luo¹ & Man-Wai Mak³

Abstract

The effectiveness of a data mining system depends on the representation of its pattern vectors. In many bioinformatic applications, data are represented as vectors of extremely high dimension, which motivates research on feature selection. The literature contains many reports on feature selection methods. In terms of training data types, they are divided into unsupervised and supervised categories; in terms of selection methods, they fall into filter and wrapper categories. This paper provides a brief overview of state-of-the-art feature selection methods in all these categories and highlights sample applications of these methods to genomic signal processing. The paper also describes a notion of self-supervision. A special method called vector index adaptive SVM (VIA-SVM) is described for selecting features under the self-supervision scenario. Furthermore, the paper makes use of a more powerful symmetric doubly supervised formulation, for which VIA-SVM is particularly useful. Based on several subcellular localization experiments and microarray time-course experiments, the VIA-SVM algorithm, when combined with some filter-type metrics, appears to deliver a substantial dimension reduction (one order of magnitude) with only a slight degradation in accuracy.

Notes

For such a huge dimensionality, a preliminary signal-to-noise ratio (SNR)-based filtering method can be applied to weed out those k-mer patterns (i.e., columns) that fall below a certain low threshold.
In addition to the SNR-type filter and SVM-RFE, there exists an extremely large number of application studies based on microarray data. Two recent ones are MRMR [38] and the Markov blanket [39], which are based on multivariate techniques. Another recent approach is VIA-SVM [40], which is more amenable to the self-supervised scenario explained in Section 6.
Downloadable from the official site http://www.genome.wi.mit.edu/mpr
Here, we use boldface to represent both vectorial data, such as gene expression profiles, and non-vectorial data, such as sequences.
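The SNR-based pre-filtering mentioned in the first note can be illustrated with a minimal sketch. It assumes a two-class problem and one common definition of the per-feature SNR, |μ₊ − μ₋| / (σ₊ + σ₋), with population standard deviations; the exact formula and threshold used in the paper are not given in this excerpt. The function names `snr_scores` and `snr_filter`, the toy data, and the threshold value are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def snr_scores(X, y):
    """Score each column of X by |mu_pos - mu_neg| / (sigma_pos + sigma_neg).

    X is a list of rows (samples); y holds labels, with 1 marking the
    positive class. This SNR definition is an assumption, not taken
    from the paper.
    """
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label != 1]
    scores = []
    for j in range(len(X[0])):
        col_pos = [row[j] for row in pos]
        col_neg = [row[j] for row in neg]
        gap = abs(mean(col_pos) - mean(col_neg))
        spread = pstdev(col_pos) + pstdev(col_neg)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # Zero spread in both classes: perfectly separating or uninformative.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores

def snr_filter(X, y, threshold):
    """Return the indices of columns whose SNR is at least `threshold`."""
    return [j for j, s in enumerate(snr_scores(X, y)) if s >= threshold]

# Toy data: 4 samples, 3 features; only the first feature separates the classes.
X = [[5.0, 1.0, 0.2],
     [6.0, 1.2, 0.1],
     [1.0, 1.1, 0.3],
     [2.0, 0.9, 0.2]]
y = [1, 1, 0, 0]

kept = snr_filter(X, y, threshold=2.0)
print(kept)
```

In practice the threshold would be set low, so that this step removes only clearly uninformative columns before a more expensive selection method is applied.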

References

Guo, J., Mak, M. W., & Kung, S. Y. (2006). Eukaryotic protein subcellular localization based on local pairwise profile alignment SVM. In 2006 IEEE international workshop on machine learning for signal processing (MLSP’06) (pp. 391–396).
Reinhardt, A., & Hubbard, T. (1998). Using neural networks for prediction of the subcellular location of proteins. Nucleic Acids Research, 26, 2230–2236.
Pavlidis, P., Weston, J., Cai, J., & Grundy, W. N. (2001). Gene functional classification from heterogeneous data. In Int. conf. on computational biology (pp. 249–255). Pittsburgh, PA.
Leslie, C., Eskin, E., & Noble, W. S. (2002). The spectrum kernel: A string kernel for SVM protein classification. In Altman, R. B., Dunker, A. K., Hunter, L., Lauderdale, K., & Klein, T. E. (Eds.) Proc. of the pacific symposium on biocomputing. River Edge: World Scientific.
Leslie, C. S., Eskin, E., Cohen, A., Weston, J., & Noble, W. S. (2004). Mismatch string kernels for discriminative protein classification. Bioinformatics, 20(4), 467–476.
Ben-Hur, A., & Brutlag, D. (2004). Sequence motifs: Highly predictive features of protein function. Neural Information Processing Systems 2004.
Kuang, R., Ie, E., Wang, K., Wang, K., Siddiqi, M., Freund, Y., & Leslie, C. (2004). Profile-based string kernels for remote homology detection and motif extraction. Computational Systems Bioinformatics Conference, 2004. CSB 2004. Proceedings. 2004 IEEE (pp. 152–160).
Gao, Q., & Wang, Z. (2006). Feature subset selection for protein subcellular localization prediction. Lecture Notes in Computer Science, (Vol. 4115, p. 433).
Su, Y., Murali, T. M., Pavlovic, V., Schaffer, M., & Kasif, S. (2003). RankGene: Identification of diagnostic genes based on expression data (vol. 19). Oxford: Oxford University Press.
Kung, S. Y., & Mak, M. W. (2008). Feature selection for self-supervised classification with applications to microarray and sequence data. IEEE Journal of Selected Topics in Signal Processing: Special Issue on Genomic and Proteomic Signal Processing, 2, 297–309.
Huang, C., Lin, C., & Pal, N. (2003). Hierarchical learning architecture with automatic feature selection for multiclass protein fold classification. NanoBioscience, IEEE Transactions on, 2, 221–232.
Kohavi, R., & John, G. H. (1997). Wrappers for feature selection. Artificial Intelligence, 97(1–2), 273–324.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537.
Kudo, M., & Sklansky, J. (2000). Comparison of algorithms that select features for pattern classifiers. Pattern Recognition, 33(1), 25–41.
Simon, R. (2003). Diagnostic and prognostic prediction using gene expression profiles in high-dimensional microarray data. British Journal of Cancer, 89(9), 1599–1604.
Hastie, T., Tibshirani, R., Eisen, M., Alizadeh, A., Levy, R., Staudt, L., et al. (2000). ’Gene shaving’ as a method for identifying distinct sets of genes with similar expression patterns. Genome Biology, 1(2), research0003.1–research0003.21.
Ding, C. (2003). Unsupervised feature selection via two-way ordering in gene expression analysis. Bioinformatics, 19(10), 1259–1266.
Varshavsky, R., Gottlieb, A., Linial, M., & Horn, D. (2006). Novel unsupervised feature filtering of biological data. Bioinformatics, 22(14), e507–e513.
Golub, G. H., & Van Loan, C. F. (1996). Matrix computations. Baltimore: Johns Hopkins University Press.
Steinbach, M., Ertöz, L., & Kumar, V. (2003). The challenges of clustering high dimensional data. In: New vistas in statistical physics: Applications in econophysics, bioinformatics, and pattern recognition. New York: Springer.
Guyon, I., Elisseeff, A., & Kaelbling, L. (2003). An introduction to variable and feature selection. Journal of Machine Learning Research, 3(7–8), 1157–1182.
Tamayo, P., Slonim, D., Mesirov, J., Zhu, Q., Kitareewan, S., Dmitrovsky, E., et al. (1999). Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation. Proceedings of the National Academy of Sciences, 96, 2907–2912, Mar.
Kohane, I. S., Kho, A. T., & Butte, A. J. (2003). Microarrays for an integrative genomics. Cambridge: MIT.
Xing, E., & Karp, R. (2001). CLIFF: Clustering of high-dimensional microarray data via iterative feature filtering using normalized cuts. Bioinformatics, 17(90001), 306–315.
Roth, V., & Lange, T. (2004). Bayesian class discovery in microarray datasets. Biomedical Engineering, IEEE Transactions on, 51(5), 707–718.
Niijima, S., & Okuno, Y. (2008). Laplacian linear discriminant analysis approach to unsupervised feature selection. IEEE/ACM Transactions on Computational Biology and Bioinformatics, 10, 20 Oct. doi:10.1109/TCBB.2007.70257.
He, X., Cai, D., & Niyogi, P. (2005). Laplacian score for feature selection. Advances in Neural Information Processing Systems, 18, 507–514.
Wolf, L., & Shashua, A. (2005). Feature selection for unsupervised and supervised inference: The emergence of sparsity in a weight-based approach. The Journal of Machine Learning Research, 6, 1855–1887.
Li, H., Jiang, T., & Zhang, K. (2006). Efficient and robust feature extraction by maximum margin criterion. Neural Networks, IEEE Transactions on, 17, 157–165.
Fukunaga, K. (1990). Introduction to statistical pattern recognition. London: Academic.
Alon, U., Barkai, N., Notterman, D., Gish, K., Ybarra, S., Mack, D., et al. (1999). Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proceedings of the National Academy of Sciences of the United States of America, 96(12), 6745.
Armstrong, S., Staunton, J., Silverman, L., Pieters, R., den Boer, M., Minden, M., et al. (2002). MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics, 30(1), 41–47.
Fauquet, C., Desbois, D., Fargette, D., & Vidal, G. (1988). Classification of furoviruses based on the amino acid composition of their coat proteins. Viruses with fungal vectors (pp. 19–38). Wellesbourne: Association of Applied Biologists.
Pomeroy, S., Tamayo, P., Gaasenbeek, M., Sturla, L., Angelo, M., McLaughlin, M., et al. (2002). Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415(6870), 436–442.
van ’t Veer, L. J., Dai, H., van de Vijver, M. J., He, Y. D., Hart, A. A. M., Mao, M., et al. (2002). Gene expression profiling predicts clinical outcome of breast cancer. Nature, 415, 530–536.
Beer, D. G., Kardia, S. L., Huang, C.-C., Giordano, T. J., Levin, A. M., Misek, D. E., et al. (2002). Gene-expression profiles predict survival of patients with lung adenocarcinoma. Nature Medicine, 8, 816–824.
Khan, J., Wei, J. S., Ringner, M., Saal, L. H., Ladanyi, M., Westermann, F., et al. (2001). Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nature Medicine, 7, 673–679, June.
Ding, C., & Peng, H. (2003). Minimum redundancy feature selection from microarray gene expression data. Bioinformatics Conference, 2003. CSB 2003. Proceedings of the 2003 IEEE (pp. 523–528).
Gevaert, O., Smet, F. D., Timmerman, D., Moreau, Y., & Moor, B. D. (2006). Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics, 22, 184–190.
Kung, S. Y., & Mak, M. W. (2008). Machine learning for bioinformatics: An introduction to engineers. Cambridge: Cambridge University Press.
Mak, M. W., & Kung, S. Y. (2006). A solution to the curse of dimensionality problem in pairwise scoring techniques. In Int. conf. on neural information processing (pp. 314–323).
Jafari, P., & Azuaje, F. (2006). An assessment of recently published gene expression data analyses: Reporting experimental design and statistical factors. BMC Medical Informatics, 6(27), 27–35.
Baldi, P., & Brunak, S. (2001). Bioinformatics: The machine learning approach (2nd ed.). Cambridge: MIT.
Fox, R. J., & Dimmic, M. W. (2006). A two-sample Bayesian t-test for microarray data. BMC Bioinformatics, 7, 126.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2002). Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97, 77–88.
Ben-Dor, A., Bruhn, L., Friedman, N., Nachman, I., Schummer, M., & Yakhini, Z. (2000). Tissue classification with gene expression profiles. Journal of Computational Biology, 7, 559–583.
Mak, M. W., & Kung, S. Y. (2008). Fusion of feature selection methods for pairwise scoring SVM. Neurocomputing, special issue for ICONIP’06.
Guyon, I., Weston, J., Barnhill, S., & Vapnik, V. (2002). Gene selection for cancer classification using support vector machines. Machine Learning, 46, 389–422.
Zhang, X. G., Lu, X., Shi, Q., Xu, X. Q., Leung, H. C. E., Harris, L. N., et al. (2006). Recursive SVM feature selection and sample classification for mass-spectrometry and microarray data. BMC Bioinformatics, 7(197), 197–210.
Golub, T. R., Slonim, D. K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J. P., Coller, H., et al. (1999). Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science, 286, 531–537, Oct.
Dudoit, S., Fridlyand, J., & Speed, T. P. (2000). Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576, Dept. of Statistics, University of California, Berkeley, Berkeley, CA 94720-3860.
Smith, T. F., & Waterman, M. S. (1981). Comparison of biosequences. Advances in Applied Mathematics, 2, 482–489.
Huang, Y., & Li, Y. D. (2004). Prediction of protein subcellular locations using fuzzy K-NN method. Bioinformatics, 20(1), 21–28.
Saeys, Y., Inza, I., & Larrañaga, P. (2007). A review of feature selection techniques in bioinformatics. Bioinformatics, 23(19), 2507.

Acknowledgements

This work was supported in part by the Research Grants Council of the Hong Kong SAR (Project No. PolyU 5241/07E, PolyU 5251/08E, and A-PH18).

Author information

Authors and Affiliations

Princeton University, Princeton, NJ, USA
S. Y. Kung & Yuhui Luo
National Chung Hsing University, 250 Kuo Kuang Rd., Taichung, 402, Taiwan, Republic of China
S. Y. Kung
Department of Electronic and Information Engineering, The Hong Kong Polytechnic University, Hong Kong, SAR
Man-Wai Mak

Corresponding author

Correspondence to Yuhui Luo.

Additional information

Based on SY Kung’s Keynote Paper, Proceedings, IEEE Workshop on Machine Learning for Signal Processing, Thessaloniki, Greece, August 27–29, 2007.

The research was conducted in part while S.Y. Kung was on leave with the National Chung-Hsing University as a Chair Professor.

About this article

Cite this article

Kung, S.Y., Luo, Y. & Mak, MW. Feature Selection for Genomic Signal Processing: Unsupervised, Supervised, and Self-Supervised Scenarios. J Sign Process Syst 61, 3–20 (2010). https://doi.org/10.1007/s11265-008-0273-8

Received: 02 March 2008
Revised: 15 July 2008
Accepted: 02 September 2008
Published: 09 October 2008
Issue Date: October 2010
DOI: https://doi.org/10.1007/s11265-008-0273-8
