Neurocomputing

Volume 92, 1 September 2012, Pages 36-43

Biomarker discovery using 1-norm regularization for multiclass earthworm microarray gene expression data

https://doi.org/10.1016/j.neucom.2011.09.035

Abstract

Novel biomarkers can be discovered by mining high-dimensional microarray datasets with machine learning techniques. Here we propose a novel recursive gene selection method which can handle the multiclass setting effectively and efficiently. The selection is performed iteratively. In each iteration, a linear multiclass classifier is trained using 1-norm regularization, which leads to sparse weight vectors, i.e., many feature weights are exactly zero. Those zero-weight features are eliminated in the next iteration. The empirical results demonstrate that the selected features (genes) have very competitive discriminative power. In addition, the selection process has a fast rate of convergence.

Introduction

Discovery of novel biomarkers is one of the most important impetuses driving many biological studies, including biomedical research. In this post-genomics era, high-throughput technologies such as microarrays have been applied to measure biological systems at levels ranging from the cell to the tissue to the whole animal. In the last decade, environmental scientists, particularly ecotoxicologists, have increasingly applied omics technologies in the hunt for biomarkers that display both high sensitivity and high specificity. However, it remains a major challenge to sift through high-dimensional datasets for biomarker candidates that meet high standards and withstand experimental validation.

Previously, we developed an integrated statistical and machine learning (ISML) pipeline to analyze a multiclass earthworm gene expression microarray dataset [15]. As a continuation of this effort in biomarker discovery, here we develop a new feature selection method based on 1-norm regularization.

In machine learning, feature selection is the technique of seeking the most representative subset of features. It is a focus of research in applications whose datasets contain very large numbers of variables, e.g., text processing and gene expression data analysis. When applied to gene expression arrays, the technique detects the influential genes with which biological researchers can discriminate normal instances from abnormal ones, and therefore facilitates further biological research or judgment.

In this paper we focus only on supervised learning, meaning that a label is given for each instance; discussions of unsupervised and semi-supervised learning can be found elsewhere [25], [16], [22]. Feature selection algorithms roughly fall into two categories: variable ranking (or feature ranking) and variable subset selection [9]. The latter is further divided into wrapper, filter, and embedded methods.

Variable ranking acts as a preprocessing step or auxiliary selection mechanism because of its simplicity and scalability. It ranks individual features by a metric, e.g., correlation, and eliminates features that do not exceed a given threshold. Variable ranking is computationally efficient because it only computes per-feature scores. Nevertheless, the method considers only the predictive power of individual features, so it is prone to selecting redundant features.
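To make the ranking idea concrete, the sketch below scores each gene by the absolute Pearson correlation between its expression column and the class label and keeps those above a cutoff. This is a generic illustration, not part of the proposed method; the names `X`, `y`, and `threshold` are hypothetical, and treating a multiclass label as numeric is itself a simplification.

```python
import numpy as np

def rank_features(X, y, threshold=0.3):
    """Score each gene by |Pearson correlation| between its expression
    column and the (numerically coded) class label; keep indices of
    genes whose score exceeds the threshold."""
    yc = y - y.mean()
    y_norm = np.linalg.norm(yc)
    scores = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        xc = X[:, j] - X[:, j].mean()
        denom = np.linalg.norm(xc) * y_norm
        if denom > 0:
            scores[j] = abs(xc @ yc) / denom
    return np.where(scores > threshold)[0], scores
```

Note that each gene is scored in isolation, which is exactly why two highly correlated genes can both survive the cut.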

Variable subset selection, on the other hand, attempts to select subsets of features that jointly produce good prediction performance. Filter methods use a feature-subset relevance criterion to yield a reduced subset of features for future prediction. Wrapper methods [13] search through the space of feature subsets; each subset is fed to a machine learning model and assessed by the resulting learning performance, with the learning model acting as a black box. Embedded approaches [14] implement feature selection within the learning process itself. While wrapper methods search the space of all feature subsets, the search in embedded methods is guided by the learning algorithm, for example by estimating the change in the objective function when features are added or removed. Guyon et al. [10] proposed the Support Vector Machine Recursive Feature Elimination (SVM-RFE) algorithm, which recursively trains an SVM classifier on the instances and eliminates the feature(s) with the smallest weight(s). The number of features eliminated in each iteration is ad hoc, and there is no firm conclusion about when to terminate the recursion.
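As a point of reference, a minimal SVM-RFE loop for the binary case might look like the following; it uses scikit-learn's LinearSVC as the black-box SVM, and the per-iteration elimination count `n_drop` and stopping size `n_keep` are exactly the ad hoc choices discussed above, not values prescribed by [10].

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, n_keep=20, n_drop=1):
    """Recursively train a linear SVM and eliminate the feature(s)
    with the smallest squared weight until n_keep features remain."""
    surviving = np.arange(X.shape[1])
    while surviving.size > n_keep:
        clf = LinearSVC(max_iter=10000).fit(X[:, surviving], y)
        order = np.argsort(clf.coef_.ravel() ** 2)   # least important first
        drop = order[:min(n_drop, surviving.size - n_keep)]
        surviving = np.delete(surviving, drop)
    return surviving
```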

Most feature selection methods in the literature are designed for binary problems. When extending them to real-life multiclass tasks, combining several binary classifiers is typically suggested, such as one-versus-all and one-versus-one [23]. For a task with k classes, one-versus-all constructs k binary classifiers, each trained with all the instances of one class as positive examples and all other instances as negative examples. It is computationally expensive, and the data seen by each binary classifier are highly imbalanced. The one-versus-one method, on the other hand, constructs k(k−1)/2 binary classifiers, one for each pair of classes; an instance is assigned to the class with the majority vote. Like the one-versus-all approach, one-versus-one carries a heavy computational burden. Platt et al. [20] proposed the directed acyclic graph SVM (DAGSVM) algorithm, whose training phase is the same as one-versus-one in that it solves k(k−1)/2 binary problems; however, DAGSVM uses a rooted acyclic graph to derive a decision from the k(k−1)/2 prediction results. Other researchers have proposed methods that solve the multiclass task in one step, building a piecewise separation of the k classes in a single optimization. This idea is comparable to the one-versus-all approach: it constructs k classifiers, each separating one class from the remaining classes, but all classifiers are obtained by solving a single optimization problem. Weston and Watkins [27] proposed such a formulation of the SVM for the multiclass problem. However, solving the multiclass problem in one step results in a much larger optimization problem. Crammer and Singer [5] decomposed the dual problem into multiple optimization problems of reduced size and solved them with a fixed-point algorithm. A comparison of different methods for multiclass SVMs was given by Hsu and Lin [12].
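To fix ideas, a one-versus-all scheme for k classes can be sketched as below: k independent binary classifiers, with prediction by the largest decision value. This is the generic construction criticized above for cost and class imbalance, not the single-optimization formulation pursued in this paper; all names are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def train_one_vs_all(X, y, classes):
    """Train one binary classifier per class: class c versus the rest."""
    return [LinearSVC(max_iter=10000).fit(X, (y == c).astype(int))
            for c in classes]

def predict_one_vs_all(models, X, classes):
    """Assign each instance to the class whose classifier returns the
    largest decision value."""
    scores = np.column_stack([m.decision_function(X) for m in models])
    return np.asarray(classes)[scores.argmax(axis=1)]
```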

A multiclass optimization cost function typically comprises two parts: the empirical error and the model complexity. The model complexity is usually approximated by a regularizer, e.g., the 2-norm or the 1-norm [3]. The 1-norm has been advocated in many applications, such as multi-instance learning [4], ranking [19], and boosting [6], because of its sparsity-favoring property. Several studies have discussed the multiclass problem based on 1-norm regularization with various loss functions for the empirical error. For example, Friedman et al. [8] introduced the 1-norm into multinomial logistic regression, which is capable of handling multiclass classification problems. Bi et al. [2] chose the ϵ-insensitive loss function. Liu and Shen [17] defined a specific loss function, ψ-loss, that replaces the convex SVM loss function with a nonconvex one. Other works mainly used the hinge loss with different variations [24], [26]. The hinge loss function we apply in this paper is similar to that in [27], but has not been used in any 1-norm multiclass work.
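For orientation, the Weston–Watkins style hinge loss of [27], which the loss used here resembles, charges a margin violation for every competing class; combined with a 1-norm regularizer, the objective takes roughly the following form (the trade-off parameter λ and the unconstrained formulation are illustrative; the exact formulation used in this paper appears in Section 2):

\[
\min_{\{w_c,\, b_c\}} \;\; \sum_{i=1}^{l} \sum_{c \neq y_i} \max\Bigl(0,\; 1 - \bigl(f_{y_i}(x_i) - f_c(x_i)\bigr)\Bigr) \;+\; \lambda \sum_{c=1}^{k} \lVert w_c \rVert_1,
\]

where f_c(x) = w_c^T x + b_c is the linear decision function for class c.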

Feature selection under the framework of 1-norm multiclass regularization is achieved by discarding the least significant features, i.e., features with zero weights. The sparsity of the weights is determined by a regularization parameter that controls the trade-off between empirical error and model complexity. However, selecting a proper regularization parameter is a challenging problem. We only know in which direction to tune the parameter to make the number of selected features smaller or larger; it is difficult to associate a parameter value with a particular feature subset while also achieving high learning performance, unless the entire regularization path is computed. Because the 1-norm is non-differentiable (as is the hinge loss), computing the exact regularization path is difficult (some other loss functions, such as the logistic loss, have well-defined gradients). Even though the regularization path is piecewise linear, path-following methods are slow for large-scale problems. Instead of computing an approximate regularization path, we introduce an iterative 1-norm multiclass feature selection method that selects a small number of features with high performance.
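One pragmatic alternative to path computation, sketched below under stated assumptions, is to probe a small grid of regularization strengths and record how many features survive at each. The sketch substitutes scikit-learn's L1-penalized LinearSVC for the paper's linear-program formulation, so it illustrates only the parameter-versus-sparsity trend.

```python
import numpy as np
from sklearn.svm import LinearSVC

def sparsity_profile(X, y, C_grid):
    """For each regularization strength C, count the features that get a
    non-zero weight in at least one class's weight vector."""
    profile = []
    for C in C_grid:
        clf = LinearSVC(penalty='l1', dual=False, C=C,
                        max_iter=10000).fit(X, y)
        nonzero = np.any(clf.coef_ != 0, axis=0)   # (n_features,) mask
        profile.append((C, int(nonzero.sum())))
    return profile
```

Smaller C (stronger regularization) drives more weights to zero, but the mapping from a C value to a particular feature subset remains opaque, which motivates the iterative scheme introduced next.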

In this paper, we propose a multiclass 1-norm regularization feature selection method, L1MR (Linear 1-norm Multiclass Regularization), and a simple variation, SL1MR, each of which solves a single linear program. An iterative feature elimination framework is proposed to obtain a minimal feature subset. The sparsity-favoring property of 1-norm regularization enables fast convergence of the iterative feature elimination process; in our empirical studies, the algorithm typically converges in no more than ten iterations. The remainder of the paper is organized as follows. Section 2 proposes the 1-norm multiclass regularization. Section 3 describes the iterative feature elimination process. Section 4 demonstrates the experimental results. Conclusions are presented in Section 5 along with a discussion of future work.

Section snippets

Learning a multiclass linear classifier via 1-norm regularization

Consider a set of l instances (x, y) drawn from an unknown fixed distribution, where the input x ∈ R^n is the earthworm microarray gene expression data and the output y is the class label. In a k-category classification task, y is coded as {1, …, k}. For the earthworm data studied in this article, k = 3 (control, TNT, RDX), n = 869, and l = 248.

Given k linear decision functions f_1, …, f_k, where f_c corresponds to class c, each decision function is defined as f_c(x) = w_c^T x + b_c, c = 1, …, k, where the parameters are w_c = [w_{c,1}, …, w_{c,n}]^T ∈ R^n and b_c ∈ R.
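With the parameters in hand, classification follows the winner-takes-all rule x ↦ argmax_c f_c(x). A minimal sketch (using 0-based class indices rather than the paper's {1, …, k} coding):

```python
import numpy as np

def predict(X, W, b):
    """Winner-takes-all multiclass prediction.
    X: (l, n) data matrix; W: (k, n) stacked weight vectors w_c;
    b: (k,) biases. Returns a 0-based class index per instance."""
    return (X @ W.T + b).argmax(axis=1)
```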

Recursive feature elimination via 1-norm regularization

It has frequently been observed that 1-norm regularization drives many feature weights to exactly zero. This makes it a natural feature selection process, in which features with zero weight values can be discarded without risk. In this paper, we say a feature j has zero weight if all k values w_{1,j}, w_{2,j}, …, w_{k,j} in the k weight vectors are zero; otherwise, the feature has non-zero weight. The purpose of feature selection is to choose a small subset of features and achieve good prediction performance.
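Under this definition, the elimination loop is straightforward: refit the 1-norm regularized multiclass model, drop every feature whose k weights are all zero, and stop when nothing is dropped. The sketch below substitutes scikit-learn's L1-penalized LinearSVC for the paper's linear program, so it conveys the recursion rather than L1MR/SL1MR themselves; `C` and `max_rounds` are illustrative.

```python
import numpy as np
from sklearn.svm import LinearSVC

def recursive_l1_selection(X, y, C=1.0, max_rounds=10):
    """Iteratively refit an L1-regularized linear multiclass model and
    discard features with zero weight in all k weight vectors."""
    surviving = np.arange(X.shape[1])
    for _ in range(max_rounds):
        clf = LinearSVC(penalty='l1', dual=False, C=C,
                        max_iter=10000).fit(X[:, surviving], y)
        keep = np.any(clf.coef_ != 0, axis=0)   # feature used by some class?
        if keep.all():                          # converged: nothing dropped
            break
        surviving = surviving[keep]
    return surviving
```

The fast convergence noted in the abstract corresponds to this loop stabilizing within a handful of rounds.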

Experimental results

In this section, we first present the microarray experiment design. We then demonstrate the results.

Conclusion and future work

In this paper, we propose a novel multiclass joint feature selection and classification method, L1MR, and its simplified variation, SL1MR. Both methods formulate multiclass classification as an optimization problem using 1-norm regularization. Because the 1-norm penalty tends to yield sparse solutions, the proposed formulation has an embedded feature selection property. Combined with the idea of recursive feature selection, L1MR and SL1MR identify a small subset of discriminative features effectively.

Acknowledgment

Xiaofei Nan, Nan Wang, Chaoyang Zhang, Yixin Chen, and Dawn Wilkins were supported in part by the US National Science Foundation under award number EPS-0903787. Ping Gong was supported by U.S. Army Environmental Quality Technology Research Program. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the National Science Foundation. Permission to publish this information was granted by the U.S. Army


References (31)

  • R. Kohavi et al., Wrappers for feature subset selection, Artif. Intell. (1997)
  • T. Suzuki et al., Valosine-containing proteins (VCP) in an annelid: identification of a novel spermatogenesis related factor, Gene (2005)
  • C. Ambroise et al., Selection bias in gene extraction on the basis of microarray gene-expression data, Proc. Natl. Acad. Sci. USA (2002)
  • J. Bi et al., Dimensionality reduction via sparse support vector machines, J. Mach. Learn. Res. (2003)
  • O. Chapelle et al., Multi-class feature selection with support vector machines
  • Y. Chen et al., MILES: multiple-instance learning via embedded instance selection, IEEE Trans. Pattern Anal. Mach. Intell. (2006)
  • K. Crammer et al., On the algorithmic implementation of multiclass kernel-based vector machines, J. Mach. Learn. Res. (2001)
  • J. Duchi et al., Boosting with structural sparsity
  • C. Dwork et al., Rank aggregation methods for the web
  • J. Friedman et al., Regularization paths for generalized linear models via coordinate descent, J. Statist. Software (2010)
  • I. Guyon et al., An introduction to variable and feature selection, J. Mach. Learn. Res. (2003)
  • I. Guyon et al., Gene selection for cancer classification using support vector machines, Mach. Learn. (2002)
  • M.A. Hall, Correlation-Based Feature Selection for Machine Learning, Ph.D. Thesis, Department of Computer Science,...
  • C. Hsu et al., A comparison of methods for multiclass support vector machines, IEEE Trans. Neural Networks (2002)
  • T.N. Lal et al., Embedded methods


Xiaofei Nan is a Ph.D. student in Computer and Information Science at the University of Mississippi. Her main research interests are machine learning and artificial intelligence. She received her bachelor's degree in Automatic Control and her master's degree in Pattern Recognition and Intelligent Systems from Northeastern University, China.

Nan Wang is an Assistant Professor in the School of Computing at the University of Southern Mississippi. Her research in computational biology focuses on several areas: (1) microarray data analysis, including biomarker identification, generation of reference gene regulatory networks for non-model organisms, and discovery of changes in pathways under different chemical conditions; (2) glycan classification, for the purpose of identifying the glycan molecules and the structural features in these glycans that determine canine influenza infection; (3) development of new methods for microRNA identification based on genome sequence; and (4) biological database construction, with a primary goal of developing a platform for biologists in the EPSCoR community.

    Ping Gong is a senior Army contract scientist. He received his Ph.D. in environmental toxicology and has applied bioinformatics, genomics, molecular and computational biology to decode toxicological mechanisms and discover novel biomarkers. His current interests include toxicant effects on mRNA/microRNA expression and copy number variation, systems/synthetic biology, genome re-sequencing and assembly, gene regulatory networks inference, and bioinformatic/computational tool development.

Chaoyang Zhang joined the Department of Computer Science at the University of Southern Mississippi as an assistant professor in 2003. He received his Ph.D. from Louisiana Tech University in 2001 and was a research assistant professor in the Department of Computer Science at the University of Vermont from 2001 to 2003. Currently he is the Director of the School of Computing. His research interests include high performance computing (parallel, distributed, and grid computing applications and algorithms), computational biology and bioinformatics (microarray data analysis, classification, and gene network reconstruction), information technology (Web-based information retrieval, machine learning, and data mining), imaging and visualization (3D image reconstruction, information visualization, and inverse problems), and data analysis and modeling.

Yixin Chen received the B.S. degree (1995) from the Department of Automation, Beijing Polytechnic University, the M.S. degree (1998) in control theory and application from Tsinghua University, and the M.S. (1999) and Ph.D. (2001) degrees in electrical engineering from the University of Wyoming. In 2003, he received the Ph.D. degree in computer science from The Pennsylvania State University. He was previously an Assistant Professor of computer science at the University of New Orleans. He is now an Associate Professor in the Department of Computer and Information Science at the University of Mississippi. His research interests include machine learning, data mining, computer vision, bioinformatics, and robotics and control. Dr. Chen is a member of the ACM, the IEEE, the IEEE Computer Society, the IEEE Neural Networks Society, and the IEEE Robotics and Automation Society.

    Dawn Wilkins received Bachelors and Masters degrees in Mathematical Systems from Sangamon State University (now University of Illinois–Springfield) in 1981 and 1983, respectively. She received the Ph.D. in Computer Science from Vanderbilt University in 1995. Currently she is an Associate Professor in the Computer and Information Science department at the University of Mississippi, where she has been a faculty member since 1995. Her primary research interests are in the areas of Machine Learning, Computational Biology, Bioinformatics and Database Systems.
