Elsevier

Neurocomputing

Volume 73, Issues 13–15, August 2010, Pages 2562-2570

Kernel based gene expression pattern discovery and its application on cancer classification

https://doi.org/10.1016/j.neucom.2010.05.019

Abstract

Association rules have been widely used in gene expression data analysis. However, there is no systematic way to select interesting rules from the millions of rules generated from high-dimensional gene expression data. In this study, a measure based on kernel density estimation is proposed to evaluate the interestingness of association rules. Several pruning strategies are also devised to efficiently discover the approximate top-k interesting patterns. Finally, the over-fitting problem of the classification model is addressed by using a conditional independence test to eliminate redundant rules. Experimental results show the effectiveness of the proposed interestingness measure and classification model.

Introduction

With the development of genomic techniques, research in molecular biology has shifted from individual genes to the entire genome. Microarrays and SAGE are two such techniques; they can measure the expression levels of thousands of genes in a single experiment. These techniques are well suited to comparing gene expression levels in tissues under different conditions, for example, healthy versus diseased. Accurately predicting the stage of a sample is of great interest for biomedical applications.

However, high dimensionality and a small number of noisy samples pose great challenges to existing machine learning methods. The main approach to this problem is to tailor existing algorithms to the needs of gene expression data analysis, for example, support vector machines (SVM) [1], [2], neural networks [3], logistic regression [4], GA/KNN [5] and between-group analysis [6]. However, almost all of these classifiers are ‘black boxes’ and hard to interpret. Interpreting the classifier is an important task in gene expression classification: it can help biologists gain insight into biological processes and inspire further biological research.

Since the first work of Creighton [7], association rule based gene expression data analysis has received a lot of attention [8], [9], [10], because it is simple and easy to interpret. A typical association rule γ has the form LHS → RHS, where LHS is the antecedent and RHS is the consequent. Support and confidence are two commonly used interestingness measures for rules. Support, sup(γ), is the number of rows containing the item set LHS ∪ RHS; confidence, conf(γ), is computed as sup(γ)/sup(LHS).
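As a minimal sketch of these two classical measures, the following Python functions count support and compute confidence over rows represented as item sets. The function names and the toy rows are hypothetical and only illustrate the definitions above.

```python
def support(rows, items):
    """Number of rows (samples) that contain every item in `items`."""
    items = set(items)
    return sum(1 for row in rows if items <= row)


def confidence(rows, lhs, rhs):
    """conf(LHS -> RHS) = sup(LHS u RHS) / sup(LHS); 0.0 if LHS never occurs."""
    lhs_count = support(rows, lhs)
    if lhs_count == 0:
        return 0.0
    return support(rows, set(lhs) | set(rhs)) / lhs_count


# Hypothetical toy data: each row is the item set of one sample,
# with the class label stored as an ordinary item.
rows = [
    {"g1", "g2", "AML"},
    {"g1", "g2", "AML"},
    {"g1", "ALL"},
    {"g2", "g3", "ALL"},
]

print(support(rows, {"g1", "g2", "AML"}))       # rows containing LHS and RHS
print(confidence(rows, {"g1", "g2"}, {"AML"}))  # fraction of LHS rows that are AML
```

Because both measures only count rows, any two patterns that match the same rows receive identical scores, regardless of the underlying expression levels.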

In 1998, Liu [11] first used a special class of association rules, whose consequent is a single class label, to build rule based classifiers, with promising results. Since then, many different rule based classification models have been proposed. In 2004, Gao [8] successfully extended the rule based classifier to high-dimensional gene expression data, with the help of the row-combination association rule mining algorithm proposed there. For example, CST3 → AML is a rule discovered from the Leukemia gene expression data: CST3 (a gene expression pattern) is the LHS, and AML (a specific class of Leukemia) is the RHS. By taking more characteristics of the gene expression data into consideration, the results were further improved by Gao [9]. Georgii's quantitative association rules [10] and Li's emerging patterns [12] are other publications on gene expression data.

Discretization is a necessary preprocessing step before association rule mining on gene expression data; it transforms continuous gene expression levels into categorical item sets. If the expression level of gene g is higher than the discretization threshold, the gene is considered expressed; otherwise it is marked as repressed. Each of the two states is represented by its own item for g. Take the Leukemia data set [13] as an example: if the expression level of gene CST3 falls in (1419.5, +∞), it is taken as expressed; if it falls in (−∞, 1419.5], it is taken as repressed.
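The discretization step can be sketched as follows. The function name, the (gene, state) item encoding and the sample values are assumptions made for illustration; only the CST3 threshold of 1419.5 comes from the text.

```python
def discretize(sample, thresholds):
    """Map continuous expression levels to categorical items.

    sample:     dict mapping gene name -> expression level
    thresholds: dict mapping gene name -> discretization threshold
    Returns a set of (gene, state) items, where state is "expressed"
    when the level is strictly above the threshold, else "repressed".
    """
    items = set()
    for gene, level in sample.items():
        state = "expressed" if level > thresholds[gene] else "repressed"
        items.add((gene, state))
    return items


thresholds = {"CST3": 1419.5}
# Hypothetical expression levels on either side of the threshold.
print(discretize({"CST3": 2000.0}, thresholds))  # {('CST3', 'expressed')}
print(discretize({"CST3": 300.0}, thresholds))   # {('CST3', 'repressed')}
```

After this step the mining algorithm sees only the item, not the distance between the expression level and the threshold.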

However, a lot of information is lost in this discretization procedure. Consider two expression levels of the gene CST3: one is 1, the other is 1419. Although the two levels are quite different, both are considered repressed after discretization. This lost information is especially precious for small-sample gene expression data. Because of the information loss, millions of association rules have been reported with the same support and confidence [8], [9], which makes it hard to discover the truly interesting rules. Yet discretization is a necessary preprocessing step for association rule mining, and we cannot sacrifice interpretability for accuracy.

Fortunately, we find a new way to explore the original data: discover patterns in the discretized data, but evaluate them in the original data using the kernel density estimation method.

Fig. 1 illustrates the proposed kernel density estimation based approach. In the figure, the patterns g1g2 and g3g4 correspond to the shaded subspaces of Fig. 1(a) and (b), respectively. Thus, the support of g1g2 corresponds to the cumulative probability that samples fall in the shaded region of Fig. 1(a). Here, we use the kernel density estimation framework to estimate this cumulative probability. From the properties of kernel density estimation and the positions of the samples, it follows that the support of g1g2 is higher than that of g3g4, which is more reasonable than assigning g1g2 and g3g4 the same support. This concept is also closely related to subspace clustering algorithms [14].

Extended to the association rule context, an association rule can be considered a special item set in which the target is a special item. We then use a similar kernel density estimation framework to estimate the support and confidence of association rules. According to the distribution of sample classes, g1g2+ is more interesting than g3g4+, which is in accordance with our intuition.
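One way to make this idea concrete is sketched below. It assumes a product Gaussian kernel with a single shared bandwidth h, and it treats the class label as an exact (unsmoothed) item; the excerpt does not state the kernel, the bandwidth choice or the exact formulas used by the authors, so every name and parameter here is hypothetical. Under these assumptions, the kernel mass of one sample inside an axis-aligned region factorizes into one normal CDF term per gene, so no numerical integration is needed.

```python
import math


def normal_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sample_mass(x, pattern, h):
    """Probability mass that the Gaussian kernel centred at sample `x`
    places inside the region described by `pattern`.

    pattern: list of (gene_index, threshold, expressed) triples; the region
             is x[gene] > threshold if expressed, else x[gene] <= threshold.
    h:       bandwidth shared by all genes (an assumption of this sketch).
    """
    mass = 1.0
    for gene, threshold, expressed in pattern:
        below = normal_cdf((threshold - x[gene]) / h)
        mass *= (1.0 - below) if expressed else below
    return mass


def kde_support(samples, pattern, h):
    """Estimated probability that a sample falls in the pattern's region."""
    return sum(sample_mass(x, pattern, h) for x in samples) / len(samples)


def kde_confidence(samples, labels, pattern, target, h):
    """Share of the pattern's kernel mass contributed by samples of `target`."""
    total = sum(sample_mass(x, pattern, h) for x in samples)
    hit = sum(sample_mass(x, pattern, h)
              for x, c in zip(samples, labels) if c == target)
    return hit / total if total > 0 else 0.0


# Two hypothetical one-gene data sets: every sample is "expressed" in both,
# so the discretized support is identical, but the kernel-based support differs.
pattern = [(0, 0.0, True)]
print(kde_support([[5.0], [6.0]], pattern, h=1.0))  # samples far from threshold
print(kde_support([[0.1], [0.2]], pattern, h=1.0))  # samples near the threshold
```

Samples that sit far inside the region contribute almost their full mass, while samples close to a threshold contribute roughly half, which is how patterns with equal discretized support can be ranked differently.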

However, the new rule interestingness evaluation framework brings new challenges to existing association rule mining methods. Firstly, it needs to estimate the joint probability density function of the expression levels of multiple genes. Secondly, the important downward closure property does not hold for the proposed measure of support. For the first problem, kernel density estimation is used to efficiently estimate the joint probability density function without any prior assumption about the distribution function. Moreover, the cumulative density can be integrated efficiently under the kernel density estimation framework. For the second challenge, we choose to mine approximate top-k interesting rules. Several pruning strategies are also devised to accelerate the mining procedure.
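The bookkeeping behind top-k mining can be sketched with a bounded min-heap. This is a generic illustration, not the authors' algorithm: the class name, the scores and the rule labels are hypothetical, and the actual pruning strategies are not described in this excerpt. The sketch assumes k is at least 1.

```python
import heapq


class TopK:
    """Keep the k highest-scoring rules seen so far using a min-heap."""

    def __init__(self, k):
        self.k = k        # assumed to be >= 1
        self._heap = []   # (score, rule) pairs; smallest score at index 0

    def threshold(self):
        """Score a new rule must beat to enter the list (-inf until full)."""
        if len(self._heap) == self.k:
            return self._heap[0][0]
        return float("-inf")

    def offer(self, score, rule):
        """Insert the rule if it belongs among the current k best."""
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, (score, rule))
        elif score > self._heap[0][0]:
            heapq.heapreplace(self._heap, (score, rule))

    def result(self):
        """Rules sorted from most to least interesting."""
        return sorted(self._heap, reverse=True)


top = TopK(2)
for score, rule in [(0.3, "r1"), (0.9, "r2"), (0.5, "r3"), (0.1, "r4")]:
    top.offer(score, rule)
print(top.result())     # [(0.9, 'r2'), (0.5, 'r3')]
print(top.threshold())  # 0.5
```

A search procedure can compare an upper bound on a candidate's score with `threshold()` and skip the candidate when the bound is not higher, which is the usual role of a running k-th best score in pruning.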

Serious over-fitting is another problem that needs to be considered when constructing a rule based classifier on gene expression data. The main reason is that existing rule based classifiers are tailored too closely to the training set, because they select rules to cover each training sample [8], [9]. On small-sample gene expression data, this full-cover strategy causes a very serious over-fitting problem. In this study, we use a conditional independence test to eliminate redundant rules and alleviate this problem.
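The excerpt does not say which conditional independence test is used. As one possible illustration, the sketch below tests whether a rule B adds information about the class once a rule A is known, using a Pearson chi-square statistic summed over the two strata of A. All names are hypothetical, and the chi-square approximation is itself unreliable at the small sample sizes typical of gene expression data, so this only shows the shape of the idea.

```python
import math


def chi2_2x2(table):
    """Pearson chi-square statistic of a 2x2 count table.
    Returns None if any row or column total is zero (stratum uninformative)."""
    (a, b), (c, d) = table
    r1, r2, c1, c2 = a + b, c + d, a + c, b + d
    if 0 in (r1, r2, c1, c2):
        return None
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / (r1 * r2 * c1 * c2)


def is_redundant(fires_a, fires_b, is_target, alpha=0.05):
    """True if rule B looks conditionally independent of the class given
    rule A. Inputs are equal-length boolean lists, one entry per sample."""
    stat, dof = 0.0, 0
    for a_value in (True, False):
        # Contingency table of (B fires?) x (sample in target class?)
        # restricted to the samples where A's outcome equals a_value.
        table = [[0, 0], [0, 0]]
        for fa, fb, y in zip(fires_a, fires_b, is_target):
            if bool(fa) == a_value:
                table[int(bool(fb))][int(bool(y))] += 1
        chi2 = chi2_2x2(table)
        if chi2 is not None:
            stat += chi2
            dof += 1
    if dof == 0:
        return True  # B never varies within a stratum of A: nothing to add
    if dof == 1:
        p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square tail, 1 d.o.f.
    else:
        p_value = math.exp(-stat / 2.0)             # chi-square tail, 2 d.o.f.
    return p_value > alpha
```

A rule flagged as redundant in this sense fires in a way that carries no detectable extra class information beyond the rule already selected, so dropping it shrinks the classifier without forcing every training sample to be covered.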

Kernel based gene expression pattern evaluation

A typical gene expression profile can be represented by an n-dimensional vector x, where n is the size of the gene set {1, 2, …, n} and x_i is the ith gene's expression level in this sample. Suppose there are m such samples {x_1, x_2, …, x_m}, and for each sample x_j we have the corresponding state of the sample, c, where c ∈ {c_1, c_2, …, c_k}.

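Given a gene expression level x_i and the corresponding discretization threshold t_i, the discretization procedure works as follows: if x_i > t_i, the item marking gene i as expressed is generated; otherwise the item marking it as repressed is generated. Here, gi is an item,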

Kernel based gene expression pattern discovery

The proposed rule evaluation framework poses new challenges to association rule mining algorithms. Closed patterns [19], rule groups [8], [9] and similar state-of-the-art techniques cannot be used to summarize the millions of association rules. Even worse, the downward closure property does not hold for the newly proposed measure of support. Considering the trade-off between effectiveness and efficiency, we choose to mine the approximate top-k interesting association rules with some user

Usefulness in classification

Gene expression patterns are useful in many aspects of gene expression data analysis, such as predicting the state of samples and reconstructing gene regulatory networks. In this study, we focus on using gene expression patterns to predict the state of samples, in particular to distinguish cancer samples from normal ones.

CBA [11] and RCBT [9] are two popular rule based classification models. Both of them try to select enough rules to cover each training sample, which makes the classifiers are

Experiment

In this section, the usefulness of the interestingness measure and the effectiveness of the classification model are studied experimentally. The algorithm is coded in C++ in the Visual C++ 6.0 environment, and all the experiments are run on a PC with a Pentium IV 2.4 GHz CPU, 512 MB RAM and a 120 GB hard disk.

The algorithms are tested on six real-life gene expression data sets. General information on the data sets is summarized in Table 1. The colon data set [23] consists of 62 patient samples from both colon

Conclusion

In this study, a kernel density estimation based association rule evaluation framework is proposed to discover interesting gene expression rules from noisy small sample gene expression data. A corresponding mining method is also developed to efficiently mine the interesting rules using the proposed interestingness measure. The rules are explored in a carefully devised classification model. Experimental results show that rules discovered according to the new criterion are biologically

References (28)

  • X. Zhou et al., Cancer classification and prediction using logistic regression with Bayesian gene selection, Journal of Biomedical Informatics (2004)
  • M. Brown et al., Knowledge-based analysis of microarray gene expression data by using support vector machines, Proceedings of the National Academy of Sciences of the United States of America (2000)
  • I. Guyon et al., Gene selection for cancer classification using support vector machines, Machine Learning (2002)
  • A.-H. Tan et al., Predictive neural networks for gene expression data analysis, Neural Networks (2005)
  • L. Li et al., Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method, Bioinformatics (2001)
  • A.C. Culhane et al., Between-group analysis of microarray data, Bioinformatics (2002)
  • C. Creighton et al., Mining gene expression databases for association rules, Bioinformatics (2003)
  • G. Cong, A.K.H. Tung, X. Xu, F. Pan, J. Yang, Farmer: finding interesting rule groups in microarray datasets, in:...
  • G. Cong, K.L. Tan, A.K.H. Tung, X. Xu, Mining top-k covering rule groups for gene expression data, in: Proceeding of...
  • E. Georgii et al., Analyzing microarray data using quantitative association rules, Bioinformatics (2005)
  • B. Liu, W. Hsu, Y. Ma, Integrating classification and association rule mining, in: Proceeding of KDD Conference, 1998,...
  • J. Li et al., Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns, Bioinformatics (2002)
  • T.R. Golub et al., Molecular classification of cancer: class discovery and class prediction by gene expression monitoring, Science (1999)
  • H.P. Kriegel et al., Clustering high-dimensional data: a survey on subspace clustering, pattern-based clustering and correlation clustering, ACM Transactions on Knowledge Discovery from Data (2009)

    Ruichu Cai, born in 1983, holds a PhD and is a Lecturer in the Faculty of Computer Science, Guangdong University of Technology. He has published more than 10 academic papers on data mining and related areas. His current research interests include feature selection, association rule mining, clustering and their applications to gene expression data analysis.

    Zhifeng Hao, born in 1968, holds a PhD and is a Professor in the Faculty of Computer Science, Guangdong University of Technology. He has over 80 publications in journals and conference proceedings. His research interests are mainly in the fields of algebra, machine learning, bioinformatics and intelligence computation.

    Wen Wen, born in 1981, is a Lecturer in the Faculty of Computer Science, Guangdong University of Technology. Her research interests include kernel methods and pattern recognition.

    Han Huang, born in 1980, holds a PhD and is a Lecturer in the School of Software Engineering at South China University of Technology, and a Senior Research Assistant in the Department of Management Sciences, College of Business, City University of Hong Kong, Hong Kong. He has published more than 30 academic papers on evolutionary computation and biocomputing. His current research interests include ant colony optimization, genetic algorithms, evolutionary programming, particle swarm optimization and their foundations, including mathematical modeling, convergence proofs and runtime analysis.

    This work is partially supported by the National Natural Science Foundation of China (60873078), the Key Natural Science Foundation of Guangdong Province (9251009001000005, 9151600301000001), the Key Technology Research and Development Programs of Guangdong Province (2008B080701005, 2009B010800026), the Social Science Foundation of Guangdong Province (08O-01), the Open Foundation of the State Key Laboratory of Information Security (04-01), the Technology Research and Development Program of Huizhou (08-117), the Doctoral Program of the Ministry of Education (20090172120035), and the Fundamental Research Funds for the Central Universities, SCUT (2009ZM0052).
