Elsevier

Pattern Recognition

Volume 39, Issue 12, December 2006, Pages 2464-2477

Multi-objective evolutionary biclustering of gene expression data

https://doi.org/10.1016/j.patcog.2006.03.003

Abstract

Biclustering, or simultaneous clustering of both genes and conditions, has generated considerable interest over the past few decades, particularly related to the analysis of high-dimensional gene expression data in information retrieval, knowledge discovery, and data mining. The objective is to find sub-matrices, i.e., maximal subgroups of genes and subgroups of conditions where the genes exhibit highly correlated activities over a range of conditions. Since these two objectives are mutually conflicting, they become suitable candidates for multi-objective modeling. In this study, a novel multi-objective evolutionary biclustering framework is introduced by incorporating local search strategies. A new quantitative measure to evaluate the goodness of the biclusters is developed. The experimental results on benchmark datasets demonstrate better performance as compared to existing algorithms available in the literature.

Introduction

Microarray experiments produce gene expression patterns that offer enormous information about cell function. This is useful while investigating complex interactions within the cell [1]. Microarrays are used in the medical domain to produce molecular profiles of diseased and normal tissues of patients. Such profiles are useful for understanding various diseases, and aid in more accurate diagnosis, prognosis, treatment planning, as well as drug discovery. Being typically high-dimensional, gene expression data requires appropriate mining strategies like feature selection and clustering [2] for further analysis.

Biological networks relate genes, gene products or their groups (like protein complexes or protein families) to each other in the form of a graph. Clustering of gene expression patterns is being used to generate gene regulatory networks [3]. A major cause of coexpression of genes is their sharing of the regulation mechanism (coregulation) at the sequence level. Clustering of coexpressed genes into biologically meaningful groups helps in inferring the biological role of an unknown gene that is coexpressed with one or more known genes.

A cluster is a collection of data objects which are similar to one another within the same cluster but dissimilar to the objects in other clusters [4]. The problem is to group N patterns into n_c possible clusters with high intra-class similarity and low inter-class similarity by optimizing an objective function. In objective function-based clustering algorithms, the goal is to find a partition for a given value of n_c. Clustering approaches applied to gene expression data include partitional, hierarchical, grid-based and density-based methods [5], to name a few. Here the genes are typically partitioned into disjoint or overlapped groups according to the similarity of their expression patterns over all conditions.

It is often observed that a subset of genes are coregulated and coexpressed under a subset of conditions, but behave almost independently under other conditions. Here the term “conditions” can imply environmental conditions as well as time points corresponding to one or more such environmental conditions. Biclustering attempts to discover such local structure inherent in the gene expression matrix. It refers to the simultaneous clustering of both genes and conditions in the process of knowledge discovery about local patterns from microarray data [6]. This also allows detection of overlapped groupings among the biclusters, thereby providing a better representation of the biological reality involving genes with multiple functions or those regulated by many factors. For example, a single gene may participate in multiple pathways that may or may not be coactive under all conditions.

It may be noted that clustering approaches compute global models, while biclustering techniques focus on local models. Some of the existing nomenclature for biclustering, particularly in other application fields, are bidimensional clustering, subspace clustering and coclustering.

There has been a lot of research in biclustering [7], [8], [9] involving statistical and graph-theoretic techniques. The pioneering work by Cheng and Church [6] employs a set of heuristic algorithms to find one or more biclusters in gene expression data, based on a uniformity criterion. One bicluster is identified at a time, iteratively. There are iterations of masking null values and discovered biclusters (replacing relevant cells with random numbers), coarse and fine node deletion, node addition, and the inclusion of inverted data. The computational complexity for discovering k biclusters is of the order of O(mn×(m+n)×k), where m and n are the number of genes and conditions, respectively.
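To make the node-deletion idea concrete, the following is a minimal sketch of a single (fine) node deletion loop as it is commonly described for the Cheng and Church approach: the row or column contributing most to the mean squared residue is removed until the score falls below a threshold. The function names, the plain list-of-lists matrix, and the stopping rule are illustrative assumptions; masking, coarse deletion, node addition and inverted rows are deliberately omitted.

```python
def residue_scores(matrix, rows, cols):
    """Return (msr, row_scores, col_scores) for the submatrix rows x cols."""
    row_mean = {i: sum(matrix[i][j] for j in cols) / len(cols) for i in rows}
    col_mean = {j: sum(matrix[i][j] for i in rows) / len(rows) for j in cols}
    overall = sum(row_mean.values()) / len(rows)
    # Squared residue of each cell with respect to its row, column and overall means.
    sq = {(i, j): (matrix[i][j] - row_mean[i] - col_mean[j] + overall) ** 2
          for i in rows for j in cols}
    msr = sum(sq.values()) / (len(rows) * len(cols))
    row_scores = {i: sum(sq[i, j] for j in cols) / len(cols) for i in rows}
    col_scores = {j: sum(sq[i, j] for i in rows) / len(rows) for j in cols}
    return msr, row_scores, col_scores


def single_node_deletion(matrix, delta):
    """Greedily drop the worst row or column until the MSR is at most delta."""
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0])))
    while len(rows) > 1 and len(cols) > 1:
        msr, row_scores, col_scores = residue_scores(matrix, rows, cols)
        if msr <= delta:
            break
        worst_row = max(rows, key=row_scores.get)
        worst_col = max(cols, key=col_scores.get)
        # Remove whichever node has the larger mean squared residue.
        if row_scores[worst_row] >= col_scores[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols
```

On a small matrix whose first three rows differ only by a constant shift and whose last row is noisy, this sketch would be expected to discard the noisy row and keep all columns.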

Sometimes the masking procedure may result in a phenomenon of random interference, thereby adversely affecting the subsequent discovery of high quality biclusters. In order to circumvent this problem, a two-phase probabilistic algorithm termed flexible overlapped clusters (FLOC) [10] is designed to simultaneously discover a set of possibly overlapping biclusters. Initial biclusters (or seeds) are chosen randomly from the original data matrix. Iterative gene and/or condition additions and/or deletions are performed with a goal of achieving the best potential residue reduction. The time complexity of FLOC is lower for p iterations (p ≪ n+m), i.e., O((n+m)^2×k×p).

The Plaid model [11] tries to capture the approximate uniformity in a submatrix of the gene expression data, while discovering one bicluster at a time in an iterative process. The input matrix is described as a linear function of variables corresponding to its biclusters, and an iterative maximization process is pursued for estimating the function. It searches for patterns where the genes differ in their expression levels by a constant factor.

Bipartite graphs are employed in Ref. [12], with a bicluster being defined as a subset of genes that jointly respond across a subset of conditions. The objective is to identify the maximum-weighted subgraph. Here a gene is considered to be responding under a condition, with an edge connecting the two, if its expression level under that condition changes significantly with respect to its normal level. This involves an exhaustive enumeration, with a restriction on the number of genes that can appear in the bicluster. All biclusters are discovered simultaneously. It may be noted that in all these methods it is possible to generate overlapped gene clusters.

A coupled two-way iterative method [13] has been devised to iteratively generate a set of biclusters at a time in cancer datasets. In the process it repeatedly performs one-way hierarchical clustering on the rows and columns of the data matrix, while using stable clusters of rows as attributes for column clustering and vice versa. The Euclidean distance is used as the similarity measure, after normalization of the data.

Gene ontology (GO) information, involving hierarchical functional relationships like “part of” and “overlapping”, has been incorporated into a clustering process called the smart hierarchical tendency preserving algorithm (SHTP) [14]. A fast approximate pattern matching technique has been employed [15] to determine maximum sized biclusters with a number of conditions greater than a specified minimum. The worst case complexity of the procedure is claimed to be O(m^2 n). Rich probabilistic models have been used [16] for discovering relations between expressions, regulatory motifs and gene annotations. The outcome is a collection of disjoint biclusters, generated in a supervised manner.

Efficient techniques have been successfully amalgamated in the deterministic biclustering with frequent pattern mining algorithm (DBF) [17] to generate a set of good quality biclusters. Here the changing tendency between two conditions is modeled as an item, with the genes corresponding to transactions. A frequent itemset with the supporting genes forms a bicluster. In the second phase, these are iteratively refined by adding more genes and/or conditions.

A good survey on biclustering is available in the literature [18], which categorizes the different heuristic approaches as follows:

  • Iterative row and column clustering combination [13]: apply clustering algorithms to the rows and columns of the data matrix, separately, and then combine the results using some iterative procedure.

  • Divide and conquer [7]: break the problem into smaller sub-problems, solve them recursively, and combine the solutions to solve the original problem.

  • Greedy iterative search [6], [10]: make a locally optimal choice, in the hope that this will lead to a globally good solution.

  • Exhaustive biclustering enumeration: the best biclusters are identified, using an exhaustive enumeration of all possible biclusters existent in the data, in exponential time [12].

  • Distribution parameter identification [11]: identify best-fitting parameters by minimizing a criterion through an iterative approach.

There exist a number of investigations dealing with time-series data [19], [20]. However, in this study, we will not be concerned with differentiating between time-course and condition-based gene expression data.

A greedy local search heuristic for biclustering has been reported in the literature [6]. Here similarity is computed as a measure of the coherence of the genes and conditions in the bicluster. Although greedy local search methods are by themselves fast, they often yield suboptimal solutions.

The quality of a biclustering is often considered to be more important than the computation time required to generate it. Hence genetic algorithms (GAs) [21] provide an alternative efficient search technique in a large solution space, based on the theory of evolution. GAs involve a set of evolutionary operators, like selection, crossover and mutation. A population of chromosomes is made to evolve over generations by optimizing a fitness function, which provides a quantitative measure of the fitness of individuals in the pool. Single-objective GA, with local search, has been employed for identifying biclusters in gene expression data [22].

A simulated annealing (SA) based biclustering algorithm [23] is found to provide improved performance over that of Ref. [6], and is also able to escape from local minima. Unlike classical optimization techniques like GA, which reward only improvements in the chosen fitness function, SA also allows probabilistic acceptance of a temporary deterioration in fitness scores. However, the results are often data dependent.
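The probabilistic acceptance mentioned above is usually realized with the Metropolis rule. The sketch below shows that generic rule only; it is not taken from Ref. [23], and the name `accept_move`, the sign convention (`delta` is the increase in a cost to be minimized) and the `rng` parameter are assumptions made for illustration.

```python
import math
import random


def accept_move(delta, temperature, rng=random):
    """Metropolis rule: always accept improvements, sometimes accept worse moves.

    delta is new_cost - old_cost for a cost that is being minimized.
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    if delta <= 0:
        return True  # improvement or equal cost: always accepted
    # A worse move is accepted with probability exp(-delta / temperature).
    return rng.random() < math.exp(-delta / temperature)
```

At a high temperature almost any move is accepted, while at a low temperature the rule behaves nearly greedily, which is how SA can leave a local minimum early in the search.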

When there are two or more conflicting characteristics to be optimized, the single-objective GA requires an appropriate formulation of the single fitness function in terms of an aggregation of the different criteria involved. In such situations multi-objective evolutionary algorithms (MOEAs) [24] provide an alternative, more efficient, approach to searching for optimal solutions. They have found successful application in feature selection and classification of microarray gene expression patterns [25].

In this paper we investigate the use of MOEA, in conjunction with local search heuristics, while generating and iteratively refining an optimal set of biclusters. Here the objective is to find one or more biclusters that are optimal with respect to their homogeneity and size. Since these two criteria are usually conflicting, this leads us to formulate the biclustering problem in a multi-objective framework. The fitness functions are formulated as a pair, consisting of the mean squared residue score [6] and the size of the bicluster.
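As an illustration of how two conflicting criteria can be compared without aggregating them, the sketch below applies generic Pareto dominance to candidate biclusters summarized as (mean squared residue, size) pairs, where a lower residue and a larger size are both preferred. The tuple layout and the function names are hypothetical, and this is not necessarily the selection scheme used in the proposed method.

```python
def dominates(a, b):
    """True if candidate a is at least as good as b on both criteria and better on one.

    Each candidate is an (msr, size) pair: lower msr and larger size are better.
    """
    no_worse = a[0] <= b[0] and a[1] >= b[1]
    strictly_better = a[0] < b[0] or a[1] > b[1]
    return no_worse and strictly_better


def pareto_front(candidates):
    """Keep the candidates that no other candidate dominates."""
    return [c for c in candidates
            if not any(dominates(other, c) for other in candidates)]
```

For instance, a candidate with residue 250 and size 100 is dominated by one with residue 200 and size 120, whereas a small, very homogeneous bicluster and a large, less homogeneous one can both remain on the front.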

The remaining part of this article is organized as follows. Section 2 introduces the preliminaries of gene expression data, biclustering and MOEA. The proposed multi-objective GA is presented in Section 3. A new quantitative measure for evaluating the goodness of the biclusters is proposed in Section 4. Comparative results, along with statistical significance for biological relevance, are provided in Section 5 on benchmark gene expression datasets. Section 6 concludes the article.

Section snippets

Preliminaries

In this section we briefly discuss the basic concepts of microarray gene expression data, biclustering and MOEA.

Multi-objective biclustering

MOEA is a global search heuristic, primarily used for optimization tasks. In this section we present the general framework and implementation details of MOEA for biclustering. Local search heuristics are employed to speed up convergence by refining the chromosomes.

Quantitative evaluation

The bicluster should satisfy two requirements simultaneously. On one hand, the expression levels of each gene within the bicluster should be similar over the range of conditions, i.e., it should have a low mean squared residue score. On the other hand, the bicluster should simultaneously be larger in size. Note that the mean squared residue represents the variance of the selected genes and conditions with respect to the coherence (homogeneity) of the bicluster.
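For reference, the mean squared residue is usually written as follows in the Cheng and Church formulation [6]. The notation is an assumption, since this excerpt does not define it: I and J are the selected sets of genes and conditions, and a_{ij} is the expression level of gene i under condition j.

```latex
H(I, J) = \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J}
          \left( a_{ij} - a_{iJ} - a_{Ij} + a_{IJ} \right)^2,
\qquad
a_{iJ} = \frac{1}{|J|} \sum_{j \in J} a_{ij}, \quad
a_{Ij} = \frac{1}{|I|} \sum_{i \in I} a_{ij}, \quad
a_{IJ} = \frac{1}{|I|\,|J|} \sum_{i \in I} \sum_{j \in J} a_{ij}
```

Under this definition a submatrix whose rows differ only by additive shifts has H(I, J) = 0, and a bicluster is typically accepted when H(I, J) does not exceed a threshold δ.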

In order to quantify how well the

Results

We have implemented the proposed multi-objective biclustering algorithm on microarray data consisting of two benchmark gene expression datasets, viz., Yeast and Human B-cell Lymphoma. The availability of literature on the performance of related algorithms on these datasets prompted their selection in this study. As the problem suggests, the size of an extracted bicluster should be as large as possible while satisfying a homogeneity criterion. The threshold δ was selected as 300 for Yeast data in

Conclusions

In this article we have introduced a general multi-objective framework for biclustering gene expression data, while incorporating local search for finer tuning. A qualitative measurement of the formed biclusters, along with a comparative assessment of results, is provided on two benchmark gene expression datasets to demonstrate the effectiveness of the proposed method. Biological validation of the selected genes within the biclusters has been provided by the publicly available GO consortium.

Gene

Acknowledgment

This work was supported by CSIR research Grant no. 22/0346/02/EMR-II.

References (28)

  • J. Yang, H. Wang, W. Wang, P. Yu, Enhanced biclustering on expression data, in: Proceedings of the Third IEEE Symposium...
  • L. Lazzeroni et al., Plaid models for gene expression data, Stat. Sin. (2002)
  • A. Tanay et al., Discovering statistically significant biclusters in gene expression data, Bioinformatics (2002)
  • G. Getz et al., Coupled two-way clustering analysis of breast cancer and colon cancer gene expression data, Bioinformatics (2003)

    About the author—SUSHMITA MITRA is a Professor at the Machine Intelligence Unit, Indian Statistical Institute, Kolkata. From 1992 to 1994 she was in the RWTH, Aachen, Germany, as a DAAD Fellow. She was a Visiting Professor in the Computer Science Departments of the University of Alberta, Edmonton, Canada, in 2004, Meiji University, Japan, in 1999, 2004, 2005, and Aalborg University Esbjerg, Denmark, in 2002, 2003. Dr. Mitra received the National Talent Search Scholarship (1978–1983) from NCERT, India, the IEEE TNN Outstanding Paper Award in 1994 for her pioneering work in neuro-fuzzy computing, and the CIMPA-INRIA-UNESCO Fellowship in 1996.

    She is the author of the books “Neuro-Fuzzy Pattern Recognition: Methods in Soft Computing” and “Data Mining: Multimedia, Soft Computing, and Bioinformatics” published by John Wiley. Dr. Mitra has guest edited special issues of journals, and is an Associate Editor of “Neurocomputing”. She has more than 100 research publications in refereed international journals. According to the science citation index (SCI), two of her papers have been ranked 3rd and 15th in the list of Top-cited papers in Engineering Science from India during 1992–2001. Dr. Mitra is a Senior Member of IEEE. She has served as Program Chair, Tutorial Chair, and as a member of the programme committees of many international conferences. Her current research interests include data mining, pattern recognition, soft computing, image processing, and bioinformatics.

    About the author—HAIDER BANKA received his M.Sc. and M.Tech. degrees in Computer Science from the University of Calcutta, India, in 2001 and 2003, respectively. During 2003–2004, he was a lecturer in Engineering College, Durgapur, India. Since 2004 he has been a Senior Research Fellow at the Machine Intelligence Unit, Indian Statistical Institute, Kolkata. Mr. Banka serves as a reviewer for several international journals. His current research interests include data mining, pattern recognition, soft computing, combinatorial optimization, and bioinformatics.