Pattern Recognition Letters

Volume 31, Issue 14, 15 October 2010, Pages 2138-2146
Using Gene Ontology annotations in exploratory microarray clustering to understand cancer etiology

https://doi.org/10.1016/j.patrec.2010.01.006

Abstract

Gene expression profiling provides insight into the functions of genes at a molecular level. Clustering gene expression profiles can help identify the underlying biological program that drives the co-expression of genes. Standard clustering methods, which group genes based on similar expression values, fail to capture weak expression correlations, potentially causing genes in the same biological process to be grouped separately. We have developed a novel clustering algorithm that incorporates functional gene information from the Gene Ontology into the clustering process, resulting in more biologically meaningful clusters. We have validated our method using two multi-cancer microarray datasets. In addition, we show the potential of such methods for the exploration of cancer etiology.

Introduction

Gene expression profiling using microarrays has become a key tool in the analysis of biological systems at a molecular level. While microarrays still produce relatively noisy data, much improvement has been made in noise-correcting normalisation procedures and feature selection, providing rich datasets for further biological analysis. Microarray analysis pipelines generally come in two flavours: differential expression analysis and exploratory clustering. The purpose of differential expression analysis is to find a small subset of genes which are differentially expressed between two or more experimental conditions or samples. Having a small gene set makes for more manageable biological interpretation than using the 20,000 genes that are typically profiled on an array. In contrast, exploratory clustering attempts to utilise all genes on an array for biological interpretation by considering sets of genes with similar expression patterns, rather than a per-gene analysis. This is useful under the assumption that genes with shared expression patterns have similar function or are involved in similar biological processes. Each of the identified clusters of genes provides a starting point for further biological analysis. Once clusters have been determined, a resource such as the Gene Ontology (Ashburner et al., 2000) is usually used to assist in determining the biological process represented by a set of genes.

The Gene Ontology (GO) (Ashburner et al., 2000) is a curated, structured vocabulary that describes genes and gene products. It is modeled as a directed acyclic graph, with terms as nodes and relationships between terms as arcs. A node (term) can have one or more parents, each representing a more general description of the term. A node may also have children that are more specific definitions of the term. The graph is hierarchical with three top parent nodes: molecular function, biological process and cellular component. In the GO, two genes may be annotated to the same term, or they may be related through a shared term higher in the GO hierarchy. Given a set of genes, tools that calculate which terms are statistically overrepresented in the set are commonly used to describe the biological process represented by the set. Examples include GeneMerge (Castillo-Davis and Hartl, 2003), FatiGO (Al-Shahrour et al., 2004) and others (Martin et al., 2004; Lee et al., 2004; Alexa et al., 2006; Zhong et al., 2004).
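As an illustration of what "statistically overrepresented" can mean in practice, the following is a minimal Python sketch of a hypergeometric (one-sided) over-representation test. It is not the implementation of any of the tools cited above; the function names, the gene-to-terms input format and the toy data are assumptions made for this example, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(population, annotated, sample, hits):
    """Upper-tail probability P(X >= hits) when `sample` genes are drawn
    without replacement from `population` genes, `annotated` of which
    carry the term of interest."""
    total = comb(population, sample)
    upper = min(annotated, sample)
    tail = sum(
        comb(annotated, k) * comb(population - annotated, sample - k)
        for k in range(hits, upper + 1)
    )
    return tail / total


def overrepresented_terms(cluster, annotations, alpha=0.05):
    """Return {term: p-value} for terms whose p-value is at most `alpha`.

    `annotations` maps gene -> set of terms (a hypothetical input format);
    the background population is every gene present in `annotations`.
    """
    population = len(annotations)

    # How many genes in the background carry each term.
    term_counts = {}
    for terms in annotations.values():
        for term in terms:
            term_counts[term] = term_counts.get(term, 0) + 1

    # How many genes in the cluster carry each term.
    cluster_genes = [g for g in cluster if g in annotations]
    cluster_counts = {}
    for gene in cluster_genes:
        for term in annotations[gene]:
            cluster_counts[term] = cluster_counts.get(term, 0) + 1

    result = {}
    for term, hits in cluster_counts.items():
        p = hypergeom_pvalue(
            population, term_counts[term], len(cluster_genes), hits
        )
        if p <= alpha:
            result[term] = p
    return result


# Toy data: gene names and annotations are invented for illustration only.
toy_annotations = {
    "g1": {"apoptosis"},
    "g2": {"apoptosis"},
    "g3": {"apoptosis", "cell cycle"},
    "g4": {"cell cycle"},
    "g5": {"cell cycle"},
    "g6": {"cell cycle"},
}
enriched = overrepresented_terms({"g1", "g2", "g3"}, toy_annotations, alpha=0.1)
```

In this toy case only "apoptosis" passes the threshold, because all three cluster genes carry it while only three of the six background genes do. Real analyses would use a much larger background and correct for the number of terms tested.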

While a useful procedure, exploratory clustering commonly faces a difficult problem. Clusters can be dominated by strong or noisy expression patterns, forcing genes of similar function, or genes belonging to the same biological process but with less correlated expression, to join another cluster. The resulting clusters may therefore not represent a biological process in its entirety or majority, making it hard to determine which molecular processes a particular cluster of genes represents. To improve the clustering process, additional information can be introduced to ensure that genes with similar function or shared pathways are clustered together. Sequence similarity, protein structure similarity, and shared pathways and functions are all ways in which genes can be shown to be related. In this paper, we focus on using the GO in the clustering process. While we use only the GO as our additional information source, it is possible that another source of information might be used to further improve the clustering output.

Previous attempts have been made to utilize functional information in the clustering of gene expression profiles; however, these have focused mainly on the task of predicting the function of genes with unknown function. The task in this case is to cluster all genes with known function and attempt to assign genes with unknown function to one of these clusters. The unknown function is then inferred from the genes with known function. Huang et al. (2006) and Pan et al. (2006) used functional annotations shared between genes to modify standard distance-based and model-based clustering algorithms. Boratyn et al. (2007) proposed a general method for modifying the distance measure based on prior shared functional information between genes. However, these methods use only small numbers of distinct functional categories, which does not apply well when using the GO. The multiple shared functions between genes and the large structure of the GO would require significant pruning of the GO graph to work in these frameworks. Cheng et al. (2004) attempted to address this by developing a clique-finding algorithm for the GO and used the cliques to perform co-clustering analysis with gene expression profiles. Another attempt, developed by Liu et al. (2004), is a biclustering approach that prunes possible cluster assignments based on the GO structure.

There are, however, two fundamental drawbacks with these approaches. Firstly, the GO is constructed as a directed acyclic graph, with terms lower in the hierarchy being specialisations of, or parts of, terms higher in the hierarchy. Genes are then annotated to one or more terms in the graph, at the lowest (most specific) level possible. Drawing a path from one gene to another through this graph to determine the similarity of the genes does not necessarily imply shared biology. The abstraction of terms across each level of the ontology can be such that two genes with a single shared parent term may be extremely diverse in terms of their specific function. For example, the two terms "negative regulation of steroid metabolic process" and "positive regulation of steroid metabolic process" share the parent "steroid metabolic process". Genes annotated to each of these terms have opposite effects on steroid metabolism. Therefore it would not be correct to state that they have similar function based on their shared parent, especially in the context of their co-expression. Secondly, having genes annotated to the same term does not necessarily imply that they have similar function or share a biological pathway in the context of their expression patterns. A single gene can act differently in various biological contexts and thus have context-specific roles. It is therefore crucial to consider the expression context of a gene when deciding whether to use the knowledge of shared function to alter the clustering procedure. We define a gene’s expression context to be the expression of a gene when considering the expression of all genes that are in the same biological process. It is only within this context that one can make an informed decision on whether a certain gene should be considered to have a certain function.
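The first drawback can be made concrete with a small Python sketch. The graph below is a toy fragment built only from the terms named in the example above: the child-to-parent dictionary, the direct arc from "steroid metabolic process" to "biological process" (intermediate terms are omitted) and the helper names are assumptions for illustration, not the real GO structure or any published tool.

```python
# child term -> set of parent terms (toy fragment; assumed for illustration)
PARENTS = {
    "negative regulation of steroid metabolic process": {
        "steroid metabolic process"
    },
    "positive regulation of steroid metabolic process": {
        "steroid metabolic process"
    },
    "steroid metabolic process": {"biological process"},
    "biological process": set(),
}


def ancestors(term, parents):
    """Collect every term reachable by following parent arcs upwards."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def shared_ancestors(term_a, term_b, parents):
    """Terms that lie above both `term_a` and `term_b` in the graph."""
    return ancestors(term_a, parents) & ancestors(term_b, parents)


common = shared_ancestors(
    "negative regulation of steroid metabolic process",
    "positive regulation of steroid metabolic process",
    PARENTS,
)
```

A purely structural similarity measure would treat the non-empty set `common` as evidence that the two terms are related, even though genes annotated to them act in opposite directions.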

Our goal differs slightly from that of previous approaches: we are not attempting to predict the function of genes with unknown function, but to generate clusters of genes that are suitable for biological interpretation and that encapsulate a particular biological process better than the clusters produced by a standard clustering approach.

A method is needed that uses shared functional information between genes from the GO, does not rely on GO structure, and uses GO annotations only when they are relevant to the gene set of interest (the gene’s expression context). We previously developed GOMAC (Gene Ontology assisted Microarray Clustering), a modified k-means clustering algorithm which incorporates GO information only when it is relevant to the gene’s expression context, thus avoiding problems with irrelevant gene similarities (Macintyre et al., 2008). This paper is an extension of the original manuscript. It validates the method on two microarray datasets (Tothill et al., 2005; Ramaswamy et al., 2001), spanning 12 and 10 cancer types respectively, and demonstrates that our method is an alternative to k-means clustering that provides clusters which are more informative in terms of biological interpretation. We also discuss the biological implications of our results with respect to future research in cancer etiology.

Methods

The key biological assumption of the algorithm presented in this paper is that genes that share a particular annotation in the GO will share a detectable similarity in their microarray expression pattern. There are three key differences between our approach and the previous attempts at clustering using the GO outlined above:

  • Only GO terms that are statistically overrepresented within a cluster are used to calculate the similarity between genes. This ensures that only GO terms within the gene’s

Cancer microarray test data

For testing, microarray datasets with various sample classes were required to demonstrate the potential of GOMAC to uncover biological similarities across classes. We used two published datasets profiling cancers of unknown primary (CUP): Tothill et al. (2005), which comprises cDNA microarrays across 12 cancer types and their subtypes, and Ramaswamy et al. (2001), which comprises 10 cancer types profiled using the Affymetrix Hu6800 platform. These datasets were useful for our purposes as they have samples in a range

Implementation

The previous version of the GOMAC algorithm was implemented in Perl and used the software GeneMerge (Castillo-Davis and Hartl, 2003) for calculating the over-representation of terms. In order to handle larger datasets and to interface with the latest version of the Gene Ontology, we have re-implemented the experimental set-up in C, using a memory-resident database as an internal data structure. Using a newer version of the GO (October, 2008) compared to the previous publication (Macintyre et

Clustering performance assessment

External clustering assessment typically compares a clustering against a ‘gold standard’ clustering determined by external means. However, in the case of exploratory clustering, there is no ‘gold standard’. Instead, when clustering microarrays, the standard way to determine whether a new algorithm provides biologically better clusters than a previous algorithm is to look for statistically overrepresented GO terms in each of the clusters and show that the new algorithm has clusters of

Results

Before the performance of our algorithm can be assessed, a value of C, the number of clusters, is required. Only after this can we use the performance criteria outlined above to compare GOMAC with other clustering methods.

Discussion and future work

The Gene Ontology is usually used only after the clustering of genes and samples has been done. Here we reasoned that since multiple genes are coordinately expressed by means of biological programs, such as cell types and organs, the use of the GO in the process of clustering would focus the analysis on the driving program rather than individual genes.

We have shown through our analysis that incorporation of additional biological information into the microarray clustering process in a

Acknowledgements

This work was supported by the following grants: ID 400107 NHMRC, W81XWH04-1-0336 DoD OCRP, Komen for the Cure BCTR0707358, and NBCF 509292. This work was partially supported by NICTA. NICTA is funded by the Australian Government’s Department of Communications, Information Technology and the Arts, the Australian Research Council through Backing Australia’s Ability, and the ICT Centre of Excellence programs. GM is supported by an Australian Postgraduate Award and a NICTA Research Project Award.

References (18)

  • F. Al-Shahrour et al. FatiGO: a web tool for finding significant associations of Gene Ontology terms with groups of genes. Bioinformatics (2004)
  • A. Alexa et al. Improved scoring of functional groups from gene expression data by decorrelating GO graph structure. Bioinformatics (2006)
  • M. Ashburner et al. Gene Ontology: tool for the unification of biology. Nat. Genet. (2000)
  • Boratyn, G.M., Datta, S., Datta, S., 2007. Incorporation of biological knowledge into distance for clustering genes....
  • C.I. Castillo-Davis et al. GeneMerge–post-genomic analysis, data mining, and hypothesis testing. Bioinformatics (2003)
  • J. Cheng et al. A knowledge-based clustering algorithm driven by Gene Ontology. J. Biopharmaceut. Stat. (2004)
  • M.B. Eisen et al. Cluster analysis and display of genome-wide expression patterns. Proc. Nat. Acad. Sci. (1998)
  • D. Huang et al. Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics (2006)
  • King, A., Gottlieb, E., 2009. Glucose metabolism and programmed cell death: an evolutionary and mechanistic...
There are more references available in the full text version of this article.