A heavy-tailed empirical Bayes method for replicated microarray data
Introduction
DNA microarrays are recognised as a powerful tool for studying the RNA expression levels of thousands of genes simultaneously under different conditions. These experiments compare two cDNA samples labelled with different dyes (red and green) and measure the fluorescence intensity of each after hybridization. This allows differentially expressed genes to be identified and quantified across the whole array at once. Microarray experiments typically consist of intensity measurements for thousands of genes, with few if any replicates for each gene.
After normalization, the distribution of gene expression generally has heavier tails than a Gaussian distribution. It has been modelled using several densities: Cauchy (Khondoker et al., 2006), Pareto (Kuznetsov, 2001), Laplace (Purdom and Holmes, 2005), Student's t (Lonnstedt and Speed, 2002) and log-normal (Hoyle et al., 2002). Recently, we studied several microarray datasets (Salas-Gonzalez et al., 2006b) and found that the α-stable distribution fitted the gene expression distribution more satisfactorily.
Here, we derive a novel statistic that can be used to identify differential expression in microarray experiments. This statistic is based on an α-stable mixture model and the scale mixture of normals property. The well-known properties of the α-stable distribution allow us to calculate the parameters of the model. We introduce a Bayesian α-stable mixture model in which each gene belongs to one of two populations: differentially expressed or not differentially expressed. Bayesian mixture models of this kind are currently a very popular approach (Lonnstedt and Speed, 2002; Bhowmick et al., 2006; Newton et al., 2004; Do et al., 2005; Gottardo et al., 2003).
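For orientation, the generic form of the log-posterior odds in a two-population mixture model is sketched below. The symbols are illustrative notation assumed here, not necessarily the paper's: an indicator I_g for gene g (1 if differentially expressed), a prior proportion p of differentially expressed genes, the observed data M_g for gene g, and B_g for the resulting statistic.

```latex
B_g \;=\; \log \frac{\Pr(I_g = 1 \mid M_g)}{\Pr(I_g = 0 \mid M_g)}
    \;=\; \log \frac{p}{1-p} \;+\; \log \frac{f(M_g \mid I_g = 1)}{f(M_g \mid I_g = 0)}
```

A large positive B_g favours differential expression. The choice of the two densities f is where a heavy-tailed model would enter.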
The paper is organised as follows: in Section 2 we describe the main properties of the α-stable distribution; in Section 3 we present a novel statistic based on the scale mixture of normals property; in Section 4 the statistic is tested with both synthetic and real microarray data; Section 5 contains the results; and some conclusions are drawn in Section 6.
Properties
The α-stable distribution is a family of distributions which presents heavy tails and a certain degree of asymmetry. Its properties are well understood and it has been used to model impulsive phenomena in many different fields, such as biology, electrical engineering, computer science, economics, physics and astronomy (Zolotarev, 1986; Nikias and Shao, 1995). Moreover, the α-stable distribution satisfies the stability property and the generalised central limit theorem. To a certain degree this
Statistical inference
In this section we introduce a novel statistic based upon the scale mixture of normals and the heavy-tailed α-stable distribution. Let be the number of genes in each array, the number of replicates (arrays), and . We assume that the data are base 2 logarithms of red dye intensity (denoted as ) and green () suitably normalized using LOESS normalization (Yang et al., 2002). Therefore,
Our aim is to ascertain which genes are differentially expressed.
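The scale mixture of normals property referred to above can be stated, in its standard textbook form for the symmetric case, as follows. The notation (X, G, λ, γ) and the restriction to symmetric α-stable variables are assumptions made here for illustration; the paper's own equations are not reproduced. S_α(scale, skewness, location) denotes an α-stable law with 0 < α < 2, and G and λ are independent.

```latex
X = \sqrt{\lambda}\, G, \qquad
G \sim \mathcal{N}\!\left(0,\; 2\gamma^{2}\right), \qquad
\lambda \sim S_{\alpha/2}\!\left( \left(\cos\tfrac{\pi\alpha}{4}\right)^{2/\alpha},\; 1,\; 0 \right)
\;\;\Longrightarrow\;\;
X \sim S_{\alpha}(\gamma,\, 0,\, 0),
\qquad
X \mid \lambda \;\sim\; \mathcal{N}\!\left(0,\; 2\gamma^{2}\lambda\right)
```

Conditional on the positive stable variable λ, the heavy-tailed variable X is Gaussian, which is what makes Gaussian-style posterior calculations tractable for this kind of model.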
Simulated data
To illustrate the performance of the statistic proposed in this paper we simulated a dataset containing genes and replicates. The non-expressed genes were simulated following an α-stable distribution in which the parameters were , , and . Thus, the samples are simulated from the following mathematical model (Eqs. (3.2), (3.6)):
These were typical values obtained in the analysis of the four gene
Simulated data
For each different value for the cutoff considered (20 different values from to ) we simulated 100 different datasets with the α-stable parameters given in Section 4.1. The stable statistic and were calculated for each dataset.
The histograms of the estimated values for each α-stable parameter are depicted in Fig. 4. Thus, we were able to compare the variance of the estimated values between different simulated datasets. It can be seen that the true values are estimated very
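A minimal sketch of how symmetric α-stable samples could be drawn for a simulation of this kind, using the Chambers-Mallows-Stuck method. The restriction to zero skewness, the function names and the parameter values are assumptions chosen for illustration; they are not the paper's settings, and this is not the authors' code.

```python
import math
import random


def sample_symmetric_stable(alpha, gamma=1.0, delta=0.0, rng=random):
    """Draw one symmetric alpha-stable variate (skewness 0).

    gamma is the scale and delta the location. Uses the
    Chambers-Mallows-Stuck transformation of a uniform angle
    and a unit-mean exponential variable.
    """
    if not 0.0 < alpha <= 2.0:
        raise ValueError("alpha must lie in (0, 2]")
    v = rng.uniform(-math.pi / 2.0, math.pi / 2.0)  # uniform angle
    w = rng.expovariate(1.0)                        # unit-mean exponential
    if alpha == 1.0:
        x = math.tan(v)                             # Cauchy special case
    else:
        x = (math.sin(alpha * v) / math.cos(v) ** (1.0 / alpha)
             * (math.cos((1.0 - alpha) * v) / w) ** ((1.0 - alpha) / alpha))
    return gamma * x + delta


def simulate_dataset(n_genes, n_reps, alpha, gamma, seed=None):
    """Return an n_genes x n_reps list of lists of symmetric stable draws."""
    rng = random.Random(seed)
    return [[sample_symmetric_stable(alpha, gamma, rng=rng)
             for _ in range(n_reps)]
            for _ in range(n_genes)]


def tail_fraction(data, threshold):
    """Fraction of all values whose absolute value exceeds threshold."""
    values = [x for row in data for x in row]
    return sum(abs(x) > threshold for x in values) / len(values)


if __name__ == "__main__":
    # Hypothetical sizes and parameters, for illustration only.
    heavy = simulate_dataset(5000, 4, alpha=1.5, gamma=1.0, seed=0)
    gauss = simulate_dataset(5000, 4, alpha=2.0, gamma=1.0, seed=0)
    print("tail fraction, alpha=1.5:", tail_fraction(heavy, 5.0))
    print("tail fraction, alpha=2.0:", tail_fraction(gauss, 5.0))
```

With α = 2 the sampler reduces to a Gaussian with variance 2γ², so comparing the two tail fractions gives a quick check that smaller α produces heavier tails.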
Conclusions
We put forward a new statistic to identify expressed genes in replicated microarray data. This statistic is based upon the properties of the α-stable distribution. An α-stable mixture model is introduced and the property of the scale mixture of normals is used to calculate the Bayes log-posterior odds. This procedure has allowed us to calculate the proposed statistic quite simply by using some known properties of the α-stable density. The proposed statistic was tested using both synthetic and
Acknowledgements
This work was partially supported by the project TEC2007-68030-C02-02/TCM of the MCyT of Spain, the PETRI project DENCLASES (PET2006-0253) of the Spanish MEC and the Excellence Projects TIC-02566 and TIC-03269 of the Consejería de Innovación Ciencia y Empresa (Junta de Andalucía, Spain). The first author did part of the work while at ISTI-CNR of Pisa. We thank A.L. Tate for revising our English text. We also thank the reviewers for their helpful comments and insights.
References (27)
- et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
- et al. A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics (2006)
- et al. Maximum-likelihood symmetric alpha-stable parameter estimation. IEEE Transactions on Signal Processing (1999)
- et al. Transcriptome analysis of Arabidopsis colonized by a plant-growth promoting rhizobacterium reveals a general effect on disease resistance. The Plant Journal (2003)
- et al. A method for simulating stable random variables. Journal of the American Statistical Association (1976)
- et al. A Bayesian mixture model for differential gene expression. Biostatistics (2005)
- Parameter estimation of stable distributions. Communications in Statistics - Theory and Methods (2006)
- et al. Bayesian regression analysis with scale mixture of normals. Econometric Theory (2000)
- Godsill, S., Kuruoglu, E.E., 1999. Bayesian inference for time series with heavy-tailed symmetric alpha stable noise...
- et al. Statistical analysis of microarray data: A Bayesian approach. Biostatistics (2003)