A heavy-tailed empirical Bayes method for replicated microarray data

https://doi.org/10.1016/j.csda.2008.08.008

Abstract

DNA microarrays are recognized as an important tool for studying the expression of thousands of genes simultaneously. These experiments allow us to compare two different samples of cDNA obtained under different conditions. We present a novel method for the analysis of replicated microarray experiments based on modelling the gene expression distribution as a mixture of α-stable distributions. Some features of the distribution of gene expression, such as Pareto tails and the fact that the variance of any given array increases with the number of genes studied, suggest modelling the gene expression distribution with an α-stable density. The proposed methodology uses well-known properties of the α-stable distribution, such as its representation as a scale mixture of normals. A Bayesian log-posterior odds is calculated, which allows us to decide whether or not a gene is differentially expressed. The proposed methodology is illustrated using simulated and experimental data, and the results are compared with other existing statistical approaches. The proposed heavy-tailed model outperforms models based on other distributions and is easily applicable to microarray gene data, especially if the dataset contains outliers or presents high variance between replicates.

Introduction

The DNA microarray is recognised as a powerful tool for studying the RNA expression levels of thousands of genes simultaneously under different conditions. These experiments compare two different samples of cDNA labelled with different dyes (red and green) and measure the intensity of fluorescence after hybridization. This method allows us to compare a large amount of information simultaneously in order to identify and quantify genes which are differentially expressed. Microarray experiments typically consist of intensity measurements of thousands of genes with few, if any, replicates for each gene.

After normalization, the gene expression distribution generally presents heavier tails than a Gaussian distribution. The gene expression distribution has been modelled using several densities: Cauchy (Khondoker et al., 2006), Pareto (Kuznetsov, 2001), Laplace (Purdom and Holmes, 2005), Student's t (Lonnstedt and Speed, 2002) and log-normal (Hoyle et al., 2002). Recently, we studied several different microarray datasets (Salas-Gonzalez et al., 2006b) and found that the α-stable distribution fits the gene expression distribution more satisfactorily.

Here, we derive a novel statistic that can be used to identify differential expression in microarray experiments. This statistic is based on an α-stable mixture model and the property of the scale mixture of normals. The well-known properties of the α-stable distribution allow us to calculate the parameters of the model. We introduce a Bayesian α-stable mixture model which treats the genes as coming from two different populations: differentially expressed or not differentially expressed. Bayesian mixture models of this kind are currently a very popular approach (Lonnstedt and Speed, 2002; Bhowmick et al., 2006; Newton et al., 2004; Do et al., 2005; Gottardo et al., 2003).

The paper is organised as follows: in Section 2 we describe the main properties of the α-stable distribution; in Section 3 we present a novel statistic based on the property of the scale mixture of normals; in Section 4 the statistic is tested with both synthetic and real microarray data; Section 5 contains the results; and some conclusions are drawn in Section 6.

Section snippets

Properties

The α-stable distribution is a family of distributions which presents heavy tails and a certain degree of asymmetry. Its properties are well understood and it has been used to model impulsive phenomena in many different fields, such as biology, electrical engineering, computer science, economics, physics and astronomy (Zolotarev, 1986, Nikias and Shao, 1995). Moreover, the α-stable distribution satisfies the stability property and the generalised central limit theorem. To a certain degree this

Statistical inference

In this section we introduce a novel statistic based upon the scale mixture of normals and the heavy-tailed α-stable distribution. Let $N$ be the number of genes in each array, $n$ the number of replicates (arrays), $i = 1, \dots, N$ and $j = 1, \dots, n$. We assume that the data $M_{ij}$ are base 2 logarithms of the ratio of the red dye intensity (denoted $R_{ij}$) to the green dye intensity ($G_{ij}$), suitably normalized using LOESS normalization (Yang et al., 2002). Therefore, $M_{ij} = \log_2(R_{ij}/G_{ij})$.
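As a minimal sketch of the data layout described above, the log-ratios can be computed from two N-by-n intensity matrices. The function name `log_ratios` and the toy intensities are hypothetical, and the LOESS normalization step is deliberately left out here.

```python
import numpy as np

def log_ratios(red, green):
    """Return M[i, j] = log2(R[i, j] / G[i, j]) for genes i and replicates j.

    Hypothetical helper: it assumes strictly positive intensities and does
    NOT perform the LOESS normalization mentioned in the text.
    """
    red = np.asarray(red, dtype=float)
    green = np.asarray(green, dtype=float)
    if red.shape != green.shape:
        raise ValueError("red and green must have the same shape (N, n)")
    return np.log2(red / green)

# Toy example with N = 2 genes and n = 2 replicates (made-up intensities).
R = [[8.0, 2.0], [1.0, 4.0]]
G = [[2.0, 2.0], [4.0, 4.0]]
M = log_ratios(R, G)
```

A positive entry of `M` means the red channel is brighter than the green one for that gene and replicate, and a negative entry means the opposite.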

Our aim is to ascertain which genes are differentially expressed. 

Simulated data

To illustrate the performance of the statistic $S$ proposed in this paper we simulated a dataset containing $N = 10000$ genes and $n = 4$ replicates. The non-expressed genes were simulated following an α-stable distribution in which the parameters were $\alpha = 1.8$, $\beta = 0$, $\sigma = 0.1$ and $\mu = 0$. Thus, the samples are simulated from the following mathematical model (Eqs. (3.2), (3.6)):

$$M_{ij} \mid \mu_i, \lambda_i, \sigma \sim N(0, \lambda_i \, 0.1^2) \quad \text{for } i = 1, \dots, N,$$

$$p(\lambda_i) = f_{1.8/2,\,1}\left(2\{\cos(1.8\pi/4)\}^{2/1.8},\, 0\right).$$
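The hierarchical model above can be sketched in a few lines of NumPy. This is an illustration under stated assumptions, not the authors' code: it assumes the mixing density is a positive stable law with index α/2 and skewness 1 in the usual S1 parameterization, and it draws that variable with Kanter's representation. The function names and the seed are hypothetical.

```python
import numpy as np

def sample_lambda(alpha, size, rng):
    """Draw the positive stable mixing variable lambda_i.

    Assumption: lambda has index alpha/2, skewness 1, location 0 and scale
    2 * cos(pi * alpha / 4) ** (2 / alpha) in the S1 parameterization, which
    equals twice a Kanter variable with Laplace transform exp(-s ** (alpha / 2)).
    """
    a = alpha / 2.0
    u = rng.uniform(0.0, np.pi, size)   # uniform angle on (0, pi)
    w = rng.exponential(1.0, size)      # unit-mean exponential
    kanter = (np.sin(a * u) / np.sin(u) ** (1.0 / a)) * (
        np.sin((1.0 - a) * u) / w
    ) ** ((1.0 - a) / a)
    return 2.0 * kanter

def simulate_null_genes(n_genes=10000, n_reps=4, alpha=1.8, sigma=0.1, seed=0):
    """Simulate M[i, j] | lambda_i ~ N(0, lambda_i * sigma**2)."""
    rng = np.random.default_rng(seed)
    lam = sample_lambda(alpha, n_genes, rng)       # one lambda per gene
    z = rng.standard_normal((n_genes, n_reps))     # independent normal draws
    return np.sqrt(lam)[:, None] * sigma * z, lam

M, lam = simulate_null_genes()
```

Under these assumptions each `M[i, j]` is marginally symmetric α-stable with scale σ, so the empirical mean of cos(t·M) should be close to exp(−|σt|^α), which gives a quick sanity check on the sampler.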

These were typical values obtained in the analysis of the four gene

Simulated data

For each different value for the cutoff w considered (20 different values from w=0.005 to w=0.1) we simulated 100 different datasets with the α-stable parameters given in Section 4.1. The stable statistic and B were calculated for each dataset.

The histograms of the estimated values for each α-stable parameter are depicted in Fig. 4. Thus, we were able to compare the variance of the estimated values between different simulated datasets. It can be seen that the true values are estimated very

Conclusions

We put forward a new statistic to identify expressed genes in replicated microarray data. This statistic is based upon the properties of the α-stable distribution. An α-stable mixture model is introduced and the property of the scale mixture of normals is used to calculate the Bayes log-posterior odds. This procedure has allowed us to calculate the proposed statistic quite simply by using some known properties of the α-stable density. The proposed statistic was tested using both synthetic and

Acknowledgements

This work was partially supported by the project TEC2007-68030-C02-02/TCM of the MCyT of Spain, the PETRI project DENCLASES (PET2006-0253) of the Spanish MEC and the Excellence Projects TIC-02566 and TIC-03269 of the Consejería de Innovación Ciencia y Empresa (Junta de Andalucía, Spain). The first author did part of the work while at ISTI-CNR of Pisa. We thank A.L. Tate for revising our English text. We also thank the reviewers for their helpful comments and insights.

References (27)

  • A.A. Alizadeh et al. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature (2000)
  • D. Bhowmick et al. A Laplace mixture model for identification of differential expression in microarray experiments. Biostatistics (2006)
  • J.S. Bodenschatz et al. Maximum-likelihood symmetric alpha-stable parameter estimation. IEEE Transactions on Signal Processing (1999)
  • F. Cartieaux et al. Transcriptome analysis of Arabidopsis colonized by a plant-growth promoting rhizobacterium reveals a general effect on disease resistance. The Plant Journal (2003)
  • J. Chambers et al. A method for simulating stable random variables. Journal of the American Statistical Association (1976)
  • K.-A. Do et al. A Bayesian mixture model for differential gene expression. Biostatistics (2005)
  • Z. Fan. Parameter estimation of stable distributions. Communications in Statistics - Theory and Methods (2006)
  • C. Fernandez et al. Bayesian regression analysis with scale mixture of normals. Econometric Theory (2000)
  • Godsill, S., Kuruoglu, E.E., 1999. Bayesian inference for time series with heavy-tailed symmetric alpha stable noise...
  • R. Gottardo et al. Statistical analysis of microarray data: A Bayesian approach. Biostatistics (2003)
  • D.C. Hoyle et al. Making sense of microarray data distributions. Bioinformatics (2002)
  • M.R. Khondoker et al. Statistical estimation of gene expression using multiple laser scans of microarrays. Bioinformatics (2006)
  • S.M. Kogon et al. Characteristic function based estimation of stable parameters
