Biomarker discovery: classification using pooled samples

A simulation study

  • Original Paper
Computational Statistics

Abstract

RNA-sample pooling is sometimes inevitable, but it should be avoided in classification tasks such as biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to show how strongly the outcomes of single sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, discriminating pattern, number of informative features and the statistical learning method used: support vector machines with linear and radial kernels, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA) and partial least squares discriminant analysis (PLS-DA). As measures of the pooling effect, we consider the prediction error (PE) and the coincidence of the feature sets identified as important for classification by PLS-DA, PPLS-DA and RF. In general, PPLS-DA and PLS-DA show a constant PE with increasing pool size and a low PE for patterns in which the convex hull of one class does not cover the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than for RF. RF shows the best results for patterns in which the convex hull of one class covers the other class, but these results depend strongly on the pool size. We complement the PE results with experimental data that we pool artificially. The PEs of PPLS-DA and PLS-DA are again low and least influenced by pooling. Additionally, we show under which assumption the PLS-DA loading weights, as a measure of the importance of features for classification, are equal across the different designs.
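The artificial pooling of experimental data mentioned above can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not state: samples are mixed on the intensity scale with proportions that sum to one (equal proportions by default), expression values are base-2 logarithms of intensities, and technical noise is ignored. The function names `pool_log_expression` and `make_pools` are hypothetical and are not taken from the paper.

```python
import numpy as np


def pool_log_expression(x_log, weights=None):
    """Combine m_p single-sample log2 profiles into one pooled profile.

    x_log   : array of shape (m_p, n_genes) with log2 expression values
    weights : proportion of each sample in the pool (assumed equal if None)
    """
    x_log = np.asarray(x_log, dtype=float)
    m_p = x_log.shape[0]
    if weights is None:
        weights = np.full(m_p, 1.0 / m_p)
    weights = np.asarray(weights, dtype=float)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights must sum to 1")
    u = 2.0 ** x_log          # assumption: back-transform to intensity scale
    u_pool = weights @ u      # assumption: mixing acts on intensities
    return np.log2(u_pool)    # pooled profile on the log scale


def make_pools(x_log, m_p, rng):
    """Randomly split the samples of one class into disjoint pools of size m_p.

    Samples that do not fill a complete pool are left out.
    """
    x_log = np.asarray(x_log, dtype=float)
    n_pools = x_log.shape[0] // m_p
    idx = rng.permutation(x_log.shape[0])[: n_pools * m_p]
    idx = idx.reshape(n_pools, m_p)
    return np.vstack([pool_log_expression(x_log[rows]) for rows in idx])


# Two samples, two genes: gene 0 differs between samples, gene 1 does not.
x = np.array([[1.0, 3.0],
              [3.0, 3.0]])
pooled = pool_log_expression(x)
```

Under these assumptions the pooled log value of a gene is never smaller than the mean of the single-sample log values (Jensen's inequality), which is one reason why a pool is not simply the average of its members on the log scale.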

Abbreviations

N :

Total number of available single samples (with subscript A or B for class A or B)

\({N_S}\) :

Number of samples used for the single sample arrays (with subscript A or B for class A or B)

\({N_{S_P}}\) :

Number of samples used for the pools (with subscript A or B for class A or B)

\({A_S}\) :

Number of arrays for the single sample design

\({A_P}\) :

Number of arrays for the pools

A :

Total number of arrays which can be financed

\({m_p}\) :

Pool size

\({N_P}\) :

Number of pools

\({u_{g,i}}\) :

Random variables on the scale of measured intensities for a microarray experiment for gene g and sample i

\({u_{g,p_j}}\) :

Random variables on the scale of measured intensities for a microarray experiment for gene g and pool \(p_j\)

\({\mu_g}\) :

Mean gene expression level of gene g

\({\sigma_{b}^{2}}\) :

Biological variance

\({X_{g,i}}\) :

Gene expression values on the log scale for gene g and sample i (with subscript A or B for class A or B)

\({X_{g,p_j}}\) :

Gene expression values on the log scale for gene g and pool \(p_j\) (with subscript A or B for class A or B)

\({w_i}\) :

Proportion of sample i in a pool

cov :

Covariance

corr :

Correlation

ISF :

Informative simulated feature(s)

LDA:

Linear discriminant analysis

PE(s):

Prediction error(s)

PLS-DA:

Partial least squares discriminant analysis

PPLS-DA:

Powered partial least squares discriminant analysis

RF:

Random forest

sd :

Standard deviation

SVM:

Support vector machines

SVML:

Support vector machines with linear kernel

SVMR:

Support vector machines with radial kernel

\({D_{\rm sim}}\) :

Informative simulated feature set

\({D_{m_p}^{\rm M}}\) :

Important features for classification with method M in a design with pool size \(m_p\)

\({I_{1}^{\rm M}}\) :

\({= D_{1}^{\rm M} \cap D_{\rm sim}^{\rm M}}\)

\({I_{1:m_p}^{\rm M}}\) :

\({= I_{1}^{\rm M} \cap D_{m_p}^{\rm M}}\): for method M, the important informative simulated features that coincide in the single sample design and in a design with pool size \(m_p\)

\({X^{t}}\) :

Transposed matrix of X

|I|:

Cardinality of I

abs(a):

Absolute value of the real number a
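The feature-set notation above can be made concrete with a short Python sketch. It only restates the two intersections from the list for a single method M. The gene identifiers are made-up placeholders, the function name `coincidence_sets` is hypothetical, and the final ratio is just one possible way to summarise the coincidence, not necessarily the measure used in the paper.

```python
def coincidence_sets(d_sim, d_single, d_pooled):
    """Return (I_1, I_{1:m_p}) for one classification method.

    d_sim    : informative simulated features
    d_single : features found important in the single sample design (pool size 1)
    d_pooled : features found important in a design with pool size m_p
    """
    i_1 = set(d_single) & set(d_sim)   # important AND truly informative
    i_1_mp = i_1 & set(d_pooled)       # ... and still important after pooling
    return i_1, i_1_mp


# Placeholder feature sets for illustration only.
d_sim = {"g1", "g2", "g3", "g4"}
d_single = {"g1", "g2", "g3", "g9"}
d_pooled = {"g2", "g3", "g7"}

i_1, i_1_mp = coincidence_sets(d_sim, d_single, d_pooled)

# Assumed summary: share of I_1 that is recovered in the pooled design.
ratio = len(i_1_mp) / len(i_1) if i_1 else float("nan")
```

Here `len(...)` plays the role of the cardinality |I| from the list above.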

Author information

Corresponding author

Correspondence to Gerd Nürnberg.

About this article

Cite this article

Telaar, A., Repsilber, D. & Nürnberg, G. Biomarker discovery: classification using pooled samples. Comput Stat 28, 67–106 (2013). https://doi.org/10.1007/s00180-011-0302-0

  • DOI: https://doi.org/10.1007/s00180-011-0302-0
