Biomarker discovery: classification using pooled samples

Telaar, Anna; Repsilber, Dirk; Nürnberg, Gerd

doi:10.1007/s00180-011-0302-0

A simulation study

Original Paper
Published: 19 January 2012

Computational Statistics, Volume 28, pages 67–106 (2013)

Abstract

RNA-sample pooling is sometimes inevitable, but should be avoided in classification tasks like biomarker studies. Our simulation framework investigates a two-class classification study based on gene expression profiles to show how strongly the outcomes of single-sample designs differ from those of pooling designs. The results show how the effects of pooling depend on pool size, discriminating pattern, number of informative features and the statistical learning method used: support vector machines with linear and radial kernel, random forest (RF), linear discriminant analysis, powered partial least squares discriminant analysis (PPLS-DA) and partial least squares discriminant analysis (PLS-DA). As measures of the pooling effect, we consider the prediction error (PE) and the coincidence of the feature sets found to be important for classification by PLS-DA, PPLS-DA and RF. In general, PPLS-DA and PLS-DA show constant PE with increasing pool size and low PE for patterns for which the convex hull of one class is not a cover of the other class. The coincidence of important feature sets is larger for PLS-DA and PPLS-DA than for RF. RF shows the best results for patterns in which the convex hull of one class is a cover of the other class, but these results depend strongly on the pool size. We complement the PE results with experimental data which we pool artificially. The PEs of PPLS-DA and PLS-DA are again low and least influenced by pooling. Additionally, we show under which assumption the PLS-DA loading weights, as a measure of the importance of features for classification, are equal for the different designs.

Abbreviations

\(N\) :: Total number of available single samples (with subscript A or B for class A or B)
\(N_S\) :: Number of samples used for the single sample arrays (with subscript A or B for class A or B)
\(N_{S_P}\) :: Number of samples used for the pools (with subscript A or B for class A or B)
\(A_S\) :: Number of arrays for the single sample design
\(A_P\) :: Number of arrays for the pools
\(A\) :: Total number of arrays which can be financed
\(m_p\) :: Pool size
\(N_P\) :: Number of pools
\(u_{g,i}\) :: Random variables on the scale of measured intensities for a microarray experiment for gene \(g\) and sample \(i\)
\(u_{g,p_j}\) :: Random variables on the scale of measured intensities for a microarray experiment for gene \(g\) and pool \(p_j\)
\(\mu_g\) :: Mean gene expression level of gene \(g\)
\(\sigma_{b}^{2}\) :: Biological variance
\(X_{g,i}\) :: Gene expression values on the log scale for gene \(g\) and sample \(i\) (with subscript A or B for class A or B)
\(X_{g,p_j}\) :: Gene expression values on the log scale for gene \(g\) and pool \(p_j\) (with subscript A or B for class A or B)
\(w_i\) :: Proportion of sample \(i\) in a pool
cov :: Covariance
corr :: Correlation
ISF :: Informative simulated feature(s)
LDA :: Linear discriminant analysis
PE(s) :: Prediction error(s)
PLS-DA :: Partial least squares discriminant analysis
PPLS-DA :: Powered partial least squares discriminant analysis
RF :: Random forest
sd :: Standard deviation
SVM :: Support vector machines
SVML :: Support vector machines with linear kernel
SVMR :: Support vector machines with radial kernel
\(D_{\rm sim}\) :: Informative simulated feature set
\(D_{m_p}^{\rm M}\) :: Important features for classification with method M in a design with pool size \(m_p\)
\(I_{1}^{\rm M}\) :: \(= D_{1}^{\rm M} \cap D_{\rm sim}^{\rm M}\)
\(I_{1:m_p}^{\rm M}\) :: \(= I_{1}^{\rm M} \cap D_{m_p}^{\rm M}\); for method M, the important informative simulated features which coincide in the single sample design and in a design with pool size \(m_p\)
\(X^t\) :: Transposed matrix of \(X\)
\(|I|\) :: Cardinality of \(I\)
abs(a) :: Absolute value of the real number \(a\)
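
The notation above separates the intensity scale (u) from the log scale (X) and gives each sample a proportion w_i within its pool. The paper's exact pooling model is not reproduced on this page, so the following Python sketch only illustrates one common way to pool single-sample data artificially. It rests on assumptions that are not taken from the source: a pool's intensity is the weighted mean of its members' intensities, the weights sum to 1, the log is base 2, and the function name `pool_log_expression` is hypothetical.

```python
import numpy as np


def pool_log_expression(u, pool_size, weights=None):
    """Pool columns of an intensity matrix u (genes x samples); return log2 values.

    Assumption (not from the source): the intensity of a pool is the weighted
    mean of the raw intensities of its members, with weights w_i summing to 1.
    Consecutive columns form a pool; leftover samples are dropped.
    """
    u = np.asarray(u, dtype=float)
    n_genes, n_samples = u.shape
    n_pools = n_samples // pool_size
    if weights is None:
        # Equal proportions w_i = 1 / m_p for every sample in a pool.
        weights = np.full(pool_size, 1.0 / pool_size)
    weights = np.asarray(weights, dtype=float)
    # Group samples into pools: genes x pools x pool members.
    grouped = u[:, : n_pools * pool_size].reshape(n_genes, n_pools, pool_size)
    pooled_intensity = grouped @ weights  # weighted mean per gene and pool
    return np.log2(pooled_intensity)


# Illustrative data only: 5 genes, 12 samples, simulated on the log2 scale.
rng = np.random.default_rng(0)
x_single = rng.normal(loc=8.0, scale=1.0, size=(5, 12))
u_single = 2.0 ** x_single  # back to the intensity scale

x_pooled = pool_log_expression(u_single, pool_size=3)
mean_of_logs = x_single.reshape(5, 4, 3).mean(axis=2)

# Largest gap between the log of the pooled intensity and the mean of the logs.
print(float((x_pooled - mean_of_logs).max()))
```

Under these assumptions the log of a pooled intensity is never smaller than the mean of the single-sample log values (Jensen's inequality for the concave logarithm), so pooling on the intensity scale is not the same as averaging log-scale expression values.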

Author information

Authors and Affiliations

Leibniz Institute for Farm Animal Biology, Wilhelm-Stahl-Allee 2, 18196, Dummerstorf, Germany
Anna Telaar, Dirk Repsilber & Gerd Nürnberg

Corresponding author

Correspondence to Gerd Nürnberg.

About this article

Cite this article

Telaar, A., Repsilber, D. & Nürnberg, G. Biomarker discovery: classification using pooled samples. Comput Stat 28, 67–106 (2013). https://doi.org/10.1007/s00180-011-0302-0

Received: 06 October 2010
Accepted: 29 December 2011
Published: 19 January 2012
Issue Date: February 2013
DOI: https://doi.org/10.1007/s00180-011-0302-0
