Abstract
Large omics datasets often contain unwanted latent variability, and properly controlling these so-called batch effects is a critical analysis step. However, most batch effect-correction algorithms (BECAs) face limitations when the source of unwanted variation and the variable of interest are confounded. In this paper, we use RNA-seq data to study the effects of radiation contamination on tree frogs (Hyla orientalis) collected in the Chornobyl Exclusion Zone. We identify the site of collection of the frogs as a confounding factor in the transcriptomics analysis. We present our strategy to correct this confounding effect using the following BECAs: ComBat-seq, linear residualization, and Surrogate Variable Analysis. We show that the severe confounding between the site and radiocontamination level makes the correction step challenging. Instead, we investigate the site-to-site variability and successfully deconvolute the batch variable from the radiation level by adjusting for the population genetic structure. Our strategy allowed us to reveal the effects of low-dose radiation on the gene expression of Chornobyl tree frogs and appropriately preprocess the RNA-seq dataset for future multimodal integrative analyses.
Notes
1. We use the official spelling Chornobyl, in accordance with the romanization of Ukrainian geographical names recommended by the 10th United Nations Conference on the Standardization of Geographical Names (see https://mfa.gov.ua/en/correctua).
Acknowledgments
EG and CC are supported by PhD grants funded by the French Institute for Radiation Protection and Nuclear Safety (IRSN). S. Gashchack, Y. Gulyaichenko, G. Orizaola, and P. Burraco helped in the field; S. Gashchack also helped with measurements of radioactive contamination in tree frogs.
A Appendix: Close-Up on the Impact of Sparsity in Gene Selection
To assess how forcing sparsity in PCA influenced the selected gene list, we repeated the same procedure with standard PCA. We performed PCA on 1000 bootstrap samples of the non-corrected (raw) variance-stabilized matrix. In each model, genes were ranked by the absolute value of their weight, and the top 400 genes were selected from components 1 and 2. Genes stably selected across bootstrap iterations in components 1, 2, or both were submitted to gene functional annotation, as mentioned previously.
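The bootstrap procedure described above can be sketched roughly as follows. This is a minimal illustration, not the authors' code: the function names, the use of an SVD for PCA, and the choice to take the top genes from each component separately and pool them are all assumptions. The rule used to call a gene "stably selected" is not stated here, so the sketch only returns selection frequencies.

```python
import numpy as np

def top_genes_per_component(X, n_components=2, n_top=400):
    """Indices of the genes with the largest absolute PCA weights.

    X is a samples-by-genes matrix. Assumption: the top n_top genes are
    taken from each component separately and then pooled.
    """
    Xc = X - X.mean(axis=0)  # center each gene (column)
    # Rows of vt are the PCA weight (loading) vectors.
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    selected = set()
    for weights in vt[:n_components]:
        order = np.argsort(-np.abs(weights))[:n_top]
        selected.update(order.tolist())
    return selected

def bootstrap_selection_frequency(X, n_boot=1000, n_components=2,
                                  n_top=400, seed=0):
    """Fraction of bootstrap samples in which each gene is selected."""
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    counts = np.zeros(n_genes)
    for _ in range(n_boot):
        # Resample individuals (rows) with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        for gene in top_genes_per_component(X[idx], n_components, n_top):
            counts[gene] += 1
    return counts / n_boot
```

Genes could then be kept when their frequency exceeds a chosen cut-off, for example `np.flatnonzero(freq >= 0.8)`, where 0.8 is a hypothetical threshold used only for illustration.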
Table 4 shows that the genes selected using sparse PCA (sPCA) were more stable across bootstrap samples than those selected using standard PCA. As a result, more deregulated pathways were identified in the uncorrected dataset with sPCA than with PCA. In Fig. 5, we notice that the alteration of biological processes related to energy metabolism (GO terms “oxidative phosphorylation” or “energy derivation by oxidation of organic compounds”) was recovered with sPCA and not with PCA. The identification of pathways typically linked with low-dose radiation, despite the presence of batch effects, suggests that sparsity in the PCA weight vectors mitigated the influence of noise.
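To give an intuition for why sparse weight vectors can be less sensitive to noise, the sketch below computes a sparse first component by soft-thresholded power iteration. This is one common way to obtain sparse PCA weights and is an assumption here: the exact sparse PCA algorithm and penalty used in the study are not specified in this appendix, and the function names and the fixed penalty `lam` are hypothetical.

```python
import numpy as np

def soft_threshold(a, lam):
    """Shrink each entry toward zero by lam; entries below lam become 0."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def sparse_first_component(X, lam, n_iter=100):
    """Unit-norm sparse weight vector for the first component.

    X is a samples-by-genes matrix; lam is a fixed soft-threshold
    penalty (hypothetical tuning parameter).
    """
    Xc = X - X.mean(axis=0)  # center each gene (column)
    # Start from the ordinary first PCA weight vector.
    v = np.linalg.svd(Xc, full_matrices=False)[2][0]
    for _ in range(n_iter):
        u = Xc @ v  # sample scores for the current weights
        norm_u = np.linalg.norm(u)
        if norm_u == 0:
            break
        u = u / norm_u
        # Genes weakly associated with the scores get a weight of exactly 0.
        v_new = soft_threshold(Xc.T @ u, lam)
        norm_v = np.linalg.norm(v_new)
        if norm_v == 0:
            return np.zeros_like(v)  # penalty too strong: nothing selected
        v = v_new / norm_v
    return v
```

With standard PCA every gene receives a nonzero weight, so the ranking of weakly contributing genes can change from one bootstrap sample to the next. With a thresholded weight vector, those genes are set to zero and drop out of the ranking.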
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Goujon, E., Armant, O., Car, C., Bonzom, JM., Tenenhaus, A., Garali, I. (2024). Batch Effect Correction in a Confounded Scenario: a Case Study on Gene Expression of Chornobyl Tree Frogs. In: Gori, R., Milazzo, P., Tribastone, M. (eds) Computational Methods in Systems Biology. CMSB 2024. Lecture Notes in Computer Science, vol 14971. Springer, Cham. https://doi.org/10.1007/978-3-031-71671-3_8
DOI: https://doi.org/10.1007/978-3-031-71671-3_8
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-71670-6
Online ISBN: 978-3-031-71671-3
eBook Packages: Computer Science (R0)