Batch Effect Correction in a Confounded Scenario: a Case Study on Gene Expression of Chornobyl Tree Frogs

  • Conference paper
Computational Methods in Systems Biology (CMSB 2024)

Abstract

When large omics datasets present unwanted latent variability, a critical analysis step is to control these so-called batch effects properly. However, most batch effect-correction algorithms (BECAs) face limitations when the source of unwanted variation and the variable of interest are confounded. In this paper, we use RNA-seq data to study the effects of radiation contamination on tree frogs (Hyla orientalis) collected in the Chornobyl Exclusion Zone. We identify the site of collection of the frogs as a confounding factor in the transcriptomics analysis. We present our strategy to correct this confounding effect using the following BECAs: ComBat-seq, linear residualization, and Surrogate Variable Analysis. We show that the severe confounding between the site and radiocontamination level makes the correction step challenging. Instead, we investigate the site-to-site variability and successfully deconvolute the batch variable from the radiation level by adjusting for the population genetic structure. Our strategy allowed us to reveal the effects of low-dose radiation on the gene expression of Chornobyl tree frogs and appropriately preprocess the RNA-seq dataset for future multimodal integrative analyses.

Notes

  1. We use the official spelling Chornobyl, in accordance with the romanization of Ukrainian geographical names recommended by the 10th United Nations Conference on the Standardization of Geographical Names (see https://mfa.gov.ua/en/correctua).

Acknowledgments

EG and CC are supported by PhD grants funded by the French Institute for Radiation Protection and Nuclear Safety (IRSN). S. Gashchack, Y. Gulyaichenko, G. Orizaola, and P. Burraco helped in the field; S. Gashchack also helped measure radioactive contamination in tree frogs.

Author information

Corresponding author

Correspondence to Imène Garali.

A Appendix: Close-Up on the Impact of Sparsity in Gene Selection

To assess how forcing sparsity in PCA influenced the selected gene list, we also ran a similar approach using standard PCA. We performed PCA on 1000 bootstrap samples of the non-corrected (raw) variance-stabilized matrix. In each model, genes were ranked by the absolute value of their weight and the top 400 genes were selected from components 1 and 2. Genes stably selected across bootstrap iterations in components 1, 2, or both were submitted to gene functional annotation, as mentioned previously.
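The bootstrap selection described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the paper: the function name `bootstrap_pca_selection`, the SVD-based PCA, and in particular the stability cutoff `min_freq` are hypothetical, since the criterion for "stably selected" is not specified in this appendix. Samples are resampled with replacement, genes are ranked by absolute weight on each of the first two components, and a gene counts as selected in an iteration if it is in the top list of component 1, component 2, or both.

```python
import numpy as np


def bootstrap_pca_selection(X, n_boot=1000, top_k=400, n_comp=2,
                            min_freq=0.8, seed=0):
    """Bootstrap gene selection from standard PCA weights.

    X is a samples x genes matrix (for example a variance-stabilized
    expression matrix). Returns the per-gene selection frequency and a
    boolean mask of genes whose frequency reaches min_freq.
    min_freq is an assumed cutoff, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    n_samples, n_genes = X.shape
    counts = np.zeros(n_genes)

    for _ in range(n_boot):
        # Resample the samples (rows) with replacement.
        idx = rng.integers(0, n_samples, size=n_samples)
        Xb = X[idx]
        # Center each gene before PCA.
        Xb = Xb - Xb.mean(axis=0)
        # Rows of Vt are the PCA weight vectors (one per component).
        _, _, Vt = np.linalg.svd(Xb, full_matrices=False)

        selected = np.zeros(n_genes, dtype=bool)
        for comp in range(n_comp):
            # Rank genes by absolute weight and keep the top_k.
            top = np.argsort(-np.abs(Vt[comp]))[:top_k]
            selected[top] = True
        counts += selected

    freq = counts / n_boot
    return freq, freq >= min_freq


# Toy usage on synthetic data: 30 samples, 200 genes, where the first
# 10 genes share a strong latent signal.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(30, 200))
X_demo[:, :10] += 5.0 * rng.normal(size=(30, 1))
freq_demo, stable_demo = bootstrap_pca_selection(
    X_demo, n_boot=50, top_k=10)
```

With the paper's settings one would call the function with `n_boot=1000` and `top_k=400` on the variance-stabilized matrix; the smaller values in the toy call only keep the example fast.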

Table 4 shows that the genes selected using sparse PCA were more stable across bootstrap samples than those selected using standard PCA. As a result, sparse PCA identified a larger number of deregulated pathways in the uncorrected dataset than standard PCA. In Fig. 5, we notice that the alteration of biological processes related to energy metabolism (GO terms “oxidative phosphorylation” or “energy derivation by oxidation of organic compounds”) was recovered with sparse PCA and not with standard PCA. The identification of pathways typically linked with low-dose radiation, despite the presence of batch effects, suggests that sparsity in the PCA weight vectors mitigated the influence of noise.

Table 4. Feature selection approaches and Gene Ontology terms enrichment

Fig. 5. Enriched Gene Ontology-terms network after feature selection by PCA or sparse PCA on the uncorrected count matrices

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Goujon, E., Armant, O., Car, C., Bonzom, JM., Tenenhaus, A., Garali, I. (2024). Batch Effect Correction in a Confounded Scenario: a Case Study on Gene Expression of Chornobyl Tree Frogs. In: Gori, R., Milazzo, P., Tribastone, M. (eds) Computational Methods in Systems Biology. CMSB 2024. Lecture Notes in Computer Science, vol 14971. Springer, Cham. https://doi.org/10.1007/978-3-031-71671-3_8

  • DOI: https://doi.org/10.1007/978-3-031-71671-3_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-71670-6

  • Online ISBN: 978-3-031-71671-3

  • eBook Packages: Computer Science, Computer Science (R0)
