Abstract
This article provides a detailed description, analysis, and visualization of a case–control genome-wide genotypic dataset from the North American Rheumatoid Arthritis Consortium (NARAC). The dataset is summarized by the number of females and males among cases and controls and by the percentage of missing data. Alleles and genotypes are counted, and the minor allele frequency (MAF) is calculated for each single nucleotide polymorphism (SNP). The SNPs are then classified into four categories based on their MAF: very rare, rare, low frequency, and common. The chromosomal regions of the SNPs in each category are examined to determine the proportion of SNPs in coding and other regions. Each category shows a different proportion in each region of consequence annotation. The composition of the data in terms of alleles and genotypes is found to be highly disproportionate. The results give a clear picture of the data and its MAF, which can be compared with other datasets. These findings can help researchers gain a comprehensive and accurate understanding of such case–control datasets.
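The per-SNP steps described above (counting alleles and genotypes, measuring missingness, computing the MAF, and assigning a frequency category) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The genotype encoding (two-character strings with `0` as the missing-allele code), the function names, and in particular the category cut-offs are assumptions made for illustration, since the abstract names the four categories but does not state their boundaries.

```python
from collections import Counter

# Assumed missing-allele code (PLINK-style); the abstract does not specify it.
MISSING = "0"

# Hypothetical upper bounds for each category; the abstract gives no cut-offs.
THRESHOLDS = (("very rare", 0.005), ("rare", 0.01), ("low frequency", 0.05))


def summarize_snp(genotypes):
    """Count alleles and genotypes for one SNP; return counts, MAF, missing rate.

    `genotypes` is a list of two-character strings such as "AG" (assumed format).
    """
    # Keep only genotypes in which both alleles were called.
    called = [g for g in genotypes if MISSING not in g]
    # Treat "AG" and "GA" as the same unphased genotype.
    genotype_counts = Counter("".join(sorted(g)) for g in called)
    allele_counts = Counter(allele for g in called for allele in g)

    total_alleles = sum(allele_counts.values())
    if total_alleles == 0:
        maf = None  # no called genotypes, MAF undefined
    elif len(allele_counts) < 2:
        maf = 0.0  # monomorphic SNP
    else:
        # Frequency of the second most common allele.
        second = sorted(allele_counts.values(), reverse=True)[1]
        maf = second / total_alleles

    missing_rate = 1 - len(called) / len(genotypes) if genotypes else None
    return {
        "genotype_counts": dict(genotype_counts),
        "allele_counts": dict(allele_counts),
        "maf": maf,
        "missing_rate": missing_rate,
    }


def classify_maf(maf):
    """Map a MAF value to one of the four categories using the assumed cut-offs."""
    for label, upper_bound in THRESHOLDS:
        if maf < upper_bound:
            return label
    return "common"


if __name__ == "__main__":
    example = ["AA", "AG", "GA", "GG", "AA", "00", "AA", "AA"]
    summary = summarize_snp(example)
    print(summary, classify_maf(summary["maf"]))
```

Keeping the cut-offs in a single `THRESHOLDS` tuple makes it easy to substitute whichever boundaries a given study actually uses.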
Data availability
Due to the subject confidentiality agreement, the data used during the current study are not publicly accessible but are available upon reasonable request from the first author.
Acknowledgements
The authors would like to acknowledge the Genetic Analysis Workshop Grant [R01 GM031575] for providing the NARAC dataset. This work was made possible by funds from the National Institutes of Health [NO1-AR-2-2263 and RO1-AR-44422] and the National Arthritis Foundation (NAF).
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Author information
Contributions
Conceptualization: MNS, AMS and HFAH. Data curation: FSI and MNS. Formal analysis: FSI and MNS. Investigation: FSI and MNS. Methodology: MNS, AMS and HFAH. Resources: MNS. Software: FSI and MNS. Supervision: MNS, AMS and HFAH. Validation: GWZ and MNS. Visualization: FSI. Writing, original draft: GWZ, FSI and MNS. Writing, review, and editing: GWZ, MNS, AMS and HFAH. All authors reviewed the manuscript.
Ethics declarations
Conflict of interest
The authors declare that they have neither affiliations nor involvement in any organization or entity that has a financial stake in the subject matter or materials discussed in this manuscript.
Ethical approval
Not applicable.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Saad, M.N., Zareef, G.W., Ibrahim, F.S. et al. Genome-wide exploratory analysis for NARAC dataset with preparation for haplotype block partitioning through minor allele frequency quality control viewpoint. Iran J Comput Sci 6, 387–396 (2023). https://doi.org/10.1007/s42044-023-00147-8
DOI: https://doi.org/10.1007/s42044-023-00147-8