Original Research
Comparative evaluation of the heterozygous variant standard deviation as a quality measure for next-generation sequencing

https://doi.org/10.1016/j.jbi.2022.104234
Under a Creative Commons license
open access

Highlights

  • A need for additional next-generation sequencing quality checks is identified.

  • Standard deviation of heterozygous allele frequencies is a crucial quality metric.

  • Variant allele frequencies can be modeled from coverage using the gamma distribution.

  • Once coverage is already high, increasing it further yields a negligible gain in quality.
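
The diminishing return from additional coverage can be illustrated with an idealized binomial sampling argument. This is an assumption made here for illustration, not the gamma-based model named in the highlights. DP denotes the depth of coverage at a heterozygous site and K the number of reads supporting the variant allele; read errors, mapping bias, and copy-number alterations are ignored.

```latex
K \sim \mathrm{Binomial}\!\left(DP, \tfrac{1}{2}\right), \qquad
\mathrm{VAF} = \frac{K}{DP}, \qquad
\sigma_{\mathrm{VAF}}
  = \sqrt{\frac{\tfrac{1}{2}\left(1 - \tfrac{1}{2}\right)}{DP}}
  = \frac{1}{2\sqrt{DP}}
```

Under this idealization, raising DP from 25 to 100 lowers σ from 0.10 to 0.05, whereas a further fourfold increase from 100 to 400 only lowers it from 0.05 to 0.025. Each quadrupling of coverage halves σ, so the absolute gain shrinks as coverage grows.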

Abstract

Next-generation sequencing offers unprecedented throughput in terms of informational content relative to cost. The technology has entered laboratory diagnostics and offers flexible workflows in biomedical research. However, the rapid acquisition of genomic data also gives rise to a substantial fraction of sequencing artifacts, which can lead to false-positive germline variant calls or erroneous somatic mutations. Consequently, there is a pressing need for efficient and practical quality assessment in sequencing projects. In this study, we investigate the standard deviation (σ) of the heterozygous variant allele frequency (VAF) as a supplementary quality control measure. Whereas several proposed quality metrics are based on empirical assessments, the dispersion of allele frequencies directly reflects an inherent, discrete feature of a diploid genome: at a heterozygous site, the two homologous chromosomes carry different alleles, so the expected VAF is approximately 1/2. Based on a meta-analysis of 152 whole-exome sequencing data sets, we found that σ reflects both sequencing coverage and noise and can be effectively modeled. We conclude that the relative comparison of heterozygous VAF σ provides a practical handle for quality assessment, even for samples afflicted with copy-number alterations. The approach can be applied to whole-exome, whole-genome, or targeted panel sequencing and helps identify problematic samples, such as those retrieved from archived formalin-fixed paraffin-embedded tissue.
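
As a rough illustration of the metric described in the abstract, the sketch below computes the standard deviation of heterozygous VAFs from per-site read counts and compares it with an idealized binomial expectation. It is not the authors' implementation: the function names `het_vaf_sd` and `binomial_expected_sd`, the `(ref_reads, alt_reads)` input format, and the binomial baseline are all assumptions made for this example, and the input is assumed to contain only sites already called heterozygous.

```python
import math
import statistics


def het_vaf_sd(sites):
    """Sample standard deviation of VAF over heterozygous sites.

    `sites` is an iterable of (ref_reads, alt_reads) pairs (hypothetical
    input format); sites without any reads are skipped.
    """
    vafs = [alt / (ref + alt) for ref, alt in sites if ref + alt > 0]
    # statistics.stdev needs at least two values and raises otherwise.
    return statistics.stdev(vafs)


def binomial_expected_sd(sites):
    """Idealized SD if every site were a fair binomial draw around 0.5.

    Assumption for illustration only: per-site variance is 0.25 / depth,
    averaged over all covered sites.
    """
    depths = [ref + alt for ref, alt in sites if ref + alt > 0]
    mean_variance = sum(0.25 / d for d in depths) / len(depths)
    return math.sqrt(mean_variance)


if __name__ == "__main__":
    # Made-up read counts, not data from the study.
    example_sites = [(10, 10), (12, 8), (8, 12), (15, 5)]
    observed = het_vaf_sd(example_sites)
    expected = binomial_expected_sd(example_sites)
    print(f"observed sd = {observed:.4f}, binomial baseline = {expected:.4f}")
```

An observed value well above the baseline at comparable coverage would hint at noise beyond sampling variation, which is the kind of relative comparison the abstract proposes. Any threshold for flagging a sample would have to come from reference data rather than from this sketch.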

Abbreviations

CNA
Copy-number alterations
DP
Depth of coverage
FFPE
Formalin-fixed, paraffin-embedded
Het/Hom
Heterozygous to homozygous variants ratio
MCL
Mantle cell lymphoma
NGS
Next-generation sequencing
PCR
Polymerase chain reaction
Ti/Tv
Ratio of transitions to transversions
VAF
Variant allele frequency
WES
Whole-exome sequencing
WGS
Whole-genome sequencing
σhet
Standard deviation of heterozygous variant allele frequencies
σ'het
Trimmed standard deviation of heterozygous variant allele frequencies

Data availability

All data for this study are available online through the Sequence Read Archive (SRA; NCBI, NIH, Bethesda, Maryland, USA) under the following IDs: PRJNA489753, PRJNA755885, and a control subset of samples included in the 1000 Genomes Project (ERR031846, ERR031848, ERR031972, ERR034564, SRR070503, SRR070517, SRR070534, SRR070790, SRR070795, SRR077391, SRR099958, SRR099960, SRR099968, SRR765993, SRR769545, SRR101000). The data sets for the mantle cell lymphoma relapse case are available in the online supplement (VCF and BED coverage files) and can be visualized with CNAplot [26].
