Theoretical Foundation of the Performance of Phylogeny-Based Somatic Variant Detection

Moriyama, Takuya; Imoto, Seiya; Miyano, Satoru; Yamaguchi, Rui

doi:10.1007/978-3-030-64511-3_9

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12508)

Included in the following conference series:

International Symposium on Mathematical and Computational Oncology

Abstract

We study the performance of a variant detection method that is based on a property of tumor phylogenetic trees. Our major contributions are twofold. First, we show the property itself: the total number of distinct mutation patterns is restricted if a multi-regional mutation profile follows a corresponding tumor phylogenetic tree, where a multi-regional mutation profile is a matrix that lists the predictions of somatic mutations at the corresponding tumor regions. Second, we evaluate the lower and upper bounds of the specificity and sensitivity of a phylogeny-based somatic variant detection method under several situations. In the evaluation, we conduct patient-wise variant detection from a noisy multi-regional mutation profile matrix for some genomic positions by utilizing the phylogenetic property; we assume that the phylogenetic information can be extracted from another mutation profile matrix that contains accurate candidates at genomic positions different from the noisy ones. From the evaluation, we find that higher sensitivity is not guaranteed in phylogeny-based variant detection, but higher specificity is guaranteed in several cases. These findings indicate that tumor phylogeny offers more benefit for variant detection based on error-prone long-read sequencers (e.g., Oxford Nanopore sequencers) than for variant detection based on accurate short-read sequencers (e.g., Illumina sequencers).

Acknowledgements

This work has been supported by the Grant-in-Aid for JSPS Research Fellow (17J08884) and MEXT/JSPS KAKENHI Grant (15H05912, hp180198, hp170227, 18H03329, hp190158).

Author information

Authors and Affiliations

Human Genome Center, The Institute of Medical Science, The University of Tokyo, Tokyo, Japan
Takuya Moriyama, Seiya Imoto & Rui Yamaguchi
M&D Data Science Center, Tokyo Medical and Dental University, Tokyo, Japan
Satoru Miyano
Division of Cancer Systems Biology, Aichi Cancer Center Research Institute, Nagoya, Japan
Rui Yamaguchi
Department of Cancer Informatics, Nagoya University Graduate School of Medicine, Nagoya, Japan
Rui Yamaguchi

Corresponding author

Correspondence to Rui Yamaguchi.

Editor information

Editors and Affiliations

University of Nevada Reno, Reno, NV, USA
George Bebis
George Washington University, Ashburn, VA, USA
Max Alekseyev
University of California, Riverside, Riverside, CA, USA
Heyrim Cho
The College of New Jersey, Ewing, NJ, USA
Jana Gevertz
IBM Research - Zurich, Rüschlikon, Switzerland
Maria Rodriguez Martinez

Appendix

Proof

(Proof of Lemma 2) T has a phylogenetic tree, so we can choose one such tree $\mathcal {T}$. We can assume $|F_{\mathcal {T}}| \le c$ by removing every leaf of $\mathcal {T}$ to which no cell corresponds under $f:R \rightarrow F_{\mathcal {T}}$. We can also assume that the root node has only one outgoing edge; if it does not, we add a new root node and connect it to the previous root node.

For the last condition, we remove two types of internal nodes: i) internal nodes with only one outgoing edge, and ii) internal nodes with more than two outgoing edges. It suffices to show that the nodes satisfying i) or ii) can be removed from $\mathcal {T}$ while conditions a)-c) are preserved.

For i), remove the node as in Fig. 6a; it is easy to check that a)-c) still hold after this operation. For ii), remove the node as in Fig. 6b; if the number of outgoing edges is more than three, apply this operation recursively. Again, a)-c) still hold after these operations. Because the operations pictured in Fig. 6a and 6b only decrease the number of nodes of types i) and ii), repeating them removes all such nodes. $\square $
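The two node-removal operations can be sketched in code. Fig. 6 is not reproduced here, so this is only a sketch under assumptions: operation i) is taken to be the usual contraction of a node with a single child, and operation ii) is taken to be the usual resolution of a multifurcation by inserting auxiliary internal nodes. The tree encoding (a dict from node name to list of children), the function name `normalize`, and the `aux…` node names are all hypothetical and do not come from the paper.

```python
from typing import Dict, List

Tree = Dict[str, List[str]]  # node -> children; a leaf maps to []


def normalize(tree: Tree, root: str) -> Tree:
    """Return a tree in which every internal node except the root has
    exactly two children. The root is left untouched, so it may keep its
    single outgoing edge. The leaf set is unchanged."""
    out: Tree = {}
    counter = 0

    def build(node: str) -> str:
        nonlocal counter
        # Assumed operation i): skip over internal nodes with one child.
        while node != root and len(tree[node]) == 1:
            node = tree[node][0]
        kids = [build(child) for child in tree[node]]
        # Assumed operation ii): group two children under a new auxiliary
        # node, repeatedly, until only two children remain.
        while len(kids) > 2:
            counter += 1
            aux = f"aux{counter}"  # hypothetical name for the new node
            out[aux] = [kids.pop(), kids.pop()]
            kids.append(aux)
        out[node] = kids
        return node

    build(root)
    return out


# Toy example: a unary node "u" and a node "v" with four children.
example = {"root": ["u"], "u": ["v"], "v": ["a", "b", "c", "d"],
           "a": [], "b": [], "c": [], "d": []}
binary = normalize(example, "root")
```

In this toy example the unary node `u` disappears and `v` ends up with two children, one of which is an auxiliary node. Whether conditions a)-c) of the lemma survive depends on how mutations are attached to edges, which this sketch does not model.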

1.1 Performance Evaluation of $R_r$

1.2 Detailed Procedures for Performance Evaluation

Evaluation of $\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)] / k_2$. We first derive lower and upper bounds for $\mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]$. Letting K be the number of distinct column patterns in A, the lower bound can be derived as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]&= k_2 \sum _{j=1}^{K} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\ge k_2 K \min _{j \in \{1, \cdots , K\}} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\ge k_2 K \min (f_2, 1-f_2)^{n} = k_2 K \underline{f_2}^{n}, \end{aligned}$$

where $\underline{f_2} := \min (f_2, 1-f_2)$. The upper bound can also be derived as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]&= k_2 \sum _{j=1}^{K} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\le k_2 K \max _{j \in \{1, \cdots , K\}} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\le k_2 K \max (f_2, 1-f_2)^{n} = k_2 K \overline{f_2}^{n}, \end{aligned}$$

where $\overline{f_2} := \max (f_2, 1-f_2)$. From this, we can evaluate $\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)]$ as follows.

$$\begin{aligned} (1 - K \overline{f_2}^{n} ) \le \frac{\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)]}{k_2} \le (1 - K \underline{f_2}^{n} ). \end{aligned} \tag{7}$$
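As a numerical illustration of this bound, the following minimal sketch computes the exact expected false-positive rate for a small matrix and compares the resulting specificity with the two bounds. It assumes, as the product formula above suggests, that each entry of a non-variant column is independently 1 with probability $f_2$ and that such a column is accepted exactly when it equals one of the K column patterns of A. The matrix `A` and the function names are hypothetical.

```python
def column_patterns(A):
    """Distinct columns of a binary matrix given as a list of rows."""
    return sorted(set(zip(*A)))


def expected_fp_rate(patterns, f2):
    """Sum over patterns of prod_i f2^a_i * (1 - f2)^(1 - a_i)."""
    total = 0.0
    for pattern in patterns:
        prob = 1.0
        for a in pattern:
            prob *= f2 if a == 1 else 1.0 - f2
        total += prob
    return total


def tn_bounds(K, n, f2):
    """Lower and upper bound on E[TN] / k_2 from the inequality above."""
    f_low, f_high = min(f2, 1.0 - f2), max(f2, 1.0 - f2)
    return 1.0 - K * f_high ** n, 1.0 - K * f_low ** n


# Hypothetical profile: n = 3 regions (rows), 4 accurate positions (columns).
A = [[1, 1, 0, 1],
     [1, 0, 0, 1],
     [1, 0, 1, 1]]
patterns = column_patterns(A)             # K = 3 distinct columns
specificity = 1.0 - expected_fp_rate(patterns, f2=0.1)
lower, upper = tn_bounds(len(patterns), n=3, f2=0.1)
print(lower, specificity, upper)
```

With these toy values the lower bound is negative, which shows that the bound can be vacuous when $K \overline{f_2}^{n}$ exceeds 1; it becomes informative as n grows relative to K.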

Evaluation of $\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)] / k_1$. From the linearity of the expectation, the expected number of true positives can be written as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]&= \mathbb {E}_{B|A} \left[ \sum _{j = 1}^{k_1} L(\textit{\textbf{c}}_{j}, A) \right] = \sum _{j = 1}^{k_1} \mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 ). \end{aligned}$$

The lower bound for $\mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 )$ is as follows.

$$\begin{aligned}&\mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 ) \\&= \sum _{i = 1}^{n} \mathrm {Pr} \left( L(\textit{\textbf{c}}_{j}, A) = 1, \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) \\&= \sum _{i = 1}^{n} \mathrm {Pr} \left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \mathrm {Pr} \left( \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) \\&= \sum _{i = 1}^{n} w_{i} \mathrm {Pr}\left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \\&\ge \sum _{i = 1}^{n} w_{i} \mathrm {Pr}\left( \textit{\textbf{a}}_{I_{j}} = \textit{\textbf{c}}_{j} \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) = \sum _{i = 1}^{n} w_{i}(1-f_1)^{(n-i)}, \\&\,\,\,\, \because ) \,\textit{\textbf{a}}_{I_{j}} = \textit{\textbf{c}}_{j} \Rightarrow L(\textit{\textbf{c}}_{j}, A) = 1, \end{aligned}$$

where $w_{i} := \mathrm {Pr} \left( \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) $. From this,

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)] \ge k_1 \sum _{i = 1}^{n} w_{i}(1-f_1)^{(n-i)}. \end{aligned}$$

To obtain the upper bound of $\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]$, we use two observations, as shown in Fig. 8. First, the number of column vectors in A to which each $\textit{\textbf{c}}_{j}$ can correspond is at most K. Second, the probability that each $\textit{\textbf{c}}_{j}$ corresponds to one given column vector is at most $\overline{f_1}^{n-i}$, where $\overline{f_1} := \max (f_1, 1-f_1)$ and $i = \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }}$. From this, we can obtain the upper bound for the conditional probability as follows.

$$\begin{aligned} \mathrm {Pr}\left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \le K \overline{f_1}^{(n-i)}. \end{aligned}$$

Then, the upper bound of $\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]$ is as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]&\le k_1 \sum _{i = 1}^{n} w_{i} K \overline{f_1}^{(n-i)} = k_1 K \sum _{i = 1}^{n} w_{i} \overline{f_1}^{(n-i)}. \nonumber \end{aligned}$$

Therefore,

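$$\begin{aligned} G_{n}(\textit{\textbf{w}}, (1-f_1)) \le \frac{\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]}{k_1} \le K G_{n}(\textit{\textbf{w}}, \overline{f_1}), \end{aligned}$$

where, matching the two bounds above, $G_{n}(\textit{\textbf{w}}, x) := \sum _{i = 1}^{n} w_{i} x^{(n-i)}$.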
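The sensitivity bounds are simple to evaluate numerically. The sketch below reads $G_n(\textit{\textbf{w}}, x)$ as $\sum_{i=1}^{n} w_i x^{n-i}$, which is what the lower and upper bounds on $\mathbb{E}_{B|A}[\mathrm{TP}(L, A, B)]$ derived above imply. The weight vector `w`, the function names, and the capping of the upper bound at 1 are assumptions made for illustration.

```python
def G(w, x):
    """G_n(w, x) = sum_{i=1}^{n} w_i * x^(n - i), with w = [w_1, ..., w_n]."""
    n = len(w)
    return sum(w[i - 1] * x ** (n - i) for i in range(1, n + 1))


def sensitivity_bounds(w, f1, K):
    """Bounds on E[TP] / k_1 for the phylogeny-based filter L."""
    lower = G(w, 1.0 - f1)
    # A sensitivity cannot exceed 1, so the bound is capped here; the cap
    # is an addition of this sketch, not part of the derivation.
    upper = min(1.0, K * G(w, max(f1, 1.0 - f1)))
    return lower, upper


# Hypothetical weights: w_i is the probability that a true variant is
# present in exactly i of the n = 3 regions.
w = [0.2, 0.3, 0.5]
print(sensitivity_bounds(w, f1=0.1, K=3))
```

For small $f_1$ the uncapped upper bound $K G_n(\textit{\textbf{w}}, \overline{f_1})$ is usually above 1, so in that regime only the lower bound carries information.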

Performance Evaluation of $R_r$. We can evaluate the sensitivity and specificity of $R_r$ as follows.

$$\begin{aligned} \frac{ \mathbb {E}_{B|A}[\mathrm {TP}(R_r, A, B)] }{ k_1 }&= (k_1)^{-1} \mathbb {E}_{B|A} \left[ \sum _{j=1}^{k_1} \sum _{q = r}^{n} \mathbb {I}_{\left\{ \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q \right\} } \right] \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \mathrm {Pr}\left( \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q \right) \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \sum _{x = 1}^{q} \mathrm {Pr}\left( \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q, \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime \prime }=1}^{n} a_{I_{j}, n^{\prime \prime }} = x \right) \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \sum _{x = 1}^{q} w_{x} \, {}_{n-x} \mathrm {C} _{q-x} \, f_{1}^{q-x} (1-f_{1})^{n-q} \\&= \sum _{q = r}^{n} \sum _{x = 1}^{q} w_{x} \, {}_{n-x} \mathrm {C} _{q-x} \, f_{1}^{q-x} (1-f_{1})^{n-q}, \\ \frac{ \mathbb {E}_{B|A}[\mathrm {TN}(R_r, A, B)] }{ k_2 }&= \sum _{x=0}^{r-1} {}_{n} \mathrm {C} _{x} \, (1-f_2)^{n-x} \, f_2^{x} = 1 - \sum _{x=r}^{n} {}_{n} \mathrm {C} _{x} \, (1-f_2)^{n-x} \, f_2^{x}. \end{aligned}$$
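The closed forms for $R_r$ can be evaluated directly. The sketch below is a plain transcription of the two final expressions, reading $R_r$ as a rule that accepts a candidate when at least r of the n regions report it, which is how the indicator in the first line of the derivation behaves. The function names and the weight vector `w` are hypothetical.

```python
from math import comb


def rr_sensitivity(w, f1, r):
    """sum_{q=r}^{n} sum_{x=1}^{q} w_x C(n-x, q-x) f1^(q-x) (1-f1)^(n-q)."""
    n = len(w)
    return sum(
        w[x - 1] * comb(n - x, q - x) * f1 ** (q - x) * (1.0 - f1) ** (n - q)
        for q in range(r, n + 1)
        for x in range(1, q + 1)
    )


def rr_specificity(n, f2, r):
    """sum_{x=0}^{r-1} C(n, x) (1-f2)^(n-x) f2^x."""
    return sum(
        comb(n, x) * (1.0 - f2) ** (n - x) * f2 ** x for x in range(r)
    )


# Hypothetical weights for n = 3 regions; w_x is the probability that a
# true variant is present in exactly x regions.
w = [0.2, 0.3, 0.5]
for r in (1, 2, 3):
    print(r, rr_sensitivity(w, f1=0.1, r=r), rr_specificity(3, f2=0.1, r=r))
```

Raising r trades sensitivity for specificity: the first quantity can only decrease and the second can only increase as r grows, which the loop above makes visible for the toy values.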

Cite this paper

Moriyama, T., Imoto, S., Miyano, S., Yamaguchi, R. (2020). Theoretical Foundation of the Performance of Phylogeny-Based Somatic Variant Detection. In: Bebis, G., Alekseyev, M., Cho, H., Gevertz, J., Rodriguez Martinez, M. (eds) Mathematical and Computational Oncology. ISMCO 2020. Lecture Notes in Computer Science(), vol 12508. Springer, Cham. https://doi.org/10.1007/978-3-030-64511-3_9

DOI: https://doi.org/10.1007/978-3-030-64511-3_9
Published: 07 December 2020
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-64510-6
Online ISBN: 978-3-030-64511-3
eBook Packages: Computer Science, Computer Science (R0)
