Theoretical Foundation of the Performance of Phylogeny-Based Somatic Variant Detection

  • Conference paper
Mathematical and Computational Oncology (ISMCO 2020)

Part of the book series: Lecture Notes in Computer Science (LNBI, volume 12508)

Abstract

We study the performance of a variant detection method based on a property of tumor phylogenetic trees. Our contributions are twofold. First, we show the following property of tumor phylogenetic trees: the set of possible mutation patterns is restricted if a multi-regional mutation profile follows a tumor phylogenetic tree, where a multi-regional mutation profile is a matrix that lists the predicted somatic mutations in each tumor region. Second, we evaluate lower and upper bounds on the specificity and sensitivity of a phylogeny-based somatic variant detection method under several conditions. In the evaluation, we perform patient-wise variant detection on a noisy multi-regional mutation profile matrix for some genomic positions by exploiting the phylogenetic property; we assume that the phylogenetic information can be extracted from another mutation profile matrix that contains accurate candidates at genomic positions different from the noisy ones. We find that higher sensitivity is not guaranteed in phylogeny-based variant detection, whereas higher specificity is guaranteed in several cases. These findings indicate that tumor phylogeny benefits variant detection more when it is based on error-prone long-read sequencers (e.g., Oxford Nanopore sequencers) than when it is based on accurate short-read sequencers (e.g., Illumina sequencers).

Acknowledgements

This work has been supported by the Grant-in-Aid for JSPS Research Fellow (17J08884) and MEXT/JSPS KAKENHI Grant (15H05912, hp180198, hp170227, 18H03329, hp190158).

Corresponding author

Correspondence to Rui Yamaguchi.

Appendix

Fig. 6. Procedures for (a) removing a node having only one outgoing edge and (b) removing a node having more than two outgoing edges.

Proof

(Proof of Lemma 2) T has a phylogenetic tree, hence we can choose a phylogenetic tree \(\mathcal {T}\). We can assume \(|F_{\mathcal {T}}| \le c\) by removing every leaf of \(\mathcal {T}\) to which no cell is mapped by \(f:R \rightarrow F_{\mathcal {T}}\). We can also assume that the root node has only one outgoing edge by adding a new root node and connecting it to the previous root node.

For the last condition, we remove the following two types of internal nodes: i) internal nodes having only one outgoing edge, and ii) internal nodes having more than two outgoing edges. It is sufficient to show that nodes satisfying i) or ii) can be removed from \(\mathcal {T}\) while keeping conditions a)-c).

For i), remove the node as in Fig. 6a. We can easily check that a)-c) still hold after this operation. For ii), remove the node as in Fig. 6b. If the number of outgoing edges is more than three, apply this operation recursively. We can also check that a)-c) still hold after these operations. Because the operations pictured in Fig. 6a and 6b strictly decrease the number of nodes of types i) and ii), repeating them eliminates all such nodes. \(\square \)
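The two removal steps can be illustrated with a small Python sketch. Fig. 6 is not reproduced here, so the splitting rule for type ii) nodes below (grouping two children under a new auxiliary node) is an assumption about one standard way to resolve such nodes and may differ from the construction in Fig. 6b. The tree encoding as a dictionary and the names `normalize_tree`, `leaves`, and `example` are hypothetical, and the sketch does not check conditions a)-c) or prune leaves without cells.

```python
from itertools import count

def normalize_tree(children, root):
    """Return (new_children, new_root) for a rooted tree {node: [child, ...]}.

    Leaves are nodes with no entry (or an empty list) in `children`.
    """
    fresh = count()
    out = {}

    def visit(node):
        kids = [visit(k) for k in children.get(node, [])]
        if len(kids) == 1:
            # Type i): a node with one outgoing edge is contracted into its child.
            return kids[0]
        while len(kids) > 2:
            # Type ii): group two children under a new auxiliary node and repeat
            # (assumed splitting rule, see the lead-in).
            aux = ("aux", next(fresh))
            out[aux] = [kids.pop(), kids.pop()]
            kids.append(aux)
        if kids:
            out[node] = kids
        return node

    # Add a new root with a single outgoing edge, as in the first step of the proof.
    new_root = ("root", 0)
    out[new_root] = [visit(root)]
    return out, new_root

def leaves(children, root):
    """Collect the leaf set below `root`."""
    kids = children.get(root, [])
    if not kids:
        return {root}
    return set().union(*(leaves(children, k) for k in kids))

# Hypothetical input: "b" has one outgoing edge, "a" has four.
example = {"r": ["a"], "a": ["b", "c", "d", "e"], "b": ["x"]}
tree, top = normalize_tree(example, "r")
```

In the result, the new root has one outgoing edge, every other internal node has exactly two, and the leaf set is unchanged.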

1.1 Performance Evaluation of \(R_r\)

Fig. 7. Specificity and sensitivity of \(R_r\) at \(r=5\), \(n = 20\), \(K = 30\).

1.2 Detailed Procedures for Performance Evaluation

Evaluation of \(\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)] / k_2\). We first derive lower and upper bounds for \(\mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]\). Letting K be the number of distinct column patterns in A, the lower bound can be derived as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]&= k_2 \sum _{j=1}^{K} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\ge k_2 K \min _{j \in \{1, \cdots , K\}} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\ge k_2 K \min (f_2, 1-f_2)^{n} = k_2 K \underline{f_2}^{n}, \end{aligned}$$

where \(\underline{f_2} := \min (f_2, 1-f_2)\). The upper bound can also be derived as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {FP}(L, A, B)]&= k_2 \sum _{j=1}^{K} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\le k_2 K \max _{j \in \{1, \cdots , K\}} \left( \prod _{i = 1}^{n} f_2^{a_{I_j,i}} (1-f_2)^{1-a_{I_j,i}} \right) \\&\le k_2 K \max (f_2, 1-f_2)^{n} = k_2 K \overline{f_2}^{n}, \end{aligned}$$

where \(\overline{f_2} := \max (f_2, 1-f_2)\). Since \(\mathrm {TN} = k_2 - \mathrm {FP}\), these two bounds yield the following bounds for \(\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)]\).

$$\begin{aligned} (1 - K \overline{f_2}^{n} ) \le \frac{\mathbb {E}_{B|A}[\mathrm {TN}(L, A, B)]}{k_2} \le (1 - K \underline{f_2}^{n} ). \end{aligned} \tag{7}$$
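As a numerical check of Eq. (7), the following Python sketch compares the exact expected specificity for a small set of column patterns with the two bounds. It assumes, as in the derivation above, that every entry of a noise column equals 1 independently with probability \(f_2\). The function names and the example patterns are hypothetical.

```python
from math import prod

def fp_rate_exact(patterns, f2):
    """Probability that a noise column with i.i.d. Bernoulli(f2) entries
    equals one of the distinct 0/1 column patterns found in A."""
    return sum(
        prod(f2 if a else 1.0 - f2 for a in pattern)
        for pattern in set(patterns)
    )

def specificity_bounds(K, n, f2):
    """Lower and upper bounds of Eq. (7) on E[TN] / k_2."""
    lower = 1.0 - K * max(f2, 1.0 - f2) ** n
    upper = 1.0 - K * min(f2, 1.0 - f2) ** n
    return lower, upper

# Hypothetical example: n = 4 regions and K = 3 distinct column patterns in A.
patterns = [(1, 1, 1, 1), (1, 1, 0, 0), (0, 0, 1, 0)]
f2 = 0.1
exact_specificity = 1.0 - fp_rate_exact(patterns, f2)
lower, upper = specificity_bounds(len(patterns), 4, f2)
```

The lower bound is informative only when \(K \overline{f_2}^{n} < 1\); in this toy example it is negative, while the upper bound is close to the exact value.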

Evaluation of \(\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)] / k_1\). From the linearity of the expectation, the expected number of true positives can be written as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]&= \mathbb {E}_{B|A} \left[ \sum _{j = 1}^{k_1} L(\textit{\textbf{c}}_{j}, A) \right] = \sum _{j = 1}^{k_1} \mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 ). \end{aligned}$$

The lower bound for \(\mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 )\) is as follows.

$$\begin{aligned}&\mathrm {Pr}( L(\textit{\textbf{c}}_{j}, A) = 1 ) \\&= \sum _{i = 1}^{n} \mathrm {Pr} \left( L(\textit{\textbf{c}}_{j}, A) = 1, \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) \\&= \sum _{i = 1}^{n} \mathrm {Pr} \left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \mathrm {Pr} \left( \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) \\&= \sum _{i = 1}^{n} w_{i} \mathrm {Pr}\left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \\&\ge \sum _{i = 1}^{n} w_{i} \mathrm {Pr}\left( \textit{\textbf{a}}_{I_{j}} = \textit{\textbf{c}}_{j} \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) = \sum _{i = 1}^{n} w_{i}(1-f_1)^{(n-i)}, \\&\,\,\,\, \because ) \,\textit{\textbf{a}}_{I_{j}} = \textit{\textbf{c}}_{j} \Rightarrow L(\textit{\textbf{c}}_{j}, A) = 1, \end{aligned}$$

where \(w_{i} := \mathrm {Pr} \left( \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right) \). From this,

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)] \ge k_1 \sum _{i = 1}^{n} w_{i}(1-f_1)^{(n-i)}. \end{aligned}$$

To obtain the upper bound of \(\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]\), we use two observations, illustrated in Fig. 8. First, the number of column vectors in A to which each \(\textit{\textbf{c}}_{j}\) can correspond is at most K. Second, the probability that \(\textit{\textbf{c}}_{j}\) corresponds to a given column vector is at most \(\overline{f_1}^{n-i}\), where \(\overline{f_1} := \max (f_1, 1-f_1)\) and \(i = \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }}\). From these, we obtain the following upper bound for the conditional probability.

$$\begin{aligned} \mathrm {Pr}\left( L(\textit{\textbf{c}}_{j}, A) = 1 \left| \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime } = 1}^{n} a_{I_{j}, n^{\prime }} = i \right. \right) \le K \overline{f_1}^{(n-i)}. \end{aligned}$$

Then, the upper bound of \(\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]\) is as follows.

$$\begin{aligned} \mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]&\le k_1 \sum _{i = 1}^{n} w_{i} K \overline{f_1}^{(n-i)} = k_1 K \sum _{i = 1}^{n} w_{i} \overline{f_1}^{(n-i)}. \nonumber \end{aligned}$$

Therefore,

$$\begin{aligned} G_{n}(\textit{\textbf{w}}, (1-f_1)) \le \frac{\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]}{k_1} \le K G_{n}(\textit{\textbf{w}}, \overline{f_1}), \end{aligned}$$

where \(G_{n}(\textit{\textbf{w}}, x) := \sum _{i = 1}^{n} w_{i} x^{n-i}\), as the lower and upper bounds derived above require.

Fig. 8. The key idea for obtaining the upper bound of \(\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]\).
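The sensitivity bounds can be evaluated numerically with a short Python sketch. It reads \(G_{n}(\textit{\textbf{w}}, x)\) as \(\sum _{i=1}^{n} w_{i} x^{n-i}\), which is the form implied by the lower and upper bounds on \(\mathbb {E}_{B|A}[\mathrm {TP}(L, A, B)]\). The function names and the uniform weights are hypothetical illustration choices, not values from the paper.

```python
def G(w, x):
    """G_n(w, x) = sum_{i=1}^{n} w_i * x**(n - i), where w[i - 1] holds w_i."""
    n = len(w)
    return sum(w_i * x ** (n - i) for i, w_i in enumerate(w, start=1))

def sensitivity_bounds(w, f1, K):
    """Lower and upper bounds on E[TP] / k_1 for the phylogeny-based detector L."""
    lower = G(w, 1.0 - f1)
    upper = K * G(w, max(f1, 1.0 - f1))
    return lower, upper

# Hypothetical example: n = 4 regions, uniform weights w_i, K = 3 patterns.
w = [0.25, 0.25, 0.25, 0.25]
lower, upper = sensitivity_bounds(w, f1=0.1, K=3)
```

Because the upper bound carries the factor K, it can exceed 1 and is then uninformative; the sketch returns it unclamped to stay close to the formula.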

Performance Evaluation of \(R_r\). We can evaluate the specificity and sensitivity for \(R_r\).

$$\begin{aligned} \frac{ \mathbb {E}_{B|A}[\mathrm {TP}(R_r, A, B)] }{ k_1 }&= (k_1)^{-1} \mathbb {E}_{B|A} \left[ \sum _{j=1}^{k_1} \sum _{q = r}^{n} \mathbb {I}_{\left\{ \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q \right\} } \right] \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \mathrm {Pr}\left( \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q \right) \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \sum _{x = 1}^{q} \mathrm {Pr}\left( \sum _{n^{\prime }=1}^{n} c_{j,n^{\prime }} = q, \textit{\textbf{c}}_{j} \,\,s.t.\; \sum _{n^{\prime \prime }=1}^{n} a_{I_{j}, n^{\prime \prime }} = x \right) \\&= (k_1)^{-1} \sum _{j=1}^{k_1} \sum _{q = r}^{n} \sum _{x = 1}^{q} w_{x} \, {}_{n-x} \mathrm {C} _{q-x} \, f_{1}^{q-x} (1-f_{1})^{n-q} \\&= \sum _{q = r}^{n} \sum _{x = 1}^{q} w_{x} \, {}_{n-x} \mathrm {C} _{q-x} \, f_{1}^{q-x} (1-f_{1})^{n-q}, \\ \frac{ \mathbb {E}_{B|A}[\mathrm {TN}(R_r, A, B)] }{ k_2 }&= \sum _{x=0}^{r-1} {}_{n} \mathrm {C} _{x} \, (1-f_2)^{n-x} \, f_2^{x} = 1 - \sum _{x=r}^{n} {}_{n} \mathrm {C} _{x} \, (1-f_2)^{n-x} \, f_2^{x}. \end{aligned}$$
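The closed forms for \(R_r\) translate directly into code. The sketch below evaluates both expressions with Python's `math.comb`; the uniform weights \(w_x\) and the error rates \(f_1 = f_2 = 0.05\) are hypothetical inputs chosen only for illustration, while \(n = 20\) and \(r = 5\) follow the setting of Fig. 7.

```python
from math import comb

def rr_sensitivity(w, f1, r):
    """E[TP(R_r)] / k_1: sum over q >= r and x <= q of
    w_x * C(n - x, q - x) * f1**(q - x) * (1 - f1)**(n - q); w[x - 1] holds w_x."""
    n = len(w)
    return sum(
        w[x - 1] * comb(n - x, q - x) * f1 ** (q - x) * (1.0 - f1) ** (n - q)
        for q in range(r, n + 1)
        for x in range(1, q + 1)
    )

def rr_specificity(n, f2, r):
    """E[TN(R_r)] / k_2: probability that fewer than r of n noise entries equal 1."""
    return sum(comb(n, x) * (1.0 - f2) ** (n - x) * f2 ** x for x in range(r))

# Hypothetical inputs: uniform w_x over x = 1..n, f1 = f2 = 0.05.
n, r = 20, 5
w = [1.0 / n] * n
sensitivity = rr_sensitivity(w, 0.05, r)
specificity = rr_specificity(n, 0.05, r)
```

With \(r = 1\) and weights summing to 1, the sensitivity expression sums to 1, which is a convenient sanity check on the binomial terms.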

Copyright information

© 2020 Springer Nature Switzerland AG

About this paper

Cite this paper

Moriyama, T., Imoto, S., Miyano, S., Yamaguchi, R. (2020). Theoretical Foundation of the Performance of Phylogeny-Based Somatic Variant Detection. In: Bebis, G., Alekseyev, M., Cho, H., Gevertz, J., Rodriguez Martinez, M. (eds) Mathematical and Computational Oncology. ISMCO 2020. Lecture Notes in Computer Science, vol 12508. Springer, Cham. https://doi.org/10.1007/978-3-030-64511-3_9

  • DOI: https://doi.org/10.1007/978-3-030-64511-3_9

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-64510-6

  • Online ISBN: 978-3-030-64511-3

  • eBook Packages: Computer Science, Computer Science (R0)
