
I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads

Conference paper

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1753)

Abstract

Long-read sequencing technologies demonstrate high potential for de novo discovery of complex transcript isoforms, but high error rates pose a significant challenge. Existing error correction methods rely on clustering reads based on isoform-level alignment and cannot be efficiently scaled. We propose a new method, I-CONVEX, that performs fast, alignment-free isoform clustering with almost linear computational complexity, and leads to better consensus accuracy on simulated, synthetic, and real datasets.


References

  1. Steijger, T., et al.: Assessment of transcript reconstruction methods for RNA-seq. Nat. Meth. 10, 1177–1184 (2013)

  2. Kannan, S., Hui, J., Mazooji, K., Pachter, L., Tse, D.: Shannon: an information-optimal de Novo RNA-Seq assembler. Preprint at https://www.biorxiv.org/content/10.1101/039230v1 (2016)

  3. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Meth. 5, 621–628 (2008)

  4. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009)

  5. Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PLoS ONE 10, e0132628 (2015)

  6. Wang, B., et al.: Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat. Commun. 7, 11708 (2016)

  7. Hwang, B., Lee, J.H., Bang, D.: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018)

  8. Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nat. Commun. 9, 4601 (2018)

  9. Abdel-Ghany, S.E., et al.: A survey of the sorghum transcriptome using single-molecule long reads. Nat. Commun. 7, 11706 (2016)

  10. Weirather, J.L., et al.: Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015)

  11. Westbrook, C.J., et al.: No assembly required: full-length MHC class I allele discovery by PacBio circular consensus sequencing. Hum. Immunol. 76, 891–896 (2015)

  12. Roberts, A., Pachter, L.: Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Meth. 10, 71–73 (2013)

  13. Trapnell, C., et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010)

  14. Broder, A.Z.: On the resemblance and containment of documents. Compression Complex. Seq. 1997, 21–29 (1997)

  15. Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–14 (2016)

  16. Berlin, K., et al.: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015)

  17. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB, vol. 99, no. 6 (1999)

  18. Tardaguila, M., et al.: SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2016)

  19. Safonova, Y., et al.: IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics 31, i53–i61 (2015)

  20. Gibrat, J.F.: A short note on dynamic programming in a band. BMC Bioinf. 19(1), 1–5 (2018)

  21. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004)

  22. Tseng, E., Tang, H., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochim. Biophys. Acta 1860, 1117–1126 (2017)

  23. Steeb, W., Hardy, Y.: Matrix Calculus and Kronecker Product: A Practical Approach to Linear and Multilinear Algebra, 2nd edn. World Scientific Publishing Company, Singapore (2011)

  24. Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48, 334 (1997)

  25. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)


Author information

Correspondence to Sina Baharlouei.


Appendix

1.1 Supplementary Note 1: Identifiability of de novo Transcriptome Recovery from Long Reads

A family of distributions \(P = \{{P}_{\theta }| \theta \in \Theta \}\) parameterized by a vector \(\theta \) is called identifiable if it satisfies the following condition:

$$ P_{{{\uptheta }_{1} }} = P_{{{\uptheta }_{2} }} \Rightarrow {\uptheta }_{1} = {\uptheta }_{2} \quad \forall {\uptheta }_{1} ,{\uptheta }_{2} \in {\Theta }, $$

This means that, with enough observations from one of the distributions in the family, one can determine the ground-truth value of \(\theta \). For the de novo transcriptome recovery problem, \(\theta \) is defined as the set of unknown transcripts/sequences \(S = \{{s}_{1}, {s}_{2}, ..., {s}_{m}\}\) and their corresponding abundances \(\rho = \{{\rho }_{1}, {\rho }_{2}, ..., {\rho }_{m}\}\). For a given read \(r\), we define \({P}_{\theta }(r)\) as the probability of observing \(r\) given \(\theta = (S, \rho )\). We assume that a read is generated from a sequence by adding insertion, deletion, and substitution errors through an i.i.d. process that does not depend on the location of the error. For simplicity, let \({\delta }_{s}, {\delta }_{i},\) and \({\delta }_{d}\) be the probabilities of observing a substitution, insertion, or deletion error at each location of the sequence, respectively. In Theorem 1, we prove that the problem of de novo transcriptome recovery is identifiable.
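The following minimal Python sketch illustrates this generative model; the channel order (substitution, then deletion, then insertion) follows Supplementary Figure 1 below, and the error rates and example transcript are arbitrary illustrative values:

```python
import random

BASES = "ACGT"

def simulate_read(seq, d_sub, d_ins, d_del, rng=random.Random(0)):
    """Apply the i.i.d. substitution, deletion, and insertion channels
    (in that order) to a transcript sequence and return a noisy read."""
    # Substitution channel: each base is replaced by one of the other
    # three bases with total probability d_sub.
    out = []
    for b in seq:
        if rng.random() < d_sub:
            out.append(rng.choice([c for c in BASES if c != b]))
        else:
            out.append(b)
    # Deletion channel: each base is dropped independently with probability d_del.
    out = [b for b in out if rng.random() >= d_del]
    # Insertion channel: after each surviving base, a uniformly random base
    # is inserted with probability d_ins (one simple i.i.d. realization).
    read = []
    for b in out:
        read.append(b)
        if rng.random() < d_ins:
            read.append(rng.choice(BASES))
    return "".join(read)

# Example: draw one noisy read from a single transcript.
print(simulate_read("ACCAGTCGGTTACG", d_sub=0.05, d_ins=0.03, d_del=0.03))
```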

Supplementary Theorem 1 (Long-Read Sequencing Identifiability):

Assume that the set of sequences \(S = \{{s}_{1}, {s}_{2}, ..., {s}_{m}\}\) with the corresponding abundances \(\rho = \{{\rho }_{1}, {\rho }_{2}, ..., {\rho }_{m}\}\) is unknown. Suppose that, for any read \(r\), \({P}_{S, \rho }(r)\) denotes the probability of observing read \(r\) given that it is generated from a sequence in \(S\) with the abundance vector \(\rho \). Then, given \({P}_{S, \rho }(r)\) for all reads \(r\), both \(S\) and \(\rho \) can be exactly recovered whenever the substitution error parameter \({\delta }_{s} \ne \frac{3}{4}\).

Proof:

Each read \(r\) with non-zero \(P(r)\) can be obtained by adding insertions, deletions, and substitutions to one or several sequences in \(S\). Without loss of generality, we can model the transcriptome recovery problem as denoising the reads obtained by applying three noise channels to the set of sequences in \(S\) sequentially (Supplementary Figure 1). In this model, \({P}_{1}(r)\) is the distribution of reads after the substitution channel, and \({P}_{2}(r)\) is the distribution of reads after applying the deletion channel (and before the insertion channel). Lastly, \(P(r)\) is the final probability distribution observed after applying the insertion noise. If we prove that transcriptome recovery is identifiable under each of these three channels (insertion, deletion, and substitution) individually, then the entire problem is identifiable. This is because, given the probability of observing reads \(P(r)\), we can recover \({P}_{2}(r)\) by denoising the effect of the insertion channel. Then, by the same argument, we can recover \({P}_{1}(r)\) by denoising the effect of the deletion channel. Finally, we can find the vector \(\rho \) by denoising the effect of the substitution channel on the probability vector \({P}_{1}\). Thus, it suffices to show that \({P}_{2}(r), {P}_{1}(r),\) and \(\rho \) can be recovered sequentially given \(P(r)\). In what follows, we show that each of these three channels is identifiable.

Supplementary Figure 1.

Each channel affects the probability vector of reads from the previous step. To prove the identifiability of the transcriptome recovery problem, it suffices to denoise the effect of the channels sequentially from last to first.

Identifiability Under the Substitution Channel:

For the substitution channel, observe that a read \(r\) of length \(L\) can only be generated from a sequence of length \(L\), since substitutions do not change the sequence length. Thus, for estimating the abundances of the sequences of length \(L\), we can focus only on the set of reads of length \(L\). Hence, without loss of generality, we can assume that the reads and sequences are all of length \(L\). Moreover, without loss of generality, assume that \(S\) consists of all \({4}^{L}\) possible transcripts of length \(L\), with possibly zero abundances. Thus, the problem reduces to recovering the vector \(\rho \), given \(P(r)\) and \(S\). Let \({\delta }_{s}=\delta \) be the substitution error rate. Then the probability of observing the read \({r}_{i}\) can be written as follows:

$$P\left({r}_{i}\right)=\sum\limits_{j=1}^{m}{\rho }_{j}P\left({r}_{i}|{s}_{j}\right)=\sum\limits_{j=1}^{m}{\rho }_{j}{\left(\frac{\delta }{3}\right)}^{d\left({r}_{i},{s}_{j}\right)}{\left(1-\delta \right)}^{L-d\left({r}_{i},{s}_{j}\right)}$$

Here \(d({r}_{i},{s}_{j})\) is the number of mismatches between \({r}_{i}\) and \({s}_{j}\). This relationship can be written in matrix form. For example, when \(L=1\), let the abundances of ‘A’, ‘C’, ‘G’, and ‘T’ be \({\rho }_{1}, {\rho }_{2}, {\rho }_{3},\) and \({\rho }_{4}\), respectively. Then the probability of observing each of {‘A’, ‘C’, ‘G’, ‘T’} is given by the following equation:

$$\left[\begin{array}{c}P\left({r}_{1}\right)\\ P\left({r}_{2}\right)\\ P\left({r}_{3}\right)\\ P\left({r}_{4}\right)\end{array}\right]= A\left[\begin{array}{c}{\uprho }_{1}\\ {\uprho }_{2}\\ {\uprho }_{3}\\ {\uprho }_{4}\end{array}\right]$$

where

$$ A = \left[ {\begin{array}{*{20}c} {1 - \delta } & {\delta /3} & {\delta /3} & {\delta /3} \\ {\delta /3} & {1 - \delta } & {\delta /3} & {\delta /3} \\ {\delta /3} & {\delta /3} & {1 - \delta } & {\delta /3} \\ {\delta /3} & {\delta /3} & {\delta /3} & {1 - \delta } \\ \end{array} } \right] $$

Thus, the identifiability problem in this case reduces to showing the invertibility of this matrix. The determinant of \(A\) can be computed as

$$ \det (A) = \left( 1 - \frac{4\delta }{3} \right)^{3} . $$

One can easily check that this polynomial is non-zero for all \(\delta \in [0, 1]\) except \(\delta = \frac{3}{4}\), which implies the invertibility of the matrix for all \(\delta \in [0, 1]\) except \(\delta = \frac{3}{4}\).
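The following short NumPy sketch numerically checks this determinant formula for a few illustrative values of \(\delta\):

```python
import numpy as np

def substitution_matrix(delta):
    """4x4 transition matrix A of the per-base substitution channel."""
    A = np.full((4, 4), delta / 3.0)
    np.fill_diagonal(A, 1.0 - delta)
    return A

for delta in (0.0, 0.1, 0.5, 0.75, 0.9):
    A = substitution_matrix(delta)
    print(delta, np.linalg.det(A), (1.0 - 4.0 * delta / 3.0) ** 3)
# Only delta = 3/4 makes the determinant vanish, i.e. makes A non-invertible.
```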

Now, let us consider the case of general length \(L>1\). Assume that we order the set of all possible sequences of length \(L\) (all \(m={4}^{L}\) of them) as numbers in base \(4\). In this case, the probability transition relation can be written as

$$ \left[ {\begin{array}{*{20}c} {P\left( {r_{1} } \right)} \\ {P\left( {r_{2} } \right)} \\ \ldots \\ {P\left( {r_{m} } \right)} \\ \end{array} } \right] = \underbrace {A \otimes \ldots \otimes A}_{{L\:{\text{times}}}}\left[ {\begin{array}{*{20}c} {{\uprho }_{1} } \\ {{\uprho }_{2} } \\ \ldots \\ {{\uprho }_{m} } \\ \end{array} } \right] $$

where the Kronecker product of two matrices \({A}_{(m \times n)}\) and \({B}_{(p \times q)}\) is a matrix \({C}_{((m\times p) \times (n\times q))}\), defined as:

$$C=A\otimes B=\left[\begin{array}{cccc}{a}_{11}B& {a}_{12}B& ...& {a}_{1n}B\\ {a}_{21}B& {a}_{22}B& ...& {a}_{2n}B\\ ...& ...& ...& ...\\ {a}_{m1}B& {a}_{m2}B& ...& {a}_{mn}B\end{array}\right]$$

The Kronecker product of \(L\) matrices is invertible if and only if each one of them is invertible [23]. We have already shown that \(A\) is an invertible matrix when \({\delta }_{s}\ne \frac{3}{4}\). Thus, \(A \otimes A\otimes ... \otimes A\) is invertible as well, which shows the problem is identifiable when the substitution channel is applied.
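The toy NumPy sketch below illustrates this argument for \(L = 2\): it forms the \(4^{L} \times 4^{L}\) Kronecker transition matrix, pushes a randomly drawn abundance vector through the substitution channel, and recovers it exactly by solving the resulting linear system (the abundances and \(\delta\) are arbitrary illustrative values):

```python
import numpy as np
from functools import reduce

def substitution_matrix(delta):
    # Per-base substitution channel: keep a base w.p. 1 - delta, otherwise
    # replace it with one of the other three bases uniformly at random.
    A = np.full((4, 4), delta / 3.0)
    np.fill_diagonal(A, 1.0 - delta)
    return A

L, delta = 2, 0.1
T = reduce(np.kron, [substitution_matrix(delta)] * L)   # 4^L x 4^L transition matrix

rng = np.random.default_rng(0)
rho = rng.dirichlet(np.ones(4 ** L))   # ground-truth abundances over all length-L sequences
P = T @ rho                            # read distribution after the substitution channel
rho_hat = np.linalg.solve(T, P)        # exact recovery: T is invertible since delta != 3/4
print(np.allclose(rho_hat, rho))       # True
```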

Insertion Channel:

Let \(R\) be the support of \(P\) (\(R = \{r: P(r) > 0\}\)). We define the operator \(\le \) on a pair of reads as follows: \({r}_{1 }\le {r}_{2}\) if \({r}_{2}\) can be obtained by inserting zero or more bases into \({r}_{1}\) at arbitrary locations. It is easy to observe that \((R, \le )\) forms a partially ordered set (poset), since \(\le \) satisfies reflexivity, anti-symmetry, and transitivity. We have the following observations:

  • If \(r^{\prime} \le r\) and \(r \ne r^{\prime}\), then \(\left| {r^{\prime}} \right| < \left| r \right|\).

  • R has at least one minimal element.

  • Any minimal element of \(R\) belongs to \(S\).

To see the correctness of the first observation, assume that \(r^{\prime} \le r\) and \(r^{\prime} \ne r\), which means \(r\) is obtained by adding zero or more characters to \(r^{\prime}\). However, since \(r^{\prime} \ne r\), the number of added characters cannot be zero; thus, at least one character is added to \(r^{\prime}\) to obtain \(r\). This means that \(\left| {r^{\prime}} \right| < \left| r \right|\).

To prove the second observation, let \(r_{min}\) be an element of \(R\) with minimum length (it is not necessarily unique). Such an element exists, because \(\left| r \right| \ge 1\) for all \(r \in R\). We claim that \(r_{min}\) is a minimal element and prove it by contradiction. Assume that \(r_{min}\) is not a minimal element. Then there exists an element \(r^{\prime} \ne r_{min}\) in \(R\) such that \(r^{\prime} \le r_{min}\). According to the first observation, \(\left| r^{\prime} \right| < \left| {r_{min} } \right|\), which contradicts the fact that \(r_{min}\) is an element of minimum length in \(R\).

Finally, we prove that every minimal element of \(R\) belongs to \(S\), again by contradiction. Assume that \({r}_{min}\) is a minimal element that does not belong to \(S\). Since \(P({r}_{min})>0\) and the insertion channel only adds bases, \({r}_{min}\) must be obtained from some sequence in \(S\) by insertions; hence there exists an element \(s\in S\) such that \(s \le {r}_{min}\), and \(s \ne {r}_{min}\) because \({r}_{min}\notin S\). Clearly \(s\) belongs to \(R\), since \(P(s)\ge P(s|s)={\rho }_{s}(1 - {\delta }_{i})^{|s|}>0\). Therefore, \(s\) is an element of \(R\) that is not equal to \({r}_{min}\) and satisfies \(s\le {r}_{min}\), which contradicts the fact that \({r}_{min}\) is a minimal element of \(R\).
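The following sketch illustrates the partial order and the minimal elements used above: \(r^{\prime} \le r\) under the insertion order exactly when \(r^{\prime}\) is a subsequence of \(r\) (the example reads are arbitrary):

```python
def precedes(r1, r2):
    """r1 <= r2 in the insertion order: r2 can be obtained from r1 by
    inserting zero or more bases, i.e. r1 is a subsequence of r2."""
    it = iter(r2)
    return all(base in it for base in r1)   # consumes `it` left to right

def minimal_elements(reads):
    """Reads that no other (distinct) read precedes; by the observations
    above, every minimal element of the support R belongs to S."""
    return [r for r in reads
            if not any(precedes(q, r) for q in reads if q != r)]

R = ["ACGT", "ACGGT", "TACGT", "CCA", "CCTA"]
print(minimal_elements(R))  # ['ACGT', 'CCA']
```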

Now, we prove the identifiability of the transcriptome recovery problem under the insertion channel by induction on the size of \(S\).

For \(|S|=1\), according to the third observation, the minimal member of \(R\) belongs to \(S\), and its abundance is clearly \(1\).

Based on the induction hypothesis, assume that the problem is identifiable when \(|S|=m\). We want to prove that it is identifiable when \(|S|=m+1\). Based on the third observation, any minimal element \({r}_{min}\) of \(R\) belongs to \(S\). Set \(S^{\prime} = S - \left\{ {r_{min} } \right\}\) and update the probability vector of the reads accordingly: \(P^{\prime}\left( r \right) = P\left( r \right) - P(r|r_{min} )\rho_{{r_{min} }}\). To update the probability vector of the reads, we need to estimate \({\rho }_{{r}_{min}}\) exactly. Since \({r}_{min}\) is a minimal element, it can only be obtained (with zero insertions) from a sequence identical to itself. Hence, \(P({r}_{min}) = {\rho }_{{r}_{min}}(1-{\delta }_{i})^{|{r}_{min}|}\), which means \({\rho }_{{r}_{min}}=\frac{P({r}_{min})}{(1-{\delta }_{i})^{|{r}_{min}|}}.\) By the induction hypothesis, since \(\left| {S^{\prime}} \right| = m\), we can recover \(S^{\prime}\) using \(P^{\prime}\). Thus, \(S = S^{\prime} \cup \left\{ {r_{min} } \right\}\), which means we can recover \(S\) exactly.

Deletion Channel:

The argument is similar to that for the insertion channel. The only difference is that the operator \(\le \) is now defined as follows: \({r}_{1 }\le {r}_{2}\) if \({r}_{2}\) can be obtained by deleting zero or more bases from \({r}_{1}\). Consequently, a minimal element of \(R\) is a longest read in this case.

1.2 Supplementary Note 2: Sparse EM Algorithm Inside the Isoform Recovery via Convexification

As discussed in Sect. 3, the maximum likelihood estimator of the abundance vector given the set of transcripts and reads can be written as:

$${\widehat{\rho }}_{ML}=\underset{\rho }{\mathrm{argmax}} \sum\limits_{i=1}^{n}log\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\uprho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\uprho }_{j}=1$$

Or equivalently:

$${\widehat{\rho }}_{ML}=\underset{\rho }{\mathrm{argmin}} \sum\limits_{i=1}^{n}-log\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\rho }_{j}=1$$

Applying the Expectation Maximization (EM) algorithm, at each iteration we minimize a tight local upper bound of the above objective function (note that the second term below is evaluated at \(\rho^{t}\) and is therefore constant in \(\rho \)):

$$\rho^{t+1}=\underset{\rho }{\mathrm{argmin}}\ -\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\rho }_{j}=1$$

The Lagrangian function of the above problem can be written as:

$$L\left(\rho ,\lambda \right)=-\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)+\lambda \left(\sum\limits_{j=1}^{m}{\rho }_{j}-1\right)$$

And the dual problem takes the form of:

$$\underset{\uplambda }{\mathrm{max}}\ \underset{\uprho }{\mathrm{min}}\ L\left(\uprho ,\uplambda \right)$$

Since the objective is convex in \(\rho \) and the constraints are linear, strong duality holds [24]. Thus, by setting the gradient of \(L\) with respect to \(\rho \) to \(0\), we obtain:

$$-\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\frac{1}{{\uprho }_{k}}+{\uplambda }^{*}=0,\hspace{1em}\forall k=1,\dots ,m.$$

which means:

$${\uprho }_{k}^{t+1}=\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\hspace{1em}\forall k=1,\dots ,m.$$

Since \(\sum\nolimits_{j=1}^{m}{{\rho }_{j}^{t+1}}= 1\), we have:

$$1=\sum\limits_{k=1}^{m}{\uprho }_{k}^{t+1}=\sum\limits_{k=1}^{m}\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}=\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{\sum\nolimits_{k=1}^{m}{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}=\frac{n}{{\uplambda }^{*}}$$

Therefore \({\lambda }^{*}= n\), and

$${\uprho }_{k}^{t+1}=\frac{1}{n}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\hspace{1em}\forall k=1,\dots ,m.$$

This update rule is similar to the one used by Express [12] to obtain the abundance vector.
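A minimal NumPy sketch of this EM update is given below; the \(\alpha_{ij}\) values in the toy example are made up for illustration:

```python
import numpy as np

def em_abundances(alpha, n_iters=100):
    """EM iterations for the abundance vector rho.

    alpha: (n, m) array with alpha[i, j] = P(read i | transcript j).
    Implements rho_k^{t+1} = (1/n) * sum_i alpha_ik rho_k^t / sum_j alpha_ij rho_j^t.
    """
    n, m = alpha.shape
    rho = np.full(m, 1.0 / m)                 # uniform initialization
    for _ in range(n_iters):
        denom = alpha @ rho                   # shape (n,): sum_j alpha_ij rho_j
        resp = alpha * rho / denom[:, None]   # responsibility of each transcript per read
        rho = resp.sum(axis=0) / n            # closed-form update derived above
    return rho

# Toy example: 3 transcripts, 5 reads with made-up likelihoods.
alpha = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.1, 0.9, 0.0],
                  [0.0, 0.2, 0.8],
                  [0.0, 0.1, 0.9]])
print(em_abundances(alpha))
```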

Sparsification.

One naïve approach to imposing sparsity on the estimated abundances in the above problem is to use the \(\ell _{1}\) regularizer [25]. However, the abundances are non-negative and sum to \(1\), so the \(\ell _{1}\) norm of \(\rho \) is constant and the regularizer does not change the objective function at all. Thus, instead of the \(\ell _{1}\) regularizer, we apply an \(\ell _{q}\) regularizer with \(0< q < 1\). Using this regularizer, the above problem can be written as

$$\rho^{t+1}=\underset{\rho }{\mathrm{argmin}}\ -\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}{\Vert \rho \Vert }_{q}\le 1$$

Thus, the Lagrangian function for the sparse EM takes the following form:

$$L\left(\rho ,\lambda \right)=-\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)+\lambda \left({\left(\sum\limits_{j=1}^{m}{\rho }_{j}^{q}\right)}^\frac{1}{q}-1\right)$$

Again, by writing the optimality condition for the dual problem, we have

$$\nabla {L}_{\uprho }\left(\uprho ,\uplambda \right)=0 \Rightarrow $$
$$-\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\frac{1}{{\uprho }_{k}^{t+1}}+{\uplambda }^{*}q{\left({\uprho }_{k}^{t+1}\right)}^{q-1}{\left({\left({\uprho }_{1}^{t+1}\right)}^{q}+\dots +{\left({\uprho }_{m}^{t+1}\right)}^{q}\right)}^{\frac{1}{q}-1}=0,\hspace{1em}\forall k=1,\dots ,m.$$

Moreover, according to the complementary slackness, we have:

$${\uplambda }^{*}\left[{\left(\sum\limits_{j=1}^{m}{\left({\uprho }_{j}^{t+1}\right)}^{q}\right)}^\frac{1}{q}-1\right]=0$$

It is easy to observe that \({\lambda }^{*}\ne 0\); otherwise the last two equalities have no solution. Thus, by complementary slackness, \(\sum\nolimits_{j=1}^{m}{({\rho }_{j}^{t+1})}^{q}=1\), and summing the optimality conditions over \(k\) gives \({\lambda }^{*}q=n\). This yields the following closed-form update rule:

$${\uprho }_{k}^{t+1}={\left(\frac{1}{n}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\right)}^\frac{1}{q},\hspace{1em}\forall k=1,\dots ,m.$$

We refer to \(\frac{1}{q}\) as the sparsity power. This hyper-parameter is set to \(1.03\) in the implementation of the Isoform Recovery via Convexification module. As the sparsity power increases, the abundance vector becomes sparser, and thus fewer isoforms are recovered in the end. Sparsifying the EM updates helps to further remove false-positive isoforms; however, the sparsity power should not be set too large, as this may also remove true positives.
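For completeness, the sketch below shows how the plain EM update above changes under sparsification: the only modification is raising the update to the sparsity power \(1/q\). The final renormalization step is our own addition to keep the iterate a probability vector and is not dictated by the derivation:

```python
import numpy as np

def sparse_em_abundances(alpha, sparsity_power=1.03, n_iters=100):
    """Sparse EM sketch: same E-step as the plain EM above, but the update
    is raised to the sparsity power 1/q (1.03 by default, as in I-CONVEX)."""
    n, m = alpha.shape
    rho = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        denom = alpha @ rho                      # sum_j alpha_ij rho_j
        resp = alpha * rho / denom[:, None]      # responsibilities
        rho = (resp.sum(axis=0) / n) ** sparsity_power
        rho /= rho.sum()   # renormalization: our choice, keeps rho a distribution
    return rho
```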

1.3 Supplementary Note 3: Further Analysis of MinHash and Locality Sensitive Hashing Algorithms

Definition 1:

Suppose that \({S}_{1}\) and \({S}_{2}\) are the k-mer sets of two given sequences. The Jaccard similarity of \({S}_{1}\) and \({S}_{2}\) is defined as

$$J\left({S}_{1},{S}_{2}\right)=\frac{\left|{S}_{1}\cap {S}_{2}\right|}{\left|{S}_{1}\cup {S}_{2}\right|}$$
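A small Python sketch of this definition (reusing the two sequences from the example further below) is:

```python
def kmers(seq, k):
    """Set of k-mers appearing in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(s1, s2, k):
    a, b = kmers(s1, k), kmers(s2, k)
    return len(a & b) / len(a | b)

# The two sequences from the example below share 2 of their 8 distinct 3-mers.
print(jaccard("ACCAGTC", "ACCGTCA", k=3))  # 0.25
```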

Definition 2:

Given a set of sequences \({S}_{1},\dots ,{S}_{m}\) and the set of k-mers \(S^{\prime}_{1} , \ldots ,S^{\prime}_{{N^{\prime}}}\) appearing in them, we say a matrix \(M \in \left\{ {0,1} \right\}^{{N^{\prime} \times m}}\) is a representation matrix if the element in row \(i\) and column \(j\) equals \(1\) when the i-th k-mer appears in the j-th sequence, and \(0\) otherwise.

Definition 3:

Assume the representation matrix \(M\) has \(N^{\prime}\) rows and \(m\) columns, let \({\mathcal{P}} = \left( {i_{1} ,i_{2} , \ldots ,i_{{N^{\prime}}} } \right)\) be a permutation of the set \(\left\{ {1,2, \ldots ,N^{\prime}} \right\}\), and let column \(c\) of \(M\) correspond to a sequence \(S\). Then the MinHash signature of \(S\) with respect to \(\mathcal{P}\) is defined as \({\mathrm{MinHash}}_{\mathcal{P}}(c)={i}_{j}\), where \(j\) is the smallest number in \(\left\{ {1,2, \ldots ,N^{\prime}} \right\}\) such that \(M[{i}_{j}][c]=1\).

Example:

Assume that \({S}_{1}={\text{ACCAGTC}}\) and \({S}_{2}={\text{ACCGTCA}}\). The set of 3-mers that appear at least once in \({S}_{1}\) or \({S}_{2}\) is \(\{\mathrm{ACC},\mathrm{ CCA},\mathrm{ CAG},\mathrm{ AGT},\mathrm{ GTC},\mathrm{ CCG},\mathrm{ CGT},\mathrm{ TCA}\}\), indexed from \(1\) to \(8\) in this order. Let \(\mathcal{P}=(7, 5, 2, 3, 1, 4, 6, 8)\) be a random permutation of the 3-mer indices. Since \(5\) is the first index in \(\mathcal{P}\) whose corresponding 3-mer (GTC) appears in \({S}_{1}\), the MinHash signature of \({S}_{1}\) with respect to \(\mathcal{P}\) is \(5\). By the same logic, the MinHash signature of \({S}_{2}\) with respect to \(\mathcal{P}\) is \(7\).
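The sketch below reproduces this construction and, in the spirit of Supplementary Theorem 2 that follows, estimates the Jaccard similarity from \(h\) random permutations; the mapping from 3-mers to indices is simply the sorted ordering chosen for illustration:

```python
import random

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq_kmers, kmer_index, permutations):
    """MinHash signature of one sequence: for each permutation, the first
    k-mer index whose k-mer occurs in the sequence (Definition 3)."""
    present = {kmer_index[km] for km in seq_kmers}
    return [next(i for i in perm if i in present) for perm in permutations]

k, h = 3, 200
s1, s2 = "ACCAGTC", "ACCGTCA"
all_kmers = sorted(kmers(s1, k) | kmers(s2, k))
kmer_index = {km: i + 1 for i, km in enumerate(all_kmers)}   # indices 1..N'

rng = random.Random(0)
n_prime = len(all_kmers)
perms = [rng.sample(range(1, n_prime + 1), n_prime) for _ in range(h)]

sig1 = minhash_signature(kmers(s1, k), kmer_index, perms)
sig2 = minhash_signature(kmers(s2, k), kmer_index, perms)
matches = sum(a == b for a, b in zip(sig1, sig2))
print(matches / h)   # close to the true Jaccard similarity 0.25
```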

Supplementary Theorem 2:

Let \(J\left({S}_{1},{S}_{2}\right)\) be the Jaccard similarity between two given sequences \({S}_{1}\) and \({S}_{2}\), and let \({\mathcal{P}}_{1},\dots ,{\mathcal{P}}_{h}\) be \(h\) distinct, randomly generated permutations of the set \(\{1,2,\dots ,{N}^{\prime}\}\). For each sequence, a vector of MinHash signatures of length \(h\) is computed with respect to the permutations \({\mathcal{P}}_{1},\dots ,{\mathcal{P}}_{h}\). Then \(\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{signature}}\;{\text{vectors}}\;{\text{of}}\;S_{1} \;{\text{and}}\;S_{2} }}{h}\) is an unbiased estimator of the Jaccard similarity of \({S}_{1}\) and \({S}_{2}\).

Proof:

Let \({c}_{1}\) and \({c}_{2}\) be the indices of columns in \(M\) corresponding to \({S}_{1}\) and \({S}_{2}\). Let \({R}_{1}\) denote the set of rows in which \({c}_{1}\) and \({c}_{2}\) both have the value \(1\), and \({R}_{2}\) be the set of rows in which exactly one of \({c}_{1}\) and \({c}_{2}\) is \(1\). Fix a permutation \(\mathcal{P}={\{i}_{1},{i}_{2},\dots ,{i}_{{N}^{^{\prime}}}\}\), and let \(j\) be the smallest index such that at least one of \(M\left[{i}_{j}\right]\left[{c}_{1}\right]\) or \(M\left[{i}_{j}\right]\left[{c}_{2}\right]\) is \(1\). Thus \({i}_{j }\in {R}_{1}\cup {R}_{2}\). Moreover, \({\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{1}\right)={\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{2}\right)={i}_{j}\) if and only if \({i}_{j} \in {R}_{1}\). Hence \(Pr[{\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{1}\right)={\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{2}\right)] = \frac{|{R}_{1}|}{{|{R}_{1}|+|R}_{2}|}\) which is equal to the Jaccard similarity of \({S}_{1}\) and \({S}_{2}\). Thus, we have

$$ \begin{array}{*{20}l} {{\mathbb{E}}\left[ {\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{signature}}\;{\text{vectors}}\;{\text{of}}\;S_{1} \;{\text{and}}\;S_{2} }}{h}} \right]} \hfill \\ {\quad \quad \quad \quad = \frac{{\mathop \sum \nolimits_{{k = 1}}^{h} Pr\left[ {{\text{MinHash}}_{{{\mathcal{P}}_{k} }} \left( {{\text{c}}_{1} } \right) = {\text{MinHash}}_{{{\mathcal{P}}_{{\text{k}}} }} \left( {{\text{c}}_{2} } \right)} \right]}}{h}} \hfill \\ {\quad \quad \quad \quad = \frac{{hJ\left( {S_{1} ,S_{2} } \right)}}{h} = J\left( {S_{1} ,S_{2} } \right),} \hfill \\ \end{array} $$

which means the estimator is unbiased.

Supplementary Theorem 3:

The variance of the estimator proposed in Theorem 2 is equal to \(\frac{J({S}_{1}, {S}_{2}) - {J}^{2}({S}_{1}, {S}_{2})}{h}\).

Proof:

Since, for each permutation \({\mathcal{P}}_{k}\), the probability of a match between the two MinHash signatures equals \(J({S}_{1}, {S}_{2})\), the number of matches between the signature vectors of \({S}_{1}\) and \({S}_{2}\) follows a binomial distribution with \(h\) trials (the number of permutations) and success probability \(p = J({S}_{1}, {S}_{2})\). Its variance therefore equals \(hp(1-p)\). Therefore,

$$ \begin{gathered} {\text{Var}}\left[ {\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{two}}\;{\text{signatures}}}}{h}} \right] = \frac{{{\text{Var}}\left[ {\# {\text{Matches}}\;{\text{between}}\;{\text{two}}\;{\text{signatures}}} \right]}}{{h^{2} }} \hfill \\ \quad \quad \quad \quad = \frac{{hJ\left( {S_{1} ,S_{2} } \right)\left( {1 - J\left( {S_{1} ,S_{2} } \right)} \right)}}{{h^{2} }} \hfill \\ \quad \quad \quad \quad = \frac{{J\left( {S_{1} ,S_{2} } \right)\left( {1 - J\left( {S_{1} ,S_{2} } \right)} \right)}}{h} \hfill \\ \end{gathered} $$

To obtain the MinHash signatures of the reads in a given dataset, we need \(O(NL)\) operations to form the k-mer sets of the sequences and \(O(hNL)\) operations to obtain the MinHash signatures, where \(N\) is the number of reads, \(L\) is the average length of the sequences, and \(h\) is the length of the MinHash signatures.

Choosing the Hyper-parameters of the Locality Sensitive Hashing Algorithm

Hashing the reads using MinHash significantly reduces the computational cost of comparison, since comparing the MinHash signatures of two reads takes \(O\left(h\right)\) operations, while computing the edit distance of two reads requires \(O\left({L}^{2}\right)\) operations (\(L\) can range from \(400\) to \(10{,}000\), while \(h\) can be chosen as a constant, typically less than \(100\)). Still, comparing the MinHash signatures of all \(O\left({N}^{2}\right)\) pairs of sequences is inefficient for large-scale datasets consisting of millions of sequences. The LSH algorithm copes with this issue by dividing the \(h\) entries of the MinHash signature into \(b\) bands of size \(d\) \((h = bd)\). If two sequences agree in all rows of at least one of the bands, they are considered a candidate similar pair. For the default hyper-parameters of I-CONVEX \((d=1, b=10, k=15)\), the probability of considering two sequences as similar for different values of their Jaccard similarity is depicted in Supplementary Figure 2, and is computed in the sketch below. We selected the default hyper-parameter values so that the probability of a false negative is small; however, this also leads to a higher rate of false positives. As mentioned in Sect. 3, we therefore validate the obtained candidate pairs with a convolutional neural network designed to remove false-positive edges from the similarity graph.
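A minimal sketch of this banding probability, assuming (as in Supplementary Theorem 2) that each signature row matches with probability equal to the Jaccard similarity, is:

```python
def candidate_probability(jaccard, d, b):
    # Two reads become a candidate pair iff all d rows of at least one of the
    # b bands match; each row matches with probability equal to their Jaccard
    # similarity, so the collision probability is 1 - (1 - J^d)^b.
    return 1.0 - (1.0 - jaccard ** d) ** b

# Default I-CONVEX hyper-parameters: d = 1, b = 10 (cf. Supplementary Figure 2).
for j in (0.05, 0.1, 0.3, 0.5, 0.8):
    print(j, round(candidate_probability(j, d=1, b=10), 3))
```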

Supplementary Figure 2.

The probability of considering a pair of sequences as similar, for different values of their Jaccard similarity, when d = 1 and b = 10.

1.4 Supplementary Note 4: Architecture of Convolutional Neural Network

To validate the similarity of the candidate pairs obtained from the Locality Sensitive Hashing algorithm, we trained a convolutional neural network on the training data described in the Results section. In this supplementary note, we explain the architecture of the network.

Architecture.

To train the convolutional neural network that validates the similarity of candidate pairs, for each pair in the generated training dataset (which can be similar or dissimilar), both reads are first one-hot encoded (each base corresponds to a \(4\times 1\) array). Since the length of each read in the generated dataset is 400, one-hot encoding yields two arrays of size \(4\times 400\). We then concatenate these two arrays row-wise to obtain an \(8\times 400\) array. These \(8\times 400\) arrays are the inputs to the convolutional neural network with the layers listed in Table 1.

Table 1. Structure of the convolutional neural network designed to validate the similarity of candidate pairs.

The output of the convolutional neural network for each \(8\times 400\) input is a scalar ranging between \(-1\) and \(1\) that represents the predicted similarity of the given pair. To compute the loss, let \({\widehat{y}}_{i}\) and \({y}_{i}\) be the predicted value and the true label of the i-th candidate pair, respectively. We use the \({\ell }_{2}\) loss defined as follows:

$$\ell\left(y,\widehat{y}\right)=\sum\limits_{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}$$

To optimize the weight parameters of the network, the Adam optimizer is applied to this \({\ell }_{2}\) loss.

Validation of Candidate Similar Pairs:

To validate a given candidate similar pair with the trained convolutional neural network, each read is first trimmed to length \(400\) (if a read is shorter than \(400\), we append enough ‘A’ bases to its end to reach length \(400\)). Each of the two trimmed reads is one-hot encoded into an array of size \(4\times 400\). Concatenating them yields an \(8\times 400\) array, which is passed through the convolutional neural network to produce a number in the range \([-1, +1]\). If this number is greater than \(0\), we consider the pair similar.
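The following sketch illustrates this preprocessing; the trained network itself is represented only by a hypothetical `trained_cnn` call:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(read, length=400):
    # Trim long reads and pad short ones with 'A' (as described above),
    # then encode each base as a 4x1 indicator column.
    read = (read[:length] + "A" * length)[:length]
    x = np.zeros((4, length))
    for i, base in enumerate(read):
        x[BASE_INDEX[base], i] = 1.0
    return x

def encode_pair(read1, read2):
    # Row-wise concatenation of the two encoded reads: an 8 x 400 input.
    return np.vstack([one_hot(read1), one_hot(read2)])

x = encode_pair("ACCAGTC", "ACCGTCA")
print(x.shape)                # (8, 400)
# score = trained_cnn(x)      # hypothetical trained network from Table 1
# is_similar = score > 0      # pairs with score > 0 are treated as similar
```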

1.5 Supplementary Note 5: I-CONVEX Pipeline

To run I-CONVEX on a given read dataset, first move the FASTA file to the Clustering folder and rename it to reads.fasta. Then execute the following commands in order to obtain the pre-clusters:

(Pre-clustering command listing; see the I-CONVEX repository for the exact commands.)

The result is a Cluster folder containing subfolders, each of which represents a pre-cluster. The following commands then run the clustering-via-convexification module on the pre-clusters:

(Clustering-via-convexification command listing; see the I-CONVEX repository for the exact commands.)

The last command stores the final collected transcripts in the final_transcripts.fasta file.

To run all parts of the algorithm step by step, or change the hyper-parameters such as \(k\), \(d\), and \(b\), follow the advanced version instructions on https://github.com/sinaBaharlouei/I-CONVEX.

Supplementary Figures: Distribution of Sequence Lengths in Synthetic and Real Datasets

Supplementary Figure 3.

Distribution of the read lengths and transcript abundances for the simulated and real datasets.

Supplementary Figure 4.

Histograms of sequence length of four SIRV datasets.

Supplementary Figure 5.

Histograms of transcript abundances of four SIRV datasets.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Baharlouei, S., Razaviyayn, M., Tseng, E., Tse, D. (2023). I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads. In: Koprinska, I., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2022. Communications in Computer and Information Science, vol 1753. Springer, Cham. https://doi.org/10.1007/978-3-031-23633-4_23


  • DOI: https://doi.org/10.1007/978-3-031-23633-4_23


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23632-7

  • Online ISBN: 978-3-031-23633-4
