
I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads

Conference paper

Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECML PKDD 2022)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 1753)

Abstract

Long-read sequencing technologies demonstrate high potential for de novo discovery of complex transcript isoforms, but high error rates pose a significant challenge. Existing error correction methods rely on clustering reads based on isoform-level alignment and cannot be efficiently scaled. We propose a new method, I-CONVEX, that performs fast, alignment-free isoform clustering with almost linear computational complexity, and leads to better consensus accuracy on simulated, synthetic, and real datasets.


References

  1. Steijger, T., et al.: Assessment of transcript reconstruction methods for RNA-seq. Nat. Meth. 10, 1177–1184 (2013)

  2. Kannan, S., Hui, J., Mazooji, K., Pachter, L., Tse, D.: Shannon: an information-optimal de Novo RNA-Seq assembler. Preprint at https://www.biorxiv.org/content/10.1101/039230v1 (2016)

  3. Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L., Wold, B.: Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nat. Meth. 5, 621–628 (2008)

  4. Wang, Z., Gerstein, M., Snyder, M.: RNA-Seq: a revolutionary tool for transcriptomics. Nat. Rev. Genet. 10, 57–63 (2009)

  5. Gordon, S.P., et al.: Widespread polycistronic transcripts in fungi revealed by single-molecule mRNA sequencing. PLoS ONE 10, e0132628 (2015)

  6. Wang, B., et al.: Unveiling the complexity of the maize transcriptome by single-molecule long-read sequencing. Nat. Commun. 7, 11708 (2016)

  7. Hwang, B., Lee, J.H., Bang, D.: Single-cell RNA sequencing technologies and bioinformatics pipelines. Exp. Mol. Med. 50, 96 (2018)

  8. Sahlin, K., Tomaszkiewicz, M., Makova, K.D., Medvedev, P.: Deciphering highly similar multigene family transcripts from Iso-Seq data with IsoCon. Nat. Commun. 9, 4601 (2018)

  9. Abdel-Ghany, S.E., et al.: A survey of the sorghum transcriptome using single-molecule long reads. Nat. Commun. 7, 11706 (2016)

  10. Weirather, J.L., et al.: Characterization of fusion genes and the significantly expressed fusion isoforms in breast cancer by hybrid sequencing. Nucleic Acids Res. 43, e116 (2015)

  11. Westbrook, C.J., et al.: No assembly required: full-length MHC class I allele discovery by PacBio circular consensus sequencing. Hum. Immunol. 76, 891–896 (2015)

  12. Roberts, A., Pachter, L.: Streaming fragment assignment for real-time analysis of sequencing experiments. Nat. Meth. 10, 71–73 (2013)

  13. Trapnell, C., et al.: Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nat. Biotechnol. 28, 511–515 (2010)

  14. Broder, A.Z.: On the resemblance and containment of documents. Compression Complex. Seq. 1997, 21–29 (1997)

  15. Ondov, B.D., et al.: Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 17, 1–14 (2016)

  16. Berlin, K., et al.: Assembling large genomes with single-molecule sequencing and locality-sensitive hashing. Nat. Biotechnol. 33, 623–630 (2015)

  17. Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB, vol. 99, no. 6 (1999)

  18. Tardaguila, M., et al.: SQANTI: extensive characterization of long-read transcript sequences for quality control in full-length transcriptome identification and quantification. Genome Res. 28, 396–411 (2016)

  19. Safonova, Y., et al.: IgRepertoireConstructor: a novel algorithm for antibody repertoire construction and immunoproteogenomics analysis. Bioinformatics 31, i53–i61 (2015)

  20. Gibrat, J.F.: A short note on dynamic programming in a band. BMC Bioinf. 19(1), 1–5 (2018)

  21. Clauset, A., Newman, M.E.J., Moore, C.: Finding community structure in very large networks. Phys. Rev. E 70, 066111 (2004)

  22. Tseng, E., Tang, H., AlOlaby, R.R., Hickey, L., Tassone, F.: Altered expression of the FMR1 splicing variants landscape in premutation carriers. Biochim. Biophys. Acta 1860, 1117–1126 (2017)

  23. Steeb, W., Hardy, Y.: Matrix Calculus and Kronecker Product: A Practical Approach to Linear and Multilinear Algebra, 2nd edn. World Scientific Publishing Company, Singapore (2011)

  24. Bertsekas, D.P.: Nonlinear programming. J. Oper. Res. Soc. 48, 334 (1997)

  25. Tibshirani, R.: Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Methodol. 58, 267–288 (1996)


Author information

Correspondence to Sina Baharlouei.


Appendix

1.1 Supplementary Note 1: Identifiability of de novo Transcriptome Recovery from Long Reads

A family of distributions \(P = \{{P}_{\theta }| \theta \in \Theta \}\) parameterized by a vector \(\theta \) is called identifiable if it satisfies the following condition:

$$ P_{{{\uptheta }_{1} }} = P_{{{\uptheta }_{2} }} \Rightarrow {\uptheta }_{1} = {\uptheta }_{2} \quad \forall {\uptheta }_{1} ,{\uptheta }_{2} \in {\Theta }, $$

This means that, with enough observations from one of the distributions in the family, one can determine the ground-truth value of \(\theta \). For the de novo transcriptome recovery problem, \(\theta \) is defined as the set of unknown transcripts/sequences \(S = \{{s}_{1}, {s}_{2}, ..., {s}_{m}\}\) and their corresponding abundances \(\rho = \{{\rho }_{1}, {\rho }_{2}, ..., {\rho }_{m}\}\). For a given read \(r\), we define \({P}_{\theta }(r)\) as the probability of observing \(r\) given \(\theta = (S, \rho )\). We assume that a read is generated from a sequence by adding insertion, deletion, and substitution errors through an i.i.d. process that does not depend on the location of the error. For simplicity, let \({\delta }_{s}, {\delta }_{i},\) and \({\delta }_{d}\) be the probabilities of observing a substitution, insertion, or deletion error at each location of the sequence, respectively. In Theorem 1, we prove that the problem of de novo transcriptome recovery is identifiable.
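The following minimal Python sketch illustrates this generative model; the channel order (substitution, then deletion, then insertion) follows Supplementary Figure 1 below, and the error rates and example transcript are arbitrary illustrative values:

```python
import random

BASES = "ACGT"

def simulate_read(seq, d_sub, d_ins, d_del, rng=random.Random(0)):
    """Apply the i.i.d. substitution, deletion, and insertion channels
    (in that order) to a transcript sequence and return a noisy read."""
    # Substitution channel: each base is replaced by one of the other
    # three bases with total probability d_sub.
    out = []
    for b in seq:
        if rng.random() < d_sub:
            out.append(rng.choice([c for c in BASES if c != b]))
        else:
            out.append(b)
    # Deletion channel: each base is dropped independently with probability d_del.
    out = [b for b in out if rng.random() >= d_del]
    # Insertion channel: after each surviving base, a uniformly random base
    # is inserted with probability d_ins (one simple i.i.d. realization).
    read = []
    for b in out:
        read.append(b)
        if rng.random() < d_ins:
            read.append(rng.choice(BASES))
    return "".join(read)

# Example: draw one noisy read from a single transcript.
print(simulate_read("ACCAGTCGGTTACG", d_sub=0.05, d_ins=0.03, d_del=0.03))
```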

Supplementary Theorem 1 (Long-Read Sequencing Identifiability):

Assume that the set of sequences \(S = \{{s}_{1}, {s}_{2}, ..., {s}_{m}\}\) with the corresponding abundances \(\rho = \{{\rho }_{1}, {\rho }_{2}, ..., {\rho }_{m}\}\) is unknown. Suppose that, for any read \(r\), \({P}_{S, \rho }(r)\) denotes the probability of observing read \(r\) given that it is generated from a sequence in \(S\) with the abundance vector \(\rho \). Then, given \({P}_{S, \rho }(r)\) for all reads \(r\), both \(S\) and \(\rho \) can be exactly recovered whenever the substitution error parameter \({\delta }_{s} \ne \frac{3}{4}\).

Proof:

Each read \(r\) with non-zero \(P(r)\) can be obtained by adding insertions, deletions, and substitutions to one or several sequences in \(S\). Without loss of generality, we can model the transcriptome recovery problem as denoising the reads obtained by applying three noise channels to the set of sequences in \(S\) sequentially (Supplementary Figure 1). In this model, \({P}_{1}(r)\) is the distribution of reads after the substitution channel, and \({P}_{2}(r)\) is the distribution of reads after applying the deletion channel (and before the insertion channel). Lastly, \(P(r)\) is the final probability distribution observed after applying the insertion noise. If we prove that transcriptome recovery is identifiable under each of these three channels (insertion, deletion, and substitution) individually, then the entire problem is identifiable. This is because, given the probability of observing reads \(P(r)\), we can recover \({P}_{2}(r)\) by denoising the effect of the insertion channel. Then, by the same argument, we can recover \({P}_{1}(r)\) by denoising the effect of the deletion channel. Finally, we can find the vector \(\rho \) by denoising the effect of the substitution channel on the probability vector \({P}_{1}\). Thus, it suffices to show that \({P}_{2}(r), {P}_{1}(r),\) and \(\rho \) can be recovered sequentially given \(P(r)\). In what follows, we show that each of these three channels is identifiable.

Supplementary Figure 1.

Each channel affects the probability vector of reads from the previous step. To prove the identifiability of the transcriptome recovery problem, it suffices to denoise the effect of the channels sequentially from last to first.

Identifiability Under the Substitution Channel:

For the substitution channel, observe that a read \(r\) of length \(L\) can only be generated from a sequence of length \(L\), since substitutions do not change the sequence length. Thus, for estimating the abundances of the sequences of length \(L\), we can focus only on the set of reads of length \(L\). Hence, without loss of generality, we can assume that the reads and sequences are all of length \(L\). Moreover, without loss of generality, assume that \(S\) consists of all \({4}^{L}\) possible transcripts of length \(L\), with possibly zero abundances. Thus, the problem reduces to recovering the vector \(\rho \), given \(P(r)\) and \(S\). Let \({\delta }_{s}=\delta \) be the substitution error rate. Then the probability of observing the read \({r}_{i}\) can be written as follows:

$$P\left({r}_{i}\right)=\sum\limits_{j=1}^{m}{\rho }_{j}P\left({r}_{i}|{s}_{j}\right)=\sum\limits_{j=1}^{m}{\rho }_{j}{\left(\frac{\delta }{3}\right)}^{d\left({r}_{i},{s}_{j}\right)}{\left(1-\delta \right)}^{L-d\left({r}_{i},{s}_{j}\right)}$$

Here \(d({r}_{i},{s}_{j})\) is the number of mismatches between \({r}_{i}\) and \({s}_{j}\). This relationship can be written in matrix form. For example, when \(L=1\), let the abundances of ‘A’, ‘C’, ‘G’, and ‘T’ be \({\rho }_{1}, {\rho }_{2}, {\rho }_{3},\) and \({\rho }_{4}\), respectively. Then the probability of observing each of {‘A’, ‘C’, ‘G’, ‘T’} is given by the following equation:

$$\left[\begin{array}{c}P\left({r}_{1}\right)\\ P\left({r}_{2}\right)\\ P\left({r}_{3}\right)\\ P\left({r}_{4}\right)\end{array}\right]= A\left[\begin{array}{c}{\uprho }_{1}\\ {\uprho }_{2}\\ {\uprho }_{3}\\ {\uprho }_{4}\end{array}\right]$$

where

$$ A = \left[ {\begin{array}{*{20}c} {1 - \delta } & {\delta /3} & {\delta /3} & {\delta /3} \\ {\delta /3} & {1 - \delta } & {\delta /3} & {\delta /3} \\ {\delta /3} & {\delta /3} & {1 - \delta } & {\delta /3} \\ {\delta /3} & {\delta /3} & {\delta /3} & {1 - \delta } \\ \end{array} } \right] $$

Thus, the identifiability problem in this case reduces to showing the invertibility of this matrix. The determinant of \(A\) can be computed as

$$ \det (A) = \left( 1 - \frac{4\delta }{3} \right)^{3} . $$

One can easily check that this polynomial is non-zero for all \(\delta \in [0, 1]\) except \(\delta = \frac{3}{4}\), which implies the invertibility of the matrix for all \(\delta \in [0, 1]\) except \(\delta = \frac{3}{4}\).
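The following short NumPy sketch numerically checks this determinant formula for a few illustrative values of \(\delta\):

```python
import numpy as np

def substitution_matrix(delta):
    """4x4 transition matrix A of the per-base substitution channel."""
    A = np.full((4, 4), delta / 3.0)
    np.fill_diagonal(A, 1.0 - delta)
    return A

for delta in (0.0, 0.1, 0.5, 0.75, 0.9):
    A = substitution_matrix(delta)
    print(delta, np.linalg.det(A), (1.0 - 4.0 * delta / 3.0) ** 3)
# Only delta = 3/4 makes the determinant vanish, i.e. makes A non-invertible.
```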

Now, let us consider the case of general length \(L>1\). Assume that we order the set of all possible sequences of length \(L\) (all \(m={4}^{L}\) of them) as numbers in base \(4\). In this case, the probability transition relation can be written as

$$ \left[ {\begin{array}{*{20}c} {P\left( {r_{1} } \right)} \\ {P\left( {r_{2} } \right)} \\ \ldots \\ {P\left( {r_{m} } \right)} \\ \end{array} } \right] = \underbrace {A \otimes \ldots \otimes A}_{{L\:{\text{times}}}}\left[ {\begin{array}{*{20}c} {{\uprho }_{1} } \\ {{\uprho }_{2} } \\ \ldots \\ {{\uprho }_{m} } \\ \end{array} } \right] $$

where the Kronecker product of two matrices \({A}_{(m \times n)}\) and \({B}_{(p \times q)}\) is a matrix \({C}_{((m\times p) \times (n\times q))}\), defined as:

$$C=A\otimes B=\left[\begin{array}{cccc}{a}_{11}B& {a}_{12}B& ...& {a}_{1n}B\\ {a}_{21}B& {a}_{22}B& ...& {a}_{2n}B\\ ...& ...& ...& ...\\ {a}_{m1}B& {a}_{m2}B& ...& {a}_{mn}B\end{array}\right]$$

The Kronecker product of \(L\) matrices is invertible if and only if each one of them is invertible [23]. We have already shown that \(A\) is an invertible matrix when \({\delta }_{s}\ne \frac{3}{4}\). Thus, \(A \otimes A\otimes ... \otimes A\) is invertible as well, which shows the problem is identifiable when the substitution channel is applied.
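The toy NumPy sketch below illustrates this argument for \(L = 2\): it forms the \(4^{L} \times 4^{L}\) Kronecker transition matrix, pushes a randomly drawn abundance vector through the substitution channel, and recovers it exactly by solving the resulting linear system (the abundances and \(\delta\) are arbitrary illustrative values):

```python
import numpy as np
from functools import reduce

def substitution_matrix(delta):
    # Per-base substitution channel: keep a base w.p. 1 - delta, otherwise
    # replace it with one of the other three bases uniformly at random.
    A = np.full((4, 4), delta / 3.0)
    np.fill_diagonal(A, 1.0 - delta)
    return A

L, delta = 2, 0.1
T = reduce(np.kron, [substitution_matrix(delta)] * L)   # 4^L x 4^L transition matrix

rng = np.random.default_rng(0)
rho = rng.dirichlet(np.ones(4 ** L))   # ground-truth abundances over all length-L sequences
P = T @ rho                            # read distribution after the substitution channel
rho_hat = np.linalg.solve(T, P)        # exact recovery: T is invertible since delta != 3/4
print(np.allclose(rho_hat, rho))       # True
```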

Insertion Channel:

Let \(R\) be the support of \(P\) (\(R = \{r: P(r) > 0\}\)). We define the operator \(\le \) on a pair of reads as follows: \({r}_{1 }\le {r}_{2}\) if \({r}_{2}\) can be obtained by inserting zero or more bases into \({r}_{1}\) at arbitrary locations. It is easy to observe that \((R, \le )\) forms a partially ordered set (poset), since \(\le \) satisfies reflexivity, anti-symmetry, and transitivity. We have the following observations:

  • If \(r^{\prime} \le r\) and \(r \ne r^{\prime}\), then \(\left| {r^{\prime}} \right| < \left| r \right|\).

  • R has at least one minimal element.

  • Any minimal element of \(R\) belongs to \(S\).

To see the correctness of the first observation, assume that \(r^{\prime} \le r\) and \(r^{\prime} \ne r\), which means \(r\) is obtained by adding zero or more characters to \(r^{\prime}\). However, since \(r^{\prime} \ne r\), the number of added characters cannot be zero; thus, at least one character is added to \(r^{\prime}\) to obtain \(r\). This means that \(\left| {r^{\prime}} \right| < \left| r \right|\).

To prove the second observation, let \(r_{min}\) be an element of \(R\) with minimum length (it is not necessarily unique). Such an element exists, because \(\left| r \right| \ge 1\) for all \(r \in R\). We claim that \(r_{min}\) is a minimal element and prove it by contradiction. Assume that \(r_{min}\) is not a minimal element. Then there exists an element \(r^{\prime} \ne r_{min}\) in \(R\) such that \(r^{\prime} \le r_{min}\). According to the first observation, \(\left| r^{\prime} \right| < \left| {r_{min} } \right|\), which contradicts the fact that \(r_{min}\) is an element of minimum length in \(R\).

Finally, we prove that every minimal element of \(R\) belongs to \(S\), again by contradiction. Assume that \({r}_{min}\) is a minimal element that does not belong to \(S\). Since \(P({r}_{min})>0\) and the insertion channel only adds bases, \({r}_{min}\) must be obtained from some sequence in \(S\) by insertions; hence there exists an element \(s\in S\) such that \(s \le {r}_{min}\), and \(s \ne {r}_{min}\) because \({r}_{min}\notin S\). Clearly \(s\) belongs to \(R\), since \(P(s)\ge P(s|s)={\rho }_{s}(1 - {\delta }_{i})^{|s|}>0\). Therefore, \(s\) is an element of \(R\) that is not equal to \({r}_{min}\) and satisfies \(s\le {r}_{min}\), which contradicts the fact that \({r}_{min}\) is a minimal element of \(R\).
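The following sketch illustrates the partial order and the minimal elements used above: \(r^{\prime} \le r\) under the insertion order exactly when \(r^{\prime}\) is a subsequence of \(r\) (the example reads are arbitrary):

```python
def precedes(r1, r2):
    """r1 <= r2 in the insertion order: r2 can be obtained from r1 by
    inserting zero or more bases, i.e. r1 is a subsequence of r2."""
    it = iter(r2)
    return all(base in it for base in r1)   # consumes `it` left to right

def minimal_elements(reads):
    """Reads that no other (distinct) read precedes; by the observations
    above, every minimal element of the support R belongs to S."""
    return [r for r in reads
            if not any(precedes(q, r) for q in reads if q != r)]

R = ["ACGT", "ACGGT", "TACGT", "CCA", "CCTA"]
print(minimal_elements(R))  # ['ACGT', 'CCA']
```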

Now, we prove the identifiability of the transcriptome recovery problem under the insertion channel by induction on the size of \(S\).

For \(|S|=1\), according to the third observation, the minimal member of \(R\) belongs to \(S\), and its abundance is clearly \(1\).

Based on the induction hypothesis, assume that the problem is identifiable when \(|S|=m\). We want to prove that it is identifiable when \(|S|=m+1\). Based on the third observation, any minimal element \({r}_{min}\) of \(R\) belongs to \(S\). Set \(S^{\prime} = S - \left\{ {r_{min} } \right\}\) and update the probability vector of the reads accordingly: \(P^{\prime}\left( r \right) = P\left( r \right) - P(r|r_{min} )\rho_{{r_{min} }}\). To update the probability vector of the reads, we need to estimate \({\rho }_{{r}_{min}}\) exactly. Since \({r}_{min}\) is a minimal element, it can only be obtained (with zero insertions) from a sequence identical to itself. Hence, \(P({r}_{min}) = {\rho }_{{r}_{min}}(1-{\delta }_{i})^{|{r}_{min}|}\), which means \({\rho }_{{r}_{min}}=\frac{P({r}_{min})}{(1-{\delta }_{i})^{|{r}_{min}|}}.\) By the induction hypothesis, since \(\left| {S^{\prime}} \right| = m\), we can recover \(S^{\prime}\) using \(P^{\prime}\). Thus, \(S = S^{\prime} \cup \left\{ {r_{min} } \right\}\), which means we can recover \(S\) exactly.

Deletion Channel:

The argument is similar to that for the insertion channel. The only difference is that the operator \(\le \) is now defined as follows: \({r}_{1 }\le {r}_{2}\) if \({r}_{2}\) can be obtained by deleting zero or more bases from \({r}_{1}\). Consequently, a minimal element of \(R\) is a longest read in this case.

1.2 Supplementary Note 2: Sparse EM Algorithm Inside the Isoform Recovery via Convexification

As discussed in Sect. 3, the maximum likelihood estimator of the abundance vector given the set of transcripts and reads can be written as:

$${\widehat{\rho }}_{ML}=\underset{\rho }{\mathrm{argmax}} \sum\limits_{i=1}^{n}log\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\uprho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\uprho }_{j}=1$$

Or equivalently:

$${\widehat{\rho }}_{ML}=\underset{\rho }{\mathrm{argmin}} \sum\limits_{i=1}^{n}-log\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\rho }_{j}=1$$

Applying the Expectation Maximization (EM) algorithm, at each iteration we minimize a tight local upper bound of the above objective function (note that the second term below is evaluated at \(\rho^{t}\) and is therefore constant in \(\rho \)):

$$\rho^{t+1}=\underset{\rho }{\mathrm{argmin}}\ -\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}\sum\limits_{j=1}^{m}{\rho }_{j}=1$$

The Lagrangian function of the above problem can be written as:

$$L\left(\rho ,\lambda \right)=-\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)+\lambda \left(\sum\limits_{j=1}^{m}{\rho }_{j}-1\right)$$

And the dual problem takes the form of:

$$\underset{\uplambda }{\mathrm{max}}\ \underset{\uprho }{\mathrm{min}}\ L\left(\uprho ,\uplambda \right)$$

Since the objective is convex in \(\rho \) and the constraints are linear, strong duality holds [24]. Thus, by setting the gradient of \(L\) with respect to \(\rho \) to \(0\), we obtain:

$$-\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\frac{1}{{\uprho }_{k}}+{\uplambda }^{*}=0,\hspace{1em}\forall k=1,\dots ,m.$$

which means:

$${\uprho }_{k}^{t+1}=\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\hspace{1em}\forall k=1,\dots ,m.$$

Since \(\sum\nolimits_{j=1}^{m}{{\rho }_{j}^{t+1}}= 1\), we have:

$$1=\sum\limits_{k=1}^{m}{\uprho }_{k}^{t+1}=\sum\limits_{k=1}^{m}\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}=\frac{1}{{\uplambda }^{*}}\sum\limits_{i=1}^{n}\frac{\sum\nolimits_{k=1}^{m}{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}=\frac{n}{{\uplambda }^{*}}$$

Therefore \({\lambda }^{*}= n\), and

$${\uprho }_{k}^{t+1}=\frac{1}{n}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\hspace{1em}\forall k=1,\dots ,m.$$

This update rule is similar to the one used by Express [12] to obtain the abundance vector.
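A minimal NumPy sketch of this EM update is given below; the \(\alpha_{ij}\) values in the toy example are made up for illustration:

```python
import numpy as np

def em_abundances(alpha, n_iters=100):
    """EM iterations for the abundance vector rho.

    alpha: (n, m) array with alpha[i, j] = P(read i | transcript j).
    Implements rho_k^{t+1} = (1/n) * sum_i alpha_ik rho_k^t / sum_j alpha_ij rho_j^t.
    """
    n, m = alpha.shape
    rho = np.full(m, 1.0 / m)                 # uniform initialization
    for _ in range(n_iters):
        denom = alpha @ rho                   # shape (n,): sum_j alpha_ij rho_j
        resp = alpha * rho / denom[:, None]   # responsibility of each transcript per read
        rho = resp.sum(axis=0) / n            # closed-form update derived above
    return rho

# Toy example: 3 transcripts, 5 reads with made-up likelihoods.
alpha = np.array([[0.9, 0.1, 0.0],
                  [0.8, 0.2, 0.0],
                  [0.1, 0.9, 0.0],
                  [0.0, 0.2, 0.8],
                  [0.0, 0.1, 0.9]])
print(em_abundances(alpha))
```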

Sparsification.

One naïve approach to imposing sparsity on the estimated abundances in the above problem is to use the \(\ell _{1}\) regularizer [25]. However, the abundances are non-negative and sum to \(1\), so the \(\ell _{1}\) norm of \(\rho \) is constant and the regularizer does not change the objective function at all. Thus, instead of the \(\ell _{1}\) regularizer, we apply an \(\ell _{q}\) regularizer with \(0< q < 1\). Using this regularizer, the above problem can be written as

$$\rho^{t+1}=\underset{\rho }{\mathrm{argmin}}\ -\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)\hspace{1em}\mathrm{subject\ to}\hspace{1em}\rho \ge 0,\hspace{1em}{\Vert \rho \Vert }_{q}\le 1$$

Thus, the Lagrangian function for the sparse EM takes the following form:

$$L\left(\rho ,\lambda \right)=-\sum\limits_{i=1}^{n}\sum\limits_{j=1}^{m}\frac{{\alpha }_{ij}{\rho }_{j}^{t}}{\sum\nolimits_{{j}^{\prime}=1}^{m}{\alpha }_{i{j}^{\prime}}{\rho }_{{j}^{\prime}}^{t}}\mathrm{log}\left(\frac{{\rho }_{j}}{{\rho }_{j}^{t}}\right)-\sum\limits_{i=1}^{n}\mathrm{log}\left(\sum\limits_{j=1}^{m}{\alpha }_{ij}{\rho }_{j}^{t}\right)+\lambda \left({\left(\sum\limits_{j=1}^{m}{\rho }_{j}^{q}\right)}^\frac{1}{q}-1\right)$$

Again, by writing the optimality condition for the dual problem, we have

$$\nabla {L}_{\uprho }\left(\uprho ,\uplambda \right)=0 \Rightarrow $$
$$-\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\frac{1}{{\uprho }_{k}^{t+1}}+{\uplambda }^{*}q{\left({\uprho }_{k}^{t+1}\right)}^{q-1}{\left({\left({\uprho }_{1}^{t+1}\right)}^{q}+\dots +{\left({\uprho }_{m}^{t+1}\right)}^{q}\right)}^{\frac{1}{q}-1}=0,\hspace{1em}\forall k=1,\dots ,m.$$

Moreover, according to the complementary slackness, we have:

$${\uplambda }^{*}\left[{\left(\sum\limits_{j=1}^{m}{\left({\uprho }_{j}^{t+1}\right)}^{q}\right)}^\frac{1}{q}-1\right]=0$$

It is easy to observe that \({\lambda }^{*}\ne 0\); otherwise the last two equalities have no solution. Thus, by complementary slackness, \(\sum\nolimits_{j=1}^{m}{({\rho }_{j}^{t+1})}^{q}=1\), and summing the optimality conditions over \(k\) gives \({\lambda }^{*}q=n\). This yields the following closed-form update rule:

$${\uprho }_{k}^{t+1}={\left(\frac{1}{n}\sum\limits_{i=1}^{n}\frac{{\mathrm{\alpha }}_{ik}{\uprho }_{k}^{t}}{\sum\nolimits_{{j}^{^{\prime}}=1}^{m}{\mathrm{\alpha }}_{i{j}^{^{\prime}}}{\uprho }_{{j}^{^{\prime}}}^{t}}\right)}^\frac{1}{q},\hspace{1em}\forall k=1,\dots ,m.$$

We refer to \(\frac{1}{q}\) as the sparsity power. This hyper-parameter is set to \(1.03\) in the implementation of the Isoform Recovery via Convexification module. As the sparsity power increases, the abundance vector becomes sparser, and thus fewer isoforms are recovered in the end. Sparsifying the EM updates helps to further remove false-positive isoforms; however, the sparsity power should not be set too large, as this may also remove true positives.
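For completeness, the sketch below shows how the plain EM update above changes under sparsification: the only modification is raising the update to the sparsity power \(1/q\). The final renormalization step is our own addition to keep the iterate a probability vector and is not dictated by the derivation:

```python
import numpy as np

def sparse_em_abundances(alpha, sparsity_power=1.03, n_iters=100):
    """Sparse EM sketch: same E-step as the plain EM above, but the update
    is raised to the sparsity power 1/q (1.03 by default, as in I-CONVEX)."""
    n, m = alpha.shape
    rho = np.full(m, 1.0 / m)
    for _ in range(n_iters):
        denom = alpha @ rho                      # sum_j alpha_ij rho_j
        resp = alpha * rho / denom[:, None]      # responsibilities
        rho = (resp.sum(axis=0) / n) ** sparsity_power
        rho /= rho.sum()   # renormalization: our choice, keeps rho a distribution
    return rho
```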

1.3 Supplementary Note 3: Further Analysis of MinHash and Locality Sensitive Hashing Algorithms

Definition 1:

Suppose that \({S}_{1}\) and \({S}_{2}\) are the k-mer sets of two given sequences. The Jaccard similarity of \({S}_{1}\) and \({S}_{2}\) is defined as

$$J\left({S}_{1},{S}_{2}\right)=\frac{\left|{S}_{1}\cap {S}_{2}\right|}{\left|{S}_{1}\cup {S}_{2}\right|}$$
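A small Python sketch of this definition (reusing the two sequences from the example further below) is:

```python
def kmers(seq, k):
    """Set of k-mers appearing in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def jaccard(s1, s2, k):
    a, b = kmers(s1, k), kmers(s2, k)
    return len(a & b) / len(a | b)

# The two sequences from the example below share 2 of their 8 distinct 3-mers.
print(jaccard("ACCAGTC", "ACCGTCA", k=3))  # 0.25
```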

Definition 2:

Given a set of sequences \({S}_{1},\dots ,{S}_{m}\) and the set of k-mers \(S^{\prime}_{1} , \ldots ,S^{\prime}_{{N^{\prime}}}\) appearing in them, we say a matrix \(M \in \left\{ {0,1} \right\}^{{N^{\prime} \times m}}\) is a representation matrix if the element in row \(i\) and column \(j\) equals \(1\) when the i-th k-mer appears in the j-th sequence, and \(0\) otherwise.

Definition 3:

Assume the representation matrix \(M\) has \(N^{\prime}\) rows and \(m\) columns, let \({\mathcal{P}} = \left( {i_{1} ,i_{2} , \ldots ,i_{{N^{\prime}}} } \right)\) be a permutation of the set \(\left\{ {1,2, \ldots ,N^{\prime}} \right\}\), and let column \(c\) of \(M\) correspond to a sequence \(S\). Then the MinHash signature of \(S\) with respect to \(\mathcal{P}\) is defined as \({\mathrm{MinHash}}_{\mathcal{P}}(c)={i}_{j}\), where \(j\) is the smallest number in \(\left\{ {1,2, \ldots ,N^{\prime}} \right\}\) such that \(M[{i}_{j}][c]=1\).

Example:

Assume that \({S}_{1}={\text{ACCAGTC}}\) and \({S}_{2}={\text{ACCGTCA}}\). The set of 3-mers that appear at least once in \({S}_{1}\) or \({S}_{2}\) is \(\{\mathrm{ACC},\mathrm{ CCA},\mathrm{ CAG},\mathrm{ AGT},\mathrm{ GTC},\mathrm{ CCG},\mathrm{ CGT},\mathrm{ TCA}\}\), indexed from \(1\) to \(8\) in this order. Let \(\mathcal{P}=(7, 5, 2, 3, 1, 4, 6, 8)\) be a random permutation of the 3-mer indices. Since \(5\) is the first index in \(\mathcal{P}\) whose corresponding 3-mer (GTC) appears in \({S}_{1}\), the MinHash signature of \({S}_{1}\) with respect to \(\mathcal{P}\) is \(5\). By the same logic, the MinHash signature of \({S}_{2}\) with respect to \(\mathcal{P}\) is \(7\).
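The sketch below reproduces this construction and, in the spirit of Supplementary Theorem 2 that follows, estimates the Jaccard similarity from \(h\) random permutations; the mapping from 3-mers to indices is simply the sorted ordering chosen for illustration:

```python
import random

def kmers(seq, k):
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def minhash_signature(seq_kmers, kmer_index, permutations):
    """MinHash signature of one sequence: for each permutation, the first
    k-mer index whose k-mer occurs in the sequence (Definition 3)."""
    present = {kmer_index[km] for km in seq_kmers}
    return [next(i for i in perm if i in present) for perm in permutations]

k, h = 3, 200
s1, s2 = "ACCAGTC", "ACCGTCA"
all_kmers = sorted(kmers(s1, k) | kmers(s2, k))
kmer_index = {km: i + 1 for i, km in enumerate(all_kmers)}   # indices 1..N'

rng = random.Random(0)
n_prime = len(all_kmers)
perms = [rng.sample(range(1, n_prime + 1), n_prime) for _ in range(h)]

sig1 = minhash_signature(kmers(s1, k), kmer_index, perms)
sig2 = minhash_signature(kmers(s2, k), kmer_index, perms)
matches = sum(a == b for a, b in zip(sig1, sig2))
print(matches / h)   # close to the true Jaccard similarity 0.25
```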

Supplementary Theorem 2:

Let \(J\left({S}_{1},{S}_{2}\right)\) be the Jaccard similarity between two given sequences \({S}_{1}\) and \({S}_{2}\), and let \({\mathcal{P}}_{1},\dots ,{\mathcal{P}}_{h}\) be \(h\) distinct, randomly generated permutations of the set \(\{1,2,\dots ,{N}^{\prime}\}\). For each sequence, a vector of MinHash signatures of length \(h\) is computed with respect to the permutations \({\mathcal{P}}_{1},\dots ,{\mathcal{P}}_{h}\). Then \(\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{signature}}\;{\text{vectors}}\;{\text{of}}\;S_{1} \;{\text{and}}\;S_{2} }}{h}\) is an unbiased estimator of the Jaccard similarity of \({S}_{1}\) and \({S}_{2}\).

Proof:

Let \({c}_{1}\) and \({c}_{2}\) be the indices of columns in \(M\) corresponding to \({S}_{1}\) and \({S}_{2}\). Let \({R}_{1}\) denote the set of rows in which \({c}_{1}\) and \({c}_{2}\) both have the value \(1\), and \({R}_{2}\) be the set of rows in which exactly one of \({c}_{1}\) and \({c}_{2}\) is \(1\). Fix a permutation \(\mathcal{P}={\{i}_{1},{i}_{2},\dots ,{i}_{{N}^{^{\prime}}}\}\), and let \(j\) be the smallest index such that at least one of \(M\left[{i}_{j}\right]\left[{c}_{1}\right]\) or \(M\left[{i}_{j}\right]\left[{c}_{2}\right]\) is \(1\). Thus \({i}_{j }\in {R}_{1}\cup {R}_{2}\). Moreover, \({\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{1}\right)={\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{2}\right)={i}_{j}\) if and only if \({i}_{j} \in {R}_{1}\). Hence \(Pr[{\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{1}\right)={\mathrm{MinHash}}_{\mathcal{P}}\left({\mathrm{c}}_{2}\right)] = \frac{|{R}_{1}|}{{|{R}_{1}|+|R}_{2}|}\) which is equal to the Jaccard similarity of \({S}_{1}\) and \({S}_{2}\). Thus, we have

$$ \begin{array}{*{20}l} {{\mathbb{E}}\left[ {\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{signature}}\;{\text{vectors}}\;{\text{of}}\;S_{1} \;{\text{and}}\;S_{2} }}{h}} \right]} \hfill \\ {\quad \quad \quad \quad = \frac{{\mathop \sum \nolimits_{{k = 1}}^{h} Pr\left[ {{\text{MinHash}}_{{{\mathcal{P}}_{k} }} \left( {{\text{c}}_{1} } \right) = {\text{MinHash}}_{{{\mathcal{P}}_{{\text{k}}} }} \left( {{\text{c}}_{2} } \right)} \right]}}{h}} \hfill \\ {\quad \quad \quad \quad = \frac{{hJ\left( {S_{1} ,S_{2} } \right)}}{h} = J\left( {S_{1} ,S_{2} } \right),} \hfill \\ \end{array} $$

which means the estimator is unbiased.

Supplementary Theorem 3:

The variance of the estimator proposed in Theorem 2 is equal to \(\frac{J({S}_{1}, {S}_{2}) - {J}^{2}({S}_{1}, {S}_{2})}{h}\).

Proof:

Since, for each permutation \({\mathcal{P}}_{k}\), the probability of a match between the two MinHash signatures equals \(J({S}_{1}, {S}_{2})\), the number of matches between the signature vectors of \({S}_{1}\) and \({S}_{2}\) follows a binomial distribution with \(h\) trials (the number of permutations) and success probability \(p = J({S}_{1}, {S}_{2})\). Its variance therefore equals \(hp(1-p)\). Therefore,

$$ \begin{gathered} {\text{Var}}\left[ {\frac{{\# {\text{Matches}}\;{\text{between}}\;{\text{two}}\;{\text{signatures}}}}{h}} \right] = \frac{{{\text{Var}}\left[ {\# {\text{Matches}}\;{\text{between}}\;{\text{two}}\;{\text{signatures}}} \right]}}{{h^{2} }} \hfill \\ \quad \quad \quad \quad = \frac{{hJ\left( {S_{1} ,S_{2} } \right)\left( {1 - J\left( {S_{1} ,S_{2} } \right)} \right)}}{{h^{2} }} \hfill \\ \quad \quad \quad \quad = \frac{{J\left( {S_{1} ,S_{2} } \right)\left( {1 - J\left( {S_{1} ,S_{2} } \right)} \right)}}{h} \hfill \\ \end{gathered} $$

To obtain the MinHash signatures of the reads in a given dataset, we need \(O(NL)\) operations to form the k-mer sets of the sequences and \(O(hNL)\) operations to obtain the MinHash signatures, where \(N\) is the number of reads, \(L\) is the average length of the sequences, and \(h\) is the length of the MinHash signatures.

Choosing the Hyper-parameters of the Locality Sensitive Hashing Algorithm

Hashing the reads using MinHash significantly reduces the computational cost of comparison, since comparing the MinHash signatures of two reads takes \(O\left(h\right)\) operations, while computing the edit distance of two reads requires \(O\left({L}^{2}\right)\) operations (\(L\) can range from \(400\) to \(10{,}000\), while \(h\) can be chosen as a constant, typically less than \(100\)). Still, comparing the MinHash signatures of all \(O\left({N}^{2}\right)\) pairs of sequences is inefficient for large-scale datasets consisting of millions of sequences. The LSH algorithm copes with this issue by dividing the \(h\) entries of the MinHash signature into \(b\) bands of size \(d\) \((h = bd)\). If two sequences agree in all rows of at least one of the bands, they are considered a candidate similar pair. For the default hyper-parameters of I-CONVEX \((d=1, b=10, k=15)\), the probability of considering two sequences as similar for different values of their Jaccard similarity is depicted in Supplementary Figure 2, and is computed in the sketch below. We selected the default hyper-parameter values so that the probability of a false negative is small; however, this also leads to a higher rate of false positives. As mentioned in Sect. 3, we therefore validate the obtained candidate pairs with a convolutional neural network designed to remove false-positive edges from the similarity graph.
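A minimal sketch of this banding probability, assuming (as in Supplementary Theorem 2) that each signature row matches with probability equal to the Jaccard similarity, is:

```python
def candidate_probability(jaccard, d, b):
    # Two reads become a candidate pair iff all d rows of at least one of the
    # b bands match; each row matches with probability equal to their Jaccard
    # similarity, so the collision probability is 1 - (1 - J^d)^b.
    return 1.0 - (1.0 - jaccard ** d) ** b

# Default I-CONVEX hyper-parameters: d = 1, b = 10 (cf. Supplementary Figure 2).
for j in (0.05, 0.1, 0.3, 0.5, 0.8):
    print(j, round(candidate_probability(j, d=1, b=10), 3))
```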

Supplementary Figure 2.

The probability of considering a pair of sequences as similar, for different values of their Jaccard similarity, when d = 1 and b = 10.

1.4 Supplementary Note 4: Architecture of Convolutional Neural Network

To validate the similarity of the candidate pairs obtained from the Locality Sensitive Hashing algorithm, we trained a convolutional neural network on the training data described in the Results section. In this supplementary note, we explain the architecture of the network.

Architecture.

To train the convolutional neural network that validates the similarity of candidate pairs, for each pair in the generated training dataset (which can be similar or dissimilar), both reads are first one-hot encoded (each base corresponds to a \(4\times 1\) array). Since the length of each read in the generated dataset is 400, one-hot encoding yields two arrays of size \(4\times 400\). We then concatenate these two arrays row-wise to obtain an \(8\times 400\) array. These \(8\times 400\) arrays are the inputs to the convolutional neural network with the layers listed in Table 1.

Table 1. Structure of the convolutional neural network designed to validate the similarity of candidate pairs.

The output of the convolutional neural network for each \(8\times 400\) input is a scalar ranging between \(-1\) and \(1\) that represents the predicted similarity of the given pair. To compute the loss, let \({\widehat{y}}_{i}\) and \({y}_{i}\) be the predicted value and the true label of the i-th candidate pair, respectively. We use the \({\ell }_{2}\) loss defined as follows:

$$\ell\left(y,\widehat{y}\right)=\sum\limits_{i=1}^{n}{\left({\widehat{y}}_{i}-{y}_{i}\right)}^{2}$$

To optimize the weight parameters of the network, the Adam optimizer is applied to this \({\ell }_{2}\) loss.

Validation of Candidate Similar Pairs:

To validate a given candidate similar pair with the trained convolutional neural network, each read is first trimmed to length \(400\) (if a read is shorter than \(400\), we append enough ‘A’ bases to its end to reach length \(400\)). Each of the two trimmed reads is one-hot encoded into an array of size \(4\times 400\). Concatenating them yields an \(8\times 400\) array, which is passed through the convolutional neural network to produce a number in the range \([-1, +1]\). If this number is greater than \(0\), we consider the pair similar.
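The following sketch illustrates this preprocessing; the trained network itself is represented only by a hypothetical `trained_cnn` call:

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot(read, length=400):
    # Trim long reads and pad short ones with 'A' (as described above),
    # then encode each base as a 4x1 indicator column.
    read = (read[:length] + "A" * length)[:length]
    x = np.zeros((4, length))
    for i, base in enumerate(read):
        x[BASE_INDEX[base], i] = 1.0
    return x

def encode_pair(read1, read2):
    # Row-wise concatenation of the two encoded reads: an 8 x 400 input.
    return np.vstack([one_hot(read1), one_hot(read2)])

x = encode_pair("ACCAGTC", "ACCGTCA")
print(x.shape)                # (8, 400)
# score = trained_cnn(x)      # hypothetical trained network from Table 1
# is_similar = score > 0      # pairs with score > 0 are treated as similar
```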

1.5 Supplementary Note 5: I-CONVEX Pipeline

To run I-CONVEX on a given read dataset, first move the FASTA file to the Clustering folder and rename it to reads.fasta. Then execute the following commands in order to obtain the pre-clusters:

(Pre-clustering command listing; see the I-CONVEX repository for the exact commands.)

The result is a Cluster folder containing subfolders, each of which represents a pre-cluster. The following commands then run the clustering-via-convexification module on the pre-clusters:

(Clustering-via-convexification command listing; see the I-CONVEX repository for the exact commands.)

The last command stores the final collected transcripts in the final_transcripts.fasta file.

To run all parts of the algorithm step by step, or change the hyper-parameters such as \(k\), \(d\), and \(b\), follow the advanced version instructions on https://github.com/sinaBaharlouei/I-CONVEX.

Supplementary Figures: Distribution of Sequence Lengths in Synthetic and Real Datasets

Supplementary Figure 3.

Distribution of the read lengths and transcript abundances for the simulated and real datasets.

Supplementary Figure 4.

Histograms of sequence length of four SIRV datasets.

Supplementary Figure 5.

Histograms of transcript abundances of four SIRV datasets.


Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG


Cite this paper

Baharlouei, S., Razaviyayn, M., Tseng, E., Tse, D. (2023). I-CONVEX: Fast and Accurate de Novo Transcriptome Recovery from Long Reads. In: Koprinska, I., et al. Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML PKDD 2022. Communications in Computer and Information Science, vol 1753. Springer, Cham. https://doi.org/10.1007/978-3-031-23633-4_23


  • DOI: https://doi.org/10.1007/978-3-031-23633-4_23


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-23632-7

  • Online ISBN: 978-3-031-23633-4
