Abstract
We present a novel A⋆ seed heuristic that enables fast and optimal sequence-to-graph alignment, guaranteed to minimize the edit distance of the alignment assuming non-negative edit costs.
We phrase optimal alignment as a shortest path problem and solve it by instantiating the A⋆ algorithm with our seed heuristic. The seed heuristic first extracts non-overlapping substrings (seeds) from the read, finds exact seed matches in the reference, marks preceding reference positions by crumbs, and uses the crumbs to direct the A⋆ search. The key idea is to punish paths for the absence of foreseeable seed matches. We prove admissibility of the seed heuristic, thus guaranteeing alignment optimality.
Our implementation extends the free and open source aligner and demonstrates that the seed heuristic outperforms all state-of-the-art optimal aligners including GraphAligner, Vargas, PaSGAL, and the prefix heuristic previously employed by AStarix. Specifically, we achieve a consistent speedup of >60× on both short Illumina reads and long HiFi reads (up to 25kbp), on both the E. coli linear reference genome (1Mbp) and the MHC variant graph (5Mbp). Our speedup is enabled by the seed heuristic consistently skipping >99.99% of the table cells that optimal aligners based on dynamic programming compute.
AStarix aligner and evaluations: https://github.com/eth-sri/astarix Full paper: https://www.biorxiv.org/content/10.1101/2021.11.05.467453
Competing Interest Statement
The authors have declared no competing interest.
Footnotes
firstname{at}inf.ethz.ch, lastname{at}inf.ethz.ch
Accounting for the review feedback from RECOMB 2022.