Spark framework for transcriptomic trimming algorithm reduces cost of reading multiple input files | IEEE Conference Publication | IEEE Xplore

Abstract:

In this paper, we investigate the feasibility and performance improvement of adapting a common standalone bioinformatics trimming tool for in-memory processing on a distributed Spark framework. The rapid and continuous rise of genomics technologies and applications demands fast and efficient genomic data processing pipelines. ADAM has emerged as a successful framework for handling large scientific datasets, and efforts are ongoing to expand its functionality in the bioinformatics pipeline. We hypothesize that executing as much of the pipeline as possible within the ADAM framework will improve the pipeline's time and disk requirements. We compare Trimmomatic, one of the most common raw read trimming algorithms, to our own simple Scala trimmer and show that the distributed framework allows our trimmer to suffer less overhead from increasing the number of input files. We conclude that executing Trimmomatic in Spark will improve performance with multiple file inputs. Future work will investigate the performance benefit of passing the distributed dataset directly to ADAM in memory rather than writing out an intermediate file to disk.
Date of Conference: 11-14 December 2017
Date Added to IEEE Xplore: 10 May 2018
Conference Location: Cambridge, UK
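
The abstract does not say which trimming operations the authors' Scala trimmer performs, so the following is only an illustrative sketch, not the authors' code and not Trimmomatic's implementation. It assumes Phred+33 quality encoding and a trailing-quality trim followed by a minimum-length filter, similar in spirit to Trimmomatic's TRAILING and MINLEN steps. It is written in plain Python rather than Scala on Spark, and all function names (`parse_fastq`, `trim_trailing`, `trim_many`) are hypothetical. The point it shows is that trimming is a per-read function, so reads from several input files can be handled as one collection instead of one tool invocation per file.

```python
from typing import Iterable, Iterator, List, Tuple

# A read is (header, bases, quality string). Hypothetical type for this sketch.
Read = Tuple[str, str, str]

PHRED_OFFSET = 33  # assumption: Phred+33 (Sanger / Illumina 1.8+) encoding


def parse_fastq(lines: Iterable[str]) -> Iterator[Read]:
    """Group well-formed FASTQ lines into 4-line records."""
    cleaned = [ln.rstrip("\n") for ln in lines if ln.strip()]
    for i in range(0, len(cleaned) - 3, 4):
        # Line i+2 is the '+' separator and is skipped.
        yield cleaned[i], cleaned[i + 1], cleaned[i + 3]


def trim_trailing(read: Read, min_quality: int = 20) -> Read:
    """Cut bases off the 3' end while their quality is below min_quality."""
    header, bases, quals = read
    end = len(bases)
    while end > 0 and ord(quals[end - 1]) - PHRED_OFFSET < min_quality:
        end -= 1
    return header, bases[:end], quals[:end]


def trim_many(
    inputs: Iterable[Iterable[str]],
    min_quality: int = 20,
    min_length: int = 1,
) -> List[Read]:
    """Trim reads from several FASTQ inputs and drop reads that end up too short."""
    kept: List[Read] = []
    for lines in inputs:
        for read in parse_fastq(lines):
            trimmed = trim_trailing(read, min_quality)
            if len(trimmed[1]) >= min_length:
                kept.append(trimmed)
    return kept


if __name__ == "__main__":
    # Two tiny in-memory "files"; real input would come from open(path).
    file_a = ["@r1", "ACGTAC", "+", "IIII##"]
    file_b = ["@r2", "GGCC", "+", "####", "@r3", "TTAA", "+", "IIII"]
    for header, bases, quals in trim_many([file_a, file_b]):
        print(header, bases, quals)
```

Because `trim_trailing` depends only on a single read, it is the kind of function that a distributed framework could apply across partitions of a dataset built from many files; how the authors partition FASTQ records in Spark is not described in the abstract.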
