Abstract:
In this paper, we investigate the feasibility and performance improvement of adapting a common standalone bioinformatics trimming tool for in-memory processing on a distributed Spark framework. The rapid and continuous rise of genomics technologies and applications demands fast and efficient genomic data processing pipelines. ADAM has emerged as a successful framework for handling large scientific datasets, and efforts are ongoing to expand its functionality in the bioinformatics pipeline. We hypothesize that executing as much of the pipeline as possible within the ADAM framework will improve the pipeline's time and disk requirements. We compare Trimmomatic, one of the most common raw read trimming algorithms, to our own simple Scala trimmer and show that the distributed framework allows our trimmer to suffer less overhead from increasing the number of input files. We conclude that executing Trimmomatic in Spark will improve performance with multiple file inputs. Future work will investigate the performance benefit of passing the distributed dataset directly to ADAM in memory rather than writing out an intermediate file to disk.
Published in: 2017 12th International Conference for Internet Technology and Secured Transactions (ICITST)
Date of Conference: 11-14 December 2017
Date Added to IEEE Xplore: 10 May 2018