ABSTRACT
To run search tasks in parallel with a balanced load, existing parallel BLAST schemes such as mpiBLAST introduce a data initialization stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, in today's big data era, a rapidly growing sequence database becomes too large to move across the network.
In this paper, we develop a Scalable Data Access Framework (SDAFT) to solve this problem. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. It consists of two interlocking components: 1) a data-centric, load-balanced scheduler (DC-scheduler) that enforces data-process locality and 2) a translation layer that translates conventional parallel I/O operations into HDFS I/O. Experiments with our SDAFT prototype, using a real-world database and queries on a wide variety of computing platforms, show that SDAFT reduces I/O cost by a factor of 4 to 10 and doubles overall execution performance compared with existing schemes.
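The idea of a data-centric, load-balanced scheduler can be illustrated with a minimal sketch. This is not the DC-scheduler described in the paper, whose algorithm the abstract does not specify. It is a simple greedy heuristic under assumed inputs: the function name `assign_fragments`, the `replicas` mapping (fragment to the nodes holding a copy), and the node names are all hypothetical.

```python
def assign_fragments(replicas, nodes):
    """Greedy locality-aware assignment of database fragments to nodes.

    replicas: dict mapping fragment id -> list of nodes storing a copy
              (hypothetical input; a DFS would report these locations)
    nodes:    list of worker node names
    Returns a dict mapping node -> list of fragments it should search.
    """
    load = {n: 0 for n in nodes}
    plan = {n: [] for n in nodes}
    # Place the most constrained fragments (fewest replicas) first.
    for frag in sorted(replicas, key=lambda f: (len(replicas[f]), f)):
        # Prefer nodes that already hold the fragment (data-process locality).
        local = [n for n in replicas[frag] if n in load]
        # Assumption: if no worker holds a copy, fall back to a remote read.
        candidates = local or nodes
        # Among the candidates, pick the least loaded node (ties by name).
        target = min(candidates, key=lambda n: (load[n], n))
        plan[target].append(frag)
        load[target] += 1
    return plan


if __name__ == "__main__":
    example = {
        "f0": ["n0", "n1"],
        "f1": ["n0", "n1"],
        "f2": ["n1", "n2"],
        "f3": ["n2", "n0"],
    }
    print(assign_fragments(example, ["n0", "n1", "n2"]))
```

The sketch counts each fragment as one unit of work. A real scheduler would likely weight fragments by size or expected search time and might assign work dynamically as processes become idle.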
- SDAFT: a novel scalable data access framework for parallel BLAST