DOI: 10.1145/2534645.2534647

SDAFT: a novel scalable data access framework for parallel BLAST

Published: 18 November 2013

ABSTRACT

To run search tasks in a parallel and load-balanced fashion, existing parallel BLAST schemes such as mpiBLAST introduce a data-initialization stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, in today's big-data era, rapidly growing sequence databases have become too large to move across the network efficiently.

In this paper, we develop a Scalable Data Access Framework (SDAFT) to solve this problem. It employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. SDAFT consists of two interlocked components: 1) a data-centric load-balanced scheduler (DC-scheduler) that enforces data-process locality, and 2) a translation layer that translates conventional parallel I/O operations into HDFS I/O. Experiments with our SDAFT prototype on a real-world database and queries, run on a wide variety of computing platforms, show that SDAFT reduces I/O cost by a factor of 4 to 10 and doubles overall execution performance compared with existing schemes.
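The abstract summarizes the two components without implementation detail. As a purely illustrative sketch of the data-process locality idea behind the DC-scheduler (not SDAFT's actual code), the following hypothetical Python routine assigns each fragment-search task to a node that already stores a replica of the needed database fragment, breaking ties by current load and falling back to the least-loaded node when no local replica is usable. All names here (assign_tasks, placement, capacity) are invented for illustration.

```python
from collections import defaultdict


def assign_tasks(tasks, placement, capacity):
    """Hypothetical data-centric, load-balanced task assignment.

    tasks:     list of (task_id, fragment_id) pairs, one search task per
               database fragment that must be scanned.
    placement: dict mapping fragment_id -> set of node names that hold a
               replica of that fragment (as reported by the DFS).
    capacity:  dict mapping node name -> maximum concurrent tasks.

    Returns a dict mapping task_id -> assigned node.
    """
    load = defaultdict(int)  # tasks assigned to each node so far
    schedule = {}

    for task_id, frag in tasks:
        # Prefer nodes that already store the fragment and still have capacity.
        local_nodes = [n for n in placement.get(frag, set())
                       if load[n] < capacity[n]]
        if local_nodes:
            # Data-process locality: pick the least-loaded replica holder.
            node = min(local_nodes, key=lambda n: load[n])
        else:
            # No usable local replica: fall back to the least-loaded node;
            # the fragment is then read remotely through the DFS.
            node = min(capacity, key=lambda n: load[n])
        load[node] += 1
        schedule[task_id] = node
    return schedule


if __name__ == "__main__":
    placement = {"frag0": {"node1", "node2"}, "frag1": {"node3"}}
    capacity = {"node1": 2, "node2": 2, "node3": 2}
    tasks = [("t0", "frag0"), ("t1", "frag1"), ("t2", "frag0")]
    print(assign_tasks(tasks, placement, capacity))
```

In a real deployment the fragment-to-node placement would come from the DFS's block-location metadata (HDFS exposes block locations to clients), whereas this sketch takes it as a plain in-memory mapping.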


      • Published in

        DISCS-2013: Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
        November 2013
        66 pages
ISBN: 9781450325066
DOI: 10.1145/2534645

        Copyright © 2013 ACM

        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Publication History

        • Published: 18 November 2013


        Qualifiers

        • research-article

        Acceptance Rates

DISCS-2013 Paper Acceptance Rate: 10 of 19 submissions, 53%
        Overall Acceptance Rate: 19 of 34 submissions, 56%
