research-article

SDAFT: a novel scalable data access framework for parallel BLAST

Authors:
Jiangling Yin, University of Central Florida, Orlando, Florida
Junyao Zhang, University of Central Florida, Orlando, Florida
Jun Wang, University of Central Florida, Orlando, Florida
Wu-chun Feng, Virginia Tech, Blacksburg, VA
DISCS-2013: Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems, November 2013, Pages 1–6. https://doi.org/10.1145/2534645.2534647

Published: 18 November 2013


ABSTRACT

To run search tasks in a parallel and load-balanced fashion, existing parallel BLAST schemes such as mpiBLAST introduce a data initialization stage that moves database fragments from shared storage to local cluster nodes. Unfortunately, in today's big data era, a rapidly growing sequence database becomes too large to move across the network.

In this paper, we develop a Scalable Data Access Framework (SDAFT) to solve this problem. SDAFT employs a distributed file system (DFS) to provide scalable data access for parallel sequence searches. It consists of two interlocked components: 1) a data-centric load-balanced scheduler (DC-scheduler) that enforces data-process locality and 2) a translation layer that translates conventional parallel I/O operations into HDFS I/O. In experiments with our SDAFT prototype using a real-world database and queries on a wide variety of computing platforms, we found that SDAFT reduces I/O cost by a factor of 4 to 10 and doubles overall execution performance compared with existing schemes.
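The abstract does not spell out how the DC-scheduler makes its decisions, so the following is only a minimal sketch of the general idea of locality-first, load-balanced assignment. It assumes a simple greedy heuristic, and the names `assign_fragments` and `locality_ratio`, the fragment identifiers, and the node names are all hypothetical and not taken from the paper.

```python
from collections import defaultdict


def assign_fragments(replicas, workers):
    """Greedy locality-first assignment (illustrative heuristic, not the paper's algorithm).

    replicas: dict mapping fragment id -> list of nodes holding a copy
    workers:  list of nodes that run a search process
    Returns a dict mapping worker node -> list of assigned fragment ids.
    """
    load = {w: 0 for w in workers}
    plan = defaultdict(list)

    def local_choices(frag):
        # Number of workers that could read this fragment from local disk.
        return sum(1 for node in replicas[frag] if node in load)

    # Place the most constrained fragments first so they keep their few local options.
    for frag in sorted(replicas, key=lambda f: (local_choices(f), f)):
        local = [w for w in workers if w in replicas[frag]]
        # Fall back to a remote read only when no worker holds a replica.
        candidates = local or workers
        target = min(candidates, key=lambda w: (load[w], w))
        plan[target].append(frag)
        load[target] += 1
    return dict(plan)


def locality_ratio(plan, replicas):
    """Fraction of fragments assigned to a worker that holds a local replica."""
    total = sum(len(frags) for frags in plan.values())
    local = sum(1 for w, frags in plan.items() for f in frags if w in replicas[f])
    return local / total if total else 1.0
```

Calling `assign_fragments` with a replica map taken from the file system's block locations and the list of nodes running search processes yields a per-node work list; `locality_ratio` then reports how many fragments can be read without crossing the network. A real scheduler would also need to weigh fragment sizes and per-query run times, which this sketch ignores.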


Index Terms
1. Software and its engineering
  1. Software organization and properties
    1. Contextual software domains
      1. Operating systems
        Memory management
    2. Software system structures

Published in
DISCS-2013: Proceedings of the 2013 International Workshop on Data-Intensive Scalable Computing Systems
November 2013
66 pages
ISBN: 9781450325066
DOI: 10.1145/2534645
General Chair: Xian-He Sun, Illinois Institute of Technology
Program Chairs: Yong Chen, Texas Tech University; Philip C. Roth, Oak Ridge National Laboratory
Copyright © 2013 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
HDFS
MPI/POSIX I/O
mpiBLAST
parallel sequence search

Acceptance Rates
DISCS-2013 Paper Acceptance Rate: 10 of 19 submissions, 53%
Overall Acceptance Rate: 19 of 34 submissions, 56%
