Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet

Authors:

Aikaterini Boufea,

Richard Finkers,

Martijn van Kaauwen,

Ioannis N. Athanasiadis

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies

Pages 219 - 226

https://doi.org/10.1145/3148055.3148060

Published: 05 December 2017

Abstract

Big Data has been seen as a remedy for the efficient management of the ever-increasing genomic data. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e. time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files, either in the original flat-file format, or using the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and supports easier querying of VCF files as it exposes the field structure. We discuss advantages and disadvantages in terms of storage capacity and querying performance with both flat VCF files and Apache Parquet using an open plant breeding dataset. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
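
As a rough illustration of what "exposing the field structure" means, the sketch below parses tab-separated VCF data lines into records with named, typed fields, so that a query becomes a predicate over columns rather than a text scan. It is plain Python rather than Spark or Parquet, and it is not Tomatula's implementation, which is not shown here. The function names and the sample lines are hypothetical, chosen only for illustration; the eight fixed column names follow the VCF specification.

```python
from typing import Dict, Optional

# The eight fixed columns defined by the VCF specification.
VCF_FIXED_COLUMNS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def parse_info(info: str) -> Dict[str, str]:
    """Split the semicolon-separated INFO column into key/value pairs.

    Flag entries without '=' are stored with an empty string value.
    """
    if info == ".":
        return {}
    pairs = {}
    for item in info.split(";"):
        key, _, value = item.partition("=")
        pairs[key] = value
    return pairs


def parse_vcf_line(line: str) -> Optional[dict]:
    """Turn one tab-separated VCF data line into a record with named fields.

    Header lines (starting with '#') return None.
    """
    if line.startswith("#"):
        return None
    parts = line.rstrip("\n").split("\t")
    record = dict(zip(VCF_FIXED_COLUMNS, parts[:8]))
    record["POS"] = int(record["POS"])
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    record["INFO"] = parse_info(record["INFO"])
    return record


# Hypothetical sample lines, invented for illustration only.
SAMPLE = [
    "##fileformat=VCFv4.1",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "ch01\t1200\t.\tA\tG\t50.0\tPASS\tDP=14;AF=0.5",
    "ch01\t3400\t.\tC\tT\t12.5\tq10\tDP=3",
    "ch02\t88\t.\tG\tA\t99.0\tPASS\tDP=30;DB",
]

records = [r for r in (parse_vcf_line(l) for l in SAMPLE) if r is not None]

# With named, typed fields, a query is a simple predicate over columns.
passing_on_ch01 = [
    r["POS"] for r in records if r["CHROM"] == "ch01" and r["FILTER"] == "PASS"
]
print(passing_on_ch01)
```

A columnar format stores each of these named fields contiguously, which is why a query touching only CHROM, POS and FILTER need not read the remaining columns; the flat-file approach has to scan and split every line first.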

Cited By

Erdei R, Delinschi D, Matei O (2022). Security Centric Scalable Architecture for Distributed Learning and Knowledge Preservation. 17th International Conference on Soft Computing Models in Industrial and Environmental Applications (SOCO 2022), 655-665. Online publication date: 12-Oct-2022. https://doi.org/10.1007/978-3-031-18050-7_64

Fan J, Dong S, Wang B (2020). Variant-Kudu: An Efficient Tool kit Leveraging Distributed Bitmap Index for Analysis of Massive Genetic Variation Datasets. Journal of Computational Biology. Online publication date: 6-Jan-2020. https://doi.org/10.1089/cmb.2019.0344

Cao L, Tao W, An S, Jin J, Yan Y, Liu X, Ge W, Sah A, Battle L, Sun J, Chang R, Westover B, Madden S, Stonebraker M (2019). Smile. Proceedings of the VLDB Endowment 12:12, 2230-2241. Online publication date: 1-Aug-2019. https://dl.acm.org/doi/10.14778/3352063.3352138

Index Terms

Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet
1. Applied computing
  1. Computers in other domains
    1. Agriculture
  2. Life and medical sciences
    1. Bioinformatics
2. Information systems
  1. Data management systems

Published In

BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies

December 2017

288 pages

ISBN:9781450355490

DOI:10.1145/3148055

General Chairs:
Ashiq Anjum, University of Derby, UK
Alan Sill, Texas Tech University, USA

Program Chairs:
Xinghui Zhao, Washington State University Vancouver, USA
Mohsen Farid, University of Derby, UK
Shrideep Pallickara, Colorado State University, USA
Jiannong Cao, The Hong Kong Polytechnic University, Hong Kong

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGARCH: ACM Special Interest Group on Computer Architecture
IEEE TCSC: IEEE Technical Committee on Scalable Computing

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

UCC '17

UCC '17: 10th International Conference on Utility and Cloud Computing

December 5 - 8, 2017

Austin, Texas, USA

Acceptance Rates

BDCAT '17 Paper Acceptance Rate 27 of 93 submissions, 29%;

Overall Acceptance Rate 27 of 93 submissions, 29%

Bibliometrics

Total Citations: 4
Total Downloads: 285
Downloads (Last 12 months): 31
Downloads (Last 6 weeks): 1

Reflects downloads up to 17 Jan 2025
