Research article · DOI: 10.1145/3148055.3148060

Managing Variant Calling Files the Big Data Way: Using HDFS and Apache Parquet

Published: 05 December 2017

Abstract

Big Data technologies have been seen as a remedy for managing ever-increasing genomic data efficiently. In this paper, we investigate the use of Apache Spark to store and process Variant Calling Files (VCF) on a Hadoop cluster. We demonstrate Tomatula, a software tool for converting VCF files to the Apache Parquet storage format, and an application to query variant calling datasets. We evaluate how the wall time (i.e., the time until the query answer is returned to the user) scales out on a Hadoop cluster storing VCF files either in the original flat-file format or in the Apache Parquet columnar storage format. Apache Parquet can compress the VCF data by around a factor of 10, and it supports easier querying of VCF files because it exposes the field structure. Using an open plant breeding dataset, we discuss the advantages and disadvantages of flat VCF files and Apache Parquet in terms of storage capacity and querying performance. We conclude that Apache Parquet offers benefits for reducing storage size and wall time, and scales out with larger datasets.
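
To illustrate what "exposing the field structure" can mean in practice, here is a minimal sketch in plain Python. It is not Tomatula's code and does not write Parquet; it only contrasts a flat, line-oriented VCF text with a column-per-field layout of the kind a columnar format provides. The sample records, the function names `vcf_to_columns` and `positions_passing`, and the restriction to the eight fixed VCF fields are all assumptions made for this example.

```python
# Illustrative sketch only: not Tomatula and not the Parquet format itself.
# It shows how a column-per-field layout lets a query touch only the fields it needs.

# The eight fixed, tab-separated columns of a VCF data line.
VCF_FIXED_FIELDS = ["CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]

# Hypothetical records invented for this example (not from the paper's dataset).
SAMPLE_VCF = """\
##fileformat=VCFv4.1
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO
ch01\t1005\t.\tA\tG\t50.0\tPASS\tDP=14
ch01\t2310\t.\tT\tC\t12.5\tq20\tDP=3
ch02\t877\t.\tG\tA\t99.0\tPASS\tDP=31
"""


def vcf_to_columns(text):
    """Turn flat VCF text into a dict mapping each fixed field to a list of values."""
    columns = {name: [] for name in VCF_FIXED_FIELDS}
    for line in text.splitlines():
        # Skip blank lines, meta-information lines (##) and the header line (#).
        if not line or line.startswith("#"):
            continue
        values = line.split("\t")
        for name, value in zip(VCF_FIXED_FIELDS, values):
            columns[name].append(value)
    # Give numeric fields a numeric type. This assumes QUAL is always present;
    # real VCF files may use "." for a missing value, which this sketch ignores.
    columns["POS"] = [int(p) for p in columns["POS"]]
    columns["QUAL"] = [float(q) for q in columns["QUAL"]]
    return columns


def positions_passing(columns, chrom):
    """Return positions on one chromosome whose FILTER is PASS, reading three columns only."""
    return [
        pos
        for c, pos, flt in zip(columns["CHROM"], columns["POS"], columns["FILTER"])
        if c == chrom and flt == "PASS"
    ]


columns = vcf_to_columns(SAMPLE_VCF)
print(positions_passing(columns, "ch01"))
```

With the flat file, the same question requires splitting every line on each query; in the columnar layout the query reads only the CHROM, POS and FILTER columns, which is the property the abstract attributes to Parquet.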

        Published In

        BDCAT '17: Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies
        December 2017
        288 pages
        ISBN:9781450355490
        DOI:10.1145/3148055
        Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

        Publisher

        Association for Computing Machinery

        New York, NY, United States

        Author Tags

        1. apache parquet
        2. apache spark
        3. big data
        4. bioinformatics
        5. hadoop
        6. hdfs
        7. tomatula
        8. variant calling

        Qualifiers

        • Research-article

        Conference

        UCC '17

        Acceptance Rates

        BDCAT '17 Paper Acceptance Rate 27 of 93 submissions, 29%;
        Overall Acceptance Rate 27 of 93 submissions, 29%
