Elsevier

Journal of Systems and Software

Volume 121, November 2016, Pages 16-27

Adding data analytics capabilities to scaled-out object store

https://doi.org/10.1016/j.jss.2016.07.029

Highlights

  • In-situ MapReduce computation on large-scale data in object store.

  • Scale object store while computation layer remains lightweight.

  • Implementation with Hadoop and Ceph storage system.

  • Improved initial data ingest performance by up to 96%.

  • Improved MapReduce performance by up to 20%.

Abstract

This work focuses on enabling effective data analytics on scaled-out object storage systems. Typically, applications perform MapReduce computations by first copying large amounts of data to a separate compute cluster (i.e., a Hadoop cluster). However, this approach is inefficient considering that storage systems can host hundreds of petabytes of data: network bandwidth is easily saturated and overall energy consumption increases during large-scale data transfers. Instead of moving data between remote clusters, we propose implementing a data analytics layer on an object-based storage cluster to perform in-place MapReduce computation on existing data. The analytics layer is tied to the underlying object store, utilizing its data redundancy and distribution policies across the cluster. We implemented this approach with the Ceph object storage system and Hadoop, and conducted evaluations with various benchmarks. Performance evaluations show that initial data copy performance is improved by up to 96% and MapReduce performance by up to 20% compared to the stock Hadoop implementation.

Introduction

High-performance computing on large-scale data has become an important use case in recent years. There are various storage system solutions that allow end users to perform high-performance computation on large-scale data while also providing data protection and concurrency between different users (e.g., Amazon Elastic Compute Cloud).

Clusters and cloud storage applications that work on large-scale data typically employ separate compute and storage clusters, since the requirements of the compute and storage tiers differ from each other. However, a serious drawback of this architecture is the need to move large amounts of data from the storage nodes to the compute nodes in order to perform computation, and then to move the results back to the storage cluster. Today, many storage systems store petabytes of data for various applications, such as climate modeling, astronomy, and genomics analysis, and the amount of data stored in these systems is projected to reach exabyte scale in the near future (Gantz and Reinsel, 2012). Therefore, moving large amounts of data between storage and compute nodes is no longer an efficient way of performing computation on large-scale data. Additionally, storing data at both the storage and compute sites increases storage overhead, and with data replicated multiple times at both sites for resiliency, this overhead becomes even worse. Moving data between storage and compute nodes also increases the total energy consumption and the network load.
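
To make the data-movement cost above concrete, here is a back-of-envelope calculation (not from the paper; the dataset size and link speeds are hypothetical assumptions) of how long it takes just to copy a petabyte from a storage cluster to a compute cluster:

```python
# Back-of-envelope sketch: ideal (zero-overhead) time to copy a dataset
# from a storage cluster to a compute cluster. The dataset size and
# aggregate link speeds below are hypothetical assumptions, not figures
# from the paper.

def transfer_time_days(dataset_bytes: float, link_gbps: float) -> float:
    """Ideal transfer time over a single aggregate link, in days."""
    link_bytes_per_sec = link_gbps * 1e9 / 8  # bits/s -> bytes/s
    return dataset_bytes / link_bytes_per_sec / 86400  # seconds -> days

one_petabyte = 1e15  # bytes
print(f"{transfer_time_days(one_petabyte, 10):.1f} days at 10 Gb/s")   # -> 9.3
print(f"{transfer_time_days(one_petabyte, 100):.1f} days at 100 Gb/s")  # -> 0.9
```

Even under these idealized assumptions, the copy alone takes days, before any computation starts; at exabyte scale the same arithmetic yields years.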

On the other hand, many efforts have gone into improving storage interfaces and abstractions in order to store and access data more efficiently. Object-based storage (Gibson et al., 1998; Mesnier et al., 2003) is an important effort in this respect, and many scaled-out storage systems today, such as Lustre, Ceph (Maltzahn et al., 2010) and Swift, are based on the object-based storage abstraction. Object-based storage is an alternative to traditional block-based storage (e.g., SCSI, ATA). Data is stored in discrete containers, called objects, each of which is identified by a distinct numerical identifier. Each object stores data and data attributes that can be controlled by the user. Data attributes can be used to store metadata describing the data (e.g., size, name, replica locations), and metadata management operations that query these attributes can be offloaded from dedicated servers to object storage for improved performance (Ali et al., 2008). As a result, object-based storage increases the interaction between the storage system and the end user and simplifies the data management of a storage system.
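
As a rough illustration of this abstraction, the sketch below models an object as data plus user-controllable attributes, with metadata queries answered inside the store itself. All class and method names here are hypothetical and do not correspond to the API of Ceph or any real object store:

```python
# Toy model of the object-based storage abstraction described above.
# Names are hypothetical illustrations, not a real object-store API.

class StorageObject:
    def __init__(self, oid: int, data: bytes, **attributes):
        self.oid = oid    # distinct numerical identifier
        self.data = data
        # user-controllable attributes: metadata describing the data,
        # e.g. size, name, replica locations
        self.attributes = {"size": len(data), **attributes}

class ObjectStore:
    """Metadata queries run inside the store itself, instead of on a
    dedicated metadata server (the offloading idea from the text)."""
    def __init__(self):
        self._objects = {}

    def put(self, obj: StorageObject):
        self._objects[obj.oid] = obj

    def query(self, **criteria):
        """Offloaded metadata query: return ids of objects whose
        attributes match all given criteria."""
        return [o.oid for o in self._objects.values()
                if all(o.attributes.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
store.put(StorageObject(1, b"climate-run-01", name="climate.dat"))
store.put(StorageObject(2, b"genome", name="genome.dat"))
print(store.query(name="genome.dat"))  # -> [2]
```

The point of the sketch is only the division of labor: the store, not a separate metadata server, resolves attribute queries, which is what lets the storage layer grow smarter as it scales.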

Using object-based storage features, computational applications in a cluster or cloud environment can benefit from the intelligence of the underlying storage system and eliminate data movement while enabling in-place analytics capabilities. Consequently, the storage layer can be scaled while the computational layer remains lightweight. In this paper, we demonstrate this approach by implementing a computational framework, Hadoop (Shvachko et al., 2010), on the Ceph object-based storage system (Weil et al., 2006). We also conduct performance evaluations using the Grep (Hadoop Grep, 2009), Wordcount (Hadoop WordCount, 2011), TestDFSIO and TeraSort (Hadoop TeraSort, 2011) benchmarks with various redundancy and replication policies. The evaluation results indicate that the initial data copy performance of Hadoop is improved by up to 96% and MapReduce performance by up to 20%. It is important to note that Hadoop and the Ceph object storage system can still be used as stand-alone systems in this approach, meaning that their normal functionalities are not impacted.

The rest of this paper is organized as follows. Section 2 briefly introduces MapReduce and object-based storage, the two main components of this work. Then, Section 3 discusses related studies in a number of categories: improving the performance of Hadoop as a stand-alone system, using a cluster file system as the backend storage of Hadoop, and integrating the computation layer of Hadoop, MapReduce, with object storage systems for in-place computation. While presenting studies in the last category, their disadvantages relative to the method presented in this paper are discussed: namely, data is still transferred to HDFS, data management policies of the underlying storage system are overridden, or data-compute locality is only provided through virtualization. Section 4 shows how to enable in-place analytics capabilities on large-scale data using Hadoop and Ceph object storage without transferring data between compute and storage nodes and without changing how the underlying storage is managed. Section 5 gives the performance evaluation results of the proposed method from the Grep (Hadoop Grep, 2009), Wordcount (Hadoop WordCount, 2011), TestDFSIO and TeraSort (Hadoop TeraSort, 2011) benchmarks. Finally, Section 6 summarizes the findings of this work and discusses possible future research directions.

Section snippets

Background

This section gives a brief overview of the main components of the approach proposed in this work: MapReduce and object-based storage.

Related work

This section introduces related studies on improving the performance of Hadoop and its integration with object storage.

There have been several research efforts that analyzed and tried to improve the performance of Hadoop without integrating it with an underlying storage system. Shvachko et al. show the metadata scalability problem in Hadoop, pointing out that a single namenode in HDFS is sufficient for read-intensive Hadoop workloads, while it will be saturated for write-intensive workloads (Shvachko et al., 2010).

Proposed method architecture

This section presents our approach to integrating Hadoop with an object-based storage system. Ceph is used as the demonstration platform, but any object-based storage system, such as PVFS (Carns et al., 2000) or Lustre, could be used. As mentioned in Section 2.1, Hadoop consists of a computation layer, MapReduce, and a storage layer, HDFS, that manages the underlying storage system. This work modifies Hadoop to perform in-place computation on large-scale data without moving or transferring data.
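
The core of such an integration, running map tasks on the nodes that already hold replicas of their input objects instead of copying objects into HDFS, can be sketched as follows. The replica map and the greedy balancing policy are simplified illustrations, not Ceph's actual CRUSH placement or Hadoop's real scheduler:

```python
# Simplified sketch of locality-aware map task placement: each map task
# is assigned to a node that already stores a replica of its input
# object, so no object data crosses the network before computation
# starts. The replica map is a hypothetical illustration; in Ceph,
# replica locations would come from the CRUSH placement algorithm.

from collections import defaultdict

def place_map_tasks(replica_map: dict[int, list[str]]) -> dict[int, str]:
    """Assign each object id to one of the nodes holding a replica,
    greedily balancing the number of tasks per node."""
    load = defaultdict(int)   # tasks assigned per node so far
    placement = {}
    for oid, nodes in sorted(replica_map.items()):
        best = min(nodes, key=lambda n: load[n])  # least-loaded replica holder
        placement[oid] = best
        load[best] += 1
    return placement

replicas = {10: ["nodeA", "nodeB"], 11: ["nodeA", "nodeC"], 12: ["nodeB", "nodeC"]}
plan = place_map_tasks(replicas)
# every task lands on a node that already stores its input object
assert all(plan[oid] in replicas[oid] for oid in replicas)
print(plan)
```

The design point this illustrates is that the scheduler consumes the object store's existing replica placement rather than overriding it, which is what keeps the computation layer lightweight while the storage layer scales.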

Performance evaluation

This section describes the experimental setup first followed by the explanation of the performance evaluation tests and the discussion of the test results.

Conclusions

In this paper, we presented an approach that performs computation on existing large-scale data in an object storage system without moving data anywhere and analyzed the outcomes of this approach. Experimental evaluations with Hadoop and Ceph object-based storage system show that it is possible to implement Hadoop on top of Ceph as a lightweight computational framework and to perform computational tasks in-place alleviating the need to transfer large-scale data to a remote compute cluster.

Acknowledgement

This work was supported in part by an NSF High End Computing University Research Activity grant (award number CCF-0937879). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the NSF.

Cengiz Karakoyunlu received his BSc degree in Electrical and Computer Engineering from Worcester Polytechnic Institute, in Worcester, MA, in 2010 and Ph.D. in Electrical and Computer Engineering from the University of Connecticut, in Storrs, CT, in 2016. He is currently a Software Engineer at Panasas in Pittsburgh, Pennsylvania. His research interests include distributed storage and file systems, high performance computing and big data.

References (39)

  • Amazon elastic compute cloud....
  • Benchmarking and stress testing an hadoop cluster with terasort, testDFSIO & co....
  • Google cloud platform - compute engine....
  • Hadoop cluster setup....
  • Hadoop JIRA HDFS-941....
  • T10 technical committee of the international committee on information technology standards, Object-Based Storage...
  • Using lustre with apache hadoop, sun microsystems inc....
  • N. Ali et al.

    Revisiting the metadata architecture of parallel file systems

    Petascale Data Storage Workshop, 2008. PDSW ’08. 3rd

    (2008)
  • G. Ananthanarayanan et al.

    Scarlett: Coping with skewed content popularity in mapreduce clusters

    Proceedings of the Sixth Conference on Computer Systems, EuroSys ’11, ACM, New York, NY, USA

    (2011)
  • R. Ananthanarayanan et al.

    Cloud analytics: do we really need to reinvent the storage stack?

    Proceedings of the 2009 Conference on Hot Topics in Cloud Computing, HotCloud’09, USENIX Association, Berkeley, CA, USA

    (2009)
  • P.H. Carns et al.

    Pvfs: A parallel file system for linux clusters

    Proceedings of the 4th Annual Linux Showcase & Conference - Volume 4, ALS’00, USENIX Association, Berkeley, CA, USA

    (2000)
  • Y. Cheng et al.

    Cast: tiering storage for data analytics in the cloud

    Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing, HPDC ’15, ACM, New York, NY, USA

    (2015)
  • J. Dean et al.

    Mapreduce: simplified data processing on large clusters

    Commun. ACM

    (2008)
  • M.Y. Eltabakh et al.

    Cohadoop: flexible data placement and its exploitation in Hadoop

    Proc. VLDB Endow.

    (2011)
  • J. Gantz et al.

    The digital universe in 2020: Big data, bigger digital shadows, biggest growth in the far east

    IDC iView: IDC Anal. Future

    (2012)
  • G.A. Gibson et al.

    A cost-effective, high-bandwidth storage architecture

    SIGPLAN Not.

    (1998)
  • Hadoop Grep, 2009....
  • S. Ibrahim et al.

    Maestro: replica-aware map scheduling for mapreduce

    Cluster, Cloud and Grid Computing (CCGrid), 2012 12th IEEE/ACM International Symposium on

    (2012)
  • A. Kathpal et al.

    Nakshatra: towards running batch analytics on an archive

    Modelling, Analysis and Simulation of Computer and Telecommunication Systems (MASCOTS), 2014 IEEE 22nd International Symposium on

    (2014)

    Prof. John A. Chandy is a Professor and the Associate Head of the Electrical and Computer Engineering Department at the University of Connecticut. Prof. Chandy is also Co-Director of the Connecticut Cybersecurity Center, Interim Director of the UConn Center for Hardware Assurance, Security, and Engineering, and Co-Director of the Comcast Center for Cybersecurity Innovation. Prior to joining UConn, he had executive and engineering positions in software companies working particularly in the areas of clustered storage architectures, tools for the online delivery of psychotherapy and soft-skills training, distributed architectures, and unstructured data representation. His current research areas are in high-performance storage systems, reconfigurable computing, embedded systems security, distributed systems software and architectures, and multiple-valued logic. Dr. Chandy earned Ph.D. and M.S. degrees in Electrical Engineering from the University of Illinois in 1996 and 1993, respectively, and a S.B. in Electrical Engineering from the Massachusetts Institute of Technology in 1989.

    Alma Riska received her Ph.D. in Computer Science from the College of William and Mary, in Williamsburg, VA, in 2002. She was a Research Staff Member at Seagate Research in Pittsburgh, Pennsylvania and a Consultant Software Engineer at EMC in Cambridge, Massachusetts. She is currently Principal Software Engineer at NetApp in Waltham, Massachusetts. Her research interests are on performance and reliability modeling of computer systems, in general, and storage systems, in particular. The emphasis of her work is on applying analytic techniques and detailed workload characterization in designing more reliable and better performing storage systems that can adapt their operating into the dynamically changing operational environment. She is a member of IEEE and ACM.
