Adding data analytics capabilities to scaled-out object store
Introduction
High-performance computing on large-scale data has become an important use case in recent years. Various storage system solutions allow end users to perform high-performance computation on large-scale data while also providing data protection and concurrency across users (Amazon elastic compute cloud).
Clusters and cloud storage applications that work on large-scale data typically employ separate compute and storage clusters, since the requirements of the compute and storage tiers differ. A serious drawback of this architecture, however, is the need to move large amounts of data from the storage nodes to the compute nodes in order to perform computation, and then to move the results back to the storage cluster. Today, many storage systems store petabytes of data for applications such as climate modeling, astronomy, and genomics analysis, and the amount of data stored in these systems is projected to reach exabyte scale in the near future (Gantz and Reinsel, 2012). Moving large amounts of data between storage and compute nodes is therefore no longer an efficient way to perform computation on large-scale data. Additionally, storing data at both the storage and compute sites increases storage overhead, and with data replicated multiple times at both sites for resiliency, this overhead becomes even worse. Moving data between storage and compute nodes also increases total energy consumption and network load.
On the other hand, many efforts have gone into improving storage interfaces and abstractions in order to store and access data more efficiently. Object-based storage (Gibson et al., 1998; Mesnier et al., 2003) is an important effort in this respect, and many scaled-out storage systems today, such as Lustre (Lustre; Maltzahn et al., 2010) and Swift, are based on the object-based storage abstraction. Object-based storage is an alternative to traditional block-based storage (e.g., SCSI, ATA). Data is stored in discrete containers, called objects, each of which is identified by a distinct numerical identifier. Each object stores data along with data attributes that can be controlled by the user. Data attributes can be used to store metadata describing the data (e.g., size, name, replica locations), and metadata management operations that query these attributes can be offloaded from dedicated servers to the object storage for improved performance (Ali et al., 2008). As a result, object-based storage increases the interaction between the storage system and the end user and simplifies the data management of a storage system.
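The object abstraction described above can be rendered as a minimal in-memory sketch: objects are addressed by a numeric ID and carry both data and user-controlled attributes, and attribute queries never touch the data itself. All class and method names here are illustrative, not part of any real object-store API.

```python
# Toy model of the object-based storage abstraction: numeric object IDs,
# opaque data, and user-settable metadata attributes.
class ObjectStore:
    def __init__(self):
        self._objects = {}

    def put(self, oid, data, **attrs):
        """Store data under a numeric object ID with optional attributes."""
        self._objects[oid] = {"data": data, "attrs": dict(attrs)}

    def get(self, oid):
        """Retrieve the object's data."""
        return self._objects[oid]["data"]

    def get_attr(self, oid, name):
        """Query metadata without reading the data itself -- the kind of
        operation that can be offloaded to the storage device."""
        return self._objects[oid]["attrs"][name]

store = ObjectStore()
store.put(42, b"climate-model output", size=20, replicas=["osd.1", "osd.3"])
print(store.get_attr(42, "replicas"))  # ['osd.1', 'osd.3'] -- no data moved
```

A real object storage device would execute `get_attr` on the storage node itself, which is what makes metadata offloading cheap.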
Using object-based storage features, the computational applications in a cluster or cloud application can benefit from the intelligence of the underlying storage system and eliminate data movement while enabling in-place analytics capabilities. Consequently, the storage layer can be scaled while the computational layer remains lightweight. In this paper, we propose an example of this approach by implementing a computational framework, Hadoop (Shvachko et al., 2010), on the Ceph object-based storage system (Weil et al., 2006). We also conduct performance evaluations using the Grep (Hadoop Grep, 2009), Wordcount (Hadoop WordCount, 2011), TestDFSIO and TeraSort (Hadoop TeraSort, 2011) benchmarks with various redundancy and replication policies. The evaluation results indicate that the initial data copy performance of Hadoop is improved by up to 96% and MapReduce performance is improved by up to 20%. It is important to note that Hadoop and the Ceph object storage system can still be used as stand-alone systems in this approach, meaning that their normal functionalities are not impacted.
The rest of this paper is organized as follows. Section 2 briefly introduces MapReduce and object-based storage, the two main components of this work. Section 3 then discusses related studies in a number of categories: improving the performance of Hadoop as a stand-alone system, using a cluster file system as the backend storage of Hadoop, and integrating the computation layer of Hadoop, MapReduce, with object storage systems for in-place computation. While presenting studies in the last category, their disadvantages relative to the method presented in this paper are discussed; namely, data is still transferred to HDFS, data management policies of the underlying storage system are overridden, or data-compute locality is only provided through virtualization. Section 4 shows how to enable in-place analytics capabilities on large-scale data using Hadoop and Ceph object storage, without transferring data from the storage nodes to the compute nodes and without changing how the underlying storage is managed. Section 5 gives the performance evaluation results of the proposed method on the Grep (Hadoop Grep, 2009), Wordcount (Hadoop WordCount, 2011), TestDFSIO and TeraSort (Hadoop TeraSort, 2011) benchmarks. Finally, Section 6 summarizes the findings of this work and discusses possible future research directions.
Section snippets
Background
This section gives a brief overview of the main components of the approach proposed in this work: MapReduce and object-based storage.
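The MapReduce model summarized in this section can be illustrated with a toy, single-process word count: a map phase emits (key, 1) pairs, a shuffle groups them by key, and a reduce phase sums each group. Real Hadoop distributes these phases across a cluster; this sketch only shows the dataflow.

```python
# Single-process sketch of the MapReduce dataflow (not a distributed run).
from collections import defaultdict

def map_phase(records):
    # Map: emit (word, 1) for every word in every input record.
    for line in records:
        for word in line.split():
            yield (word, 1)

def shuffle(pairs):
    # Shuffle: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    # Reduce: sum the values for each key.
    return {key: sum(values) for key, values in groups.items()}

counts = reduce_phase(shuffle(map_phase(["a b a", "b a"])))
print(counts)  # {'a': 3, 'b': 2}
```

In Hadoop, map and reduce tasks run on different nodes and the shuffle moves intermediate pairs over the network, which is why input locality matters so much for performance.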
Related work
This section introduces related studies on improving the performance of Hadoop and its integration with object storage.
There have been several research efforts that analyzed and tried to improve the performance of Hadoop without integrating it with an underlying storage system. Shvachko et al. show the metadata scalability problem in Hadoop by pointing out that a single namenode in HDFS is sufficient for read-intensive Hadoop workloads, while it will be saturated for write-intensive workloads.
Proposed method architecture
This section presents our approach to integrating Hadoop with an object-based storage system; Ceph is used as the demonstration platform, but any object-based storage system, such as PVFS (Carns et al., 2000) or Lustre, could be used. As mentioned in Section 2.1, Hadoop consists of a computation layer, MapReduce, and a storage layer, HDFS, that manages the underlying storage system. This work modifies Hadoop to perform in-place computation on large-scale data without moving or transferring data
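The core idea of the approach above can be sketched as follows: instead of copying every object into HDFS before a job runs, the job asks the object store where each object lives and schedules the map task on that node. All names here (`locate`, `schedule`, the location table) are invented for illustration; the paper's actual integration modifies Hadoop's storage layer to talk to Ceph directly.

```python
# Hypothetical sketch of in-place task placement: map tasks are scheduled
# on the nodes that already hold their input objects, so no bulk data
# moves between the storage and compute tiers.
OBJECT_LOCATIONS = {"obj-1": "node-a", "obj-2": "node-b"}  # store metadata

def locate(oid):
    """Ask the store which node holds the object (a metadata query)."""
    return OBJECT_LOCATIONS[oid]

def schedule(job_objects):
    """Place each map task on the node that already holds its input."""
    return {oid: locate(oid) for oid in job_objects}

plan = schedule(["obj-1", "obj-2"])
print(plan)  # {'obj-1': 'node-a', 'obj-2': 'node-b'}
```

The point of the sketch is that only metadata (object locations) crosses the network at scheduling time; the object data itself stays on the storage nodes where the tasks run.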
Performance evaluation
This section describes the experimental setup first, followed by an explanation of the performance evaluation tests and a discussion of the test results.
Conclusions
In this paper, we presented an approach that performs computation on existing large-scale data in an object storage system without moving the data, and analyzed the outcomes of this approach. Experimental evaluations with Hadoop and the Ceph object-based storage system show that it is possible to implement Hadoop on top of Ceph as a lightweight computational framework and to perform computational tasks in-place, alleviating the need to transfer large-scale data to a remote compute cluster.
Acknowledgement
This work was supported in part by an NSF High End Computing University Research Activity grant (award number CCF-0937879). Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF.
References (39)
- Amazon elastic compute cloud.
- Benchmarking and stress testing an Hadoop cluster with TeraSort, TestDFSIO & Co.
- Google Cloud Platform - Compute Engine.
- Hadoop cluster setup.
- Hadoop JIRA HDFS-941.
- T10 Technical Committee of the International Committee on Information Technology Standards, Object-Based Storage.
- Using Lustre with Apache Hadoop, Sun Microsystems Inc.
- Ali et al., Revisiting the metadata architecture of parallel file systems, Proceedings of the 3rd Petascale Data Storage Workshop (PDSW '08) (2008).
- Ananthanarayanan et al., Scarlett: coping with skewed content popularity in MapReduce clusters, Proceedings of the Sixth Conference on Computer Systems (EuroSys '11), ACM, New York, NY, USA (2011).
- Ananthanarayanan et al., Cloud analytics: do we really need to reinvent the storage stack?, Proceedings of the 2009 Conference on Hot Topics in Cloud Computing (HotCloud '09), USENIX Association, Berkeley, CA, USA (2009).
- Carns et al., PVFS: a parallel file system for Linux clusters, Proceedings of the 4th Annual Linux Showcase & Conference (ALS '00), USENIX Association, Berkeley, CA, USA (2000).
- Cheng et al., CAST: tiering storage for data analytics in the cloud, Proceedings of the 24th International Symposium on High-Performance Parallel and Distributed Computing (HPDC '15), ACM, New York, NY, USA (2015).
- Dean and Ghemawat, MapReduce: simplified data processing on large clusters, Commun. ACM (2008).
- Eltabakh et al., CoHadoop: flexible data placement and its exploitation in Hadoop, Proc. VLDB Endow. (2011).
- Gantz and Reinsel, The digital universe in 2020: big data, bigger digital shadows, biggest growth in the Far East, IDC iView: IDC Anal. Future (2012).
- Gibson et al., A cost-effective, high-bandwidth storage architecture, SIGPLAN Not. (1998).
- Ibrahim et al., Maestro: replica-aware map scheduling for MapReduce, 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2012).
- Kathpal et al., Nakshatra: towards running batch analytics on an archive, IEEE 22nd International Symposium on Modelling, Analysis & Simulation of Computer and Telecommunication Systems (MASCOTS) (2014).
Cengiz Karakoyunlu received his BSc degree in Electrical and Computer Engineering from Worcester Polytechnic Institute, in Worcester, MA, in 2010 and Ph.D. in Electrical and Computer Engineering from the University of Connecticut, in Storrs, CT, in 2016. He is currently a Software Engineer at Panasas in Pittsburgh, Pennsylvania. His research interests include distributed storage and file systems, high performance computing and big data.
Prof. John A. Chandy is a Professor and the Associate Head of the Electrical and Computer Engineering Department at the University of Connecticut. Prof. Chandy is also Co-Director of the Connecticut Cybersecurity Center, Interim Director of the UConn Center for Hardware Assurance, Security, and Engineering, and Co-Director of the Comcast Center for Cybersecurity Innovation. Prior to joining UConn, he had executive and engineering positions in software companies working particularly in the areas of clustered storage architectures, tools for the online delivery of psychotherapy and soft-skills training, distributed architectures, and unstructured data representation. His current research areas are in high-performance storage systems, reconfigurable computing, embedded systems security, distributed systems software and architectures, and multiple-valued logic. Dr. Chandy earned Ph.D. and M.S. degrees in Electrical Engineering from the University of Illinois in 1996 and 1993, respectively, and a S.B. in Electrical Engineering from the Massachusetts Institute of Technology in 1989.
Alma Riska received her Ph.D. in Computer Science from the College of William and Mary, in Williamsburg, VA, in 2002. She was a Research Staff Member at Seagate Research in Pittsburgh, Pennsylvania and a Consultant Software Engineer at EMC in Cambridge, Massachusetts. She is currently a Principal Software Engineer at NetApp in Waltham, Massachusetts. Her research interests are in performance and reliability modeling of computer systems in general, and storage systems in particular. The emphasis of her work is on applying analytic techniques and detailed workload characterization to design more reliable and better-performing storage systems that can adapt their operation to dynamically changing operational environments. She is a member of IEEE and ACM.