Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets


Abstract:

In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop Distributed File System (HDFS). Our experiments show that the distribution of sub-datasets over HDFS blocks, which is hidden by HDFS, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying a much heavier workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs will often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose DataNet, a storage-distribution-aware method to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.
Published in: IEEE Transactions on Big Data (Volume: 4, Issue: 2, 01 June 2018)
Page(s): 231 - 244
Date of Publication: 29 November 2016
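
The abstract names the ingredients of DataNet (per-block distribution meta-data, a HashMap plus BloomFilter store, and distribution-aware scheduling) without giving their details. The following is a minimal Python sketch of how such pieces could fit together, not the paper's implementation. Every name here (`BloomFilter`, `BlockSummary`, `blocks_to_read`, `assign_blocks`), the size threshold, and the greedy balancing rule are assumptions made for illustration; the sketch also ignores data locality, which a real HDFS scheduler would have to consider.

```python
import hashlib
import heapq


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class BlockSummary:
    """Hypothetical per-block meta-data: exact sizes for large sub-datasets
    (dict), membership only for small ones (Bloom filter)."""

    def __init__(self, sizes, threshold):
        self.exact = {}
        self.minor = BloomFilter()
        for sub_id, size in sizes.items():
            if size >= threshold:
                self.exact[sub_id] = size
            else:
                self.minor.add(sub_id)

    def may_contain(self, sub_id):
        return sub_id in self.exact or sub_id in self.minor


def blocks_to_read(summaries, sub_id):
    """Return blocks that may hold sub_id, so the others can be skipped."""
    return [b for b, s in summaries.items() if s.may_contain(sub_id)]


def assign_blocks(block_sizes, num_nodes):
    """Greedy balancing (assumed rule): give the largest remaining block
    to the currently least-loaded node."""
    heap = [(0, node, []) for node in range(num_nodes)]
    heapq.heapify(heap)
    for block_id, size in sorted(block_sizes.items(), key=lambda kv: -kv[1]):
        load, node, blocks = heapq.heappop(heap)
        blocks.append(block_id)
        heapq.heappush(heap, (load + size, node, blocks))
    return {node: (load, blocks) for load, node, blocks in heap}


# Example with made-up per-block sub-dataset sizes (in records).
summaries = {
    "b0": BlockSummary({"movie_42": 90, "movie_7": 2}, threshold=10),
    "b1": BlockSummary({"movie_42": 10}, threshold=10),
    "b2": BlockSummary({"movie_7": 50}, threshold=10),
}
candidates = blocks_to_read(summaries, "movie_42")
sizes = {b: summaries[b].exact.get("movie_42", 0) for b in candidates}
plan = assign_blocks(sizes, num_nodes=2)
```

In this sketch the dictionary keeps exact sizes only where they matter for load balancing, while the Bloom filter records the presence of small sub-datasets in a fixed amount of memory. The cost is that a block is occasionally read even though it holds no relevant data, because a Bloom filter can return false positives.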
