Speed Up Big Data Analytics by Unveiling the Storage Distribution of Sub-Datasets

Abstract:

In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop Distributed File System (HDFS). Our experiments show that the distribution of sub-datasets over HDFS blocks, which is hidden by HDFS, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying out much more workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose DataNet, a storage-distribution-aware method to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the meta-data of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the meta-data. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments are conducted on PRObE's Marmot 128-node cluster testbed, and the results show the performance benefits of DataNet.
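
To make the idea of combining a hash map with a Bloom filter more concrete, the following is a minimal sketch and not the paper's ElasticMap implementation. It assumes, purely for illustration, that each block keeps exact sizes for sub-datasets above a threshold in a dictionary and records the remaining small sub-datasets only as Bloom filter members. The names `BloomFilter`, `BlockSummary`, `relevant_blocks`, `exact_threshold`, and `small_guess` are hypothetical and do not come from the abstract.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def __contains__(self, key):
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


class BlockSummary:
    """Hypothetical per-block metadata (an assumption, not the paper's layout)."""

    def __init__(self, sizes, exact_threshold):
        self.exact = {}             # sub-dataset id -> exact size in this block
        self.small = BloomFilter()  # membership only, for small sub-datasets
        for sub_id, size in sizes.items():
            if size >= exact_threshold:
                self.exact[sub_id] = size
            else:
                self.small.add(sub_id)

    def estimate(self, sub_id, small_guess):
        if sub_id in self.exact:
            return self.exact[sub_id]
        if sub_id in self.small:
            return small_guess  # present, but the exact size was not stored
        return 0


def relevant_blocks(summaries, sub_id, small_guess=1):
    """Return {block_id: estimated size} for blocks that may hold sub_id."""
    result = {}
    for block_id, summary in summaries.items():
        estimate = summary.estimate(sub_id, small_guess)
        if estimate > 0:
            result[block_id] = estimate
    return result
```

Under these assumptions, a scheduler could call `relevant_blocks` to skip blocks that cannot contain the requested sub-dataset and to weight the remaining blocks by estimated size when assigning work. Because a Bloom filter can return false positives, a block may occasionally be read even though it holds no relevant data.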

Published in: IEEE Transactions on Big Data ( Volume: 4, Issue: 2, 01 June 2018)

Page(s): 231 - 244

Date of Publication: 29 November 2016

DOI: 10.1109/TBDATA.2016.2632744
