Abstract:
In this paper, we study the problem of sub-dataset analysis over distributed file systems, e.g., the Hadoop Distributed File System (HDFS). Our experiments show that the distribution of sub-datasets over HDFS blocks, which HDFS hides, can often cause the corresponding analyses to suffer from seriously imbalanced or inefficient parallel execution. Specifically, the content clustering of sub-datasets results in some computational nodes carrying much more workload than others; furthermore, it leads to inefficient sampling of sub-datasets, as analysis programs often read large amounts of irrelevant data. We conduct a comprehensive analysis of how imbalanced computing patterns and inefficient sampling occur. We then propose DataNet, a storage-distribution-aware method to optimize sub-dataset analysis over distributed storage systems. First, we propose an efficient algorithm to obtain the metadata of sub-dataset distributions. Second, we design an elastic storage structure called ElasticMap, based on the HashMap and BloomFilter techniques, to store the metadata. Third, we employ distribution-aware algorithms for sub-dataset applications to achieve balanced and efficient parallel execution. Our proposed method can benefit different sub-dataset analyses with various computational requirements. Experiments conducted on PRObE's Marmot 128-node cluster testbed show the performance benefits of DataNet.
Published in: IEEE Transactions on Big Data (Volume: 4, Issue: 2, 01 June 2018)
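
To make the idea of combining a hash map with a Bloom filter for per-block metadata more concrete, the following is a minimal illustrative sketch in Python. It is not the authors' ElasticMap implementation: the class names (BloomFilter, BlockMeta), the size threshold, the small_guess parameter, and the policy of keeping exact sizes only for large sub-datasets while recording small ones as membership only are all assumptions made for illustration.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Tiny Bloom filter: may give false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, key):
        # Derive num_hashes bit positions from salted SHA-256 digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def __contains__(self, key):
        return all(self.bits[p // 8] & (1 << (p % 8))
                   for p in self._positions(key))


class BlockMeta:
    """Hypothetical per-block metadata (assumed policy, not the paper's).

    Sub-datasets whose total size in the block reaches size_threshold get
    an exact entry in a dict; smaller ones are only recorded in a Bloom
    filter, which costs a few bits each but loses the exact size.
    """

    def __init__(self, records, size_threshold):
        # records: iterable of (sub_dataset_id, size_in_bytes) pairs.
        totals = Counter()
        for sub_id, size in records:
            totals[sub_id] += size
        self.exact = {}
        self.small = BloomFilter()
        for sub_id, total in totals.items():
            if total >= size_threshold:
                self.exact[sub_id] = total
            else:
                self.small.add(sub_id)

    def estimate(self, sub_id, small_guess=1):
        """Exact size if tracked, small_guess if probably present, else 0."""
        if sub_id in self.exact:
            return self.exact[sub_id]
        return small_guess if sub_id in self.small else 0


# One block holding two records of a large sub-dataset and one small one.
meta = BlockMeta([("movie_a", 400), ("movie_a", 300), ("movie_b", 10)],
                 size_threshold=100)
print(meta.estimate("movie_a"))
print(meta.estimate("movie_b", small_guess=50))
```

Under these assumptions, a scheduler could sum estimate() over the blocks assigned to each node to approximate how much of a given sub-dataset that node would process, and use the totals to balance the assignment or to skip blocks that do not contain the sub-dataset at all.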