DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Wang, Jun; Shang, Pengju; Yin, Jiangling

doi:10.1007/978-1-4939-1905-5_7

Abstract

Recent years have seen an increasing number of scientists employ data parallel computing frameworks such as MapReduce and Hadoop to run data intensive applications and conduct analysis. In these frameworks, where compute and storage are co-located, a well-chosen data placement scheme can significantly improve performance. Existing data parallel frameworks, e.g. Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data intensive applications exhibit interest locality: they sweep only part of a big data set. Data that are often accessed together share grouping semantics. Because it does not take data grouping into consideration, random placement performs poorly and falls far below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and reorganizes data layouts to achieve the maximum parallelism per group, subject to load balance. By experimenting with two real-world MapReduce applications under different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8 %, reduces the completion latency of the map phase by up to 41.7 %, and improves the overall performance by 36.4 %, in comparison with Hadoop’s default random placement.
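The placement goal stated in the abstract, maximum parallelism per group subject to load balance, can be illustrated with a small sketch. This is not the DRAW algorithm itself: the groupings are given as input rather than extracted from logs, and the names `place_groups` and `group_parallelism`, the block ids, and the round-robin policy are all assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import cycle


def place_groups(groups, nodes):
    """Spread each group's blocks over the nodes in round-robin order.

    groups: dict mapping a group name to a list of block ids. This is a
            hypothetical input; DRAW derives groupings from system logs,
            which this sketch does not model.
    nodes:  list of node names.
    Returns a dict mapping node name -> list of block ids placed on it.
    """
    placement = defaultdict(list)
    # A single cursor shared by all groups keeps per-node load within one
    # block of each other, while consecutive blocks of one group still
    # land on different nodes.
    cursor = cycle(nodes)
    for name in sorted(groups):
        for block in groups[name]:
            placement[next(cursor)].append(block)
    return dict(placement)


def group_parallelism(placement, group_blocks):
    """Count the distinct nodes holding at least one block of the group."""
    wanted = set(group_blocks)
    return sum(1 for blocks in placement.values() if wanted & set(blocks))


groups = {"a": ["a1", "a2", "a3", "a4"], "b": ["b1", "b2"]}
nodes = ["n1", "n2", "n3", "n4"]
placement = place_groups(groups, nodes)
```

In this toy input, the four blocks of group `a` end up on four different nodes, so a job that reads only group `a` can run one local map task per node. A random placement gives no such guarantee for any single group.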

Notes

1. These numbers can be affected by the number of launched reduce tasks, the required data size, etc.
2. If the initial data distribution is not balanced, Hadoop users can start the balancer (a utility in Hadoop) to rebalance the data among the nodes.
3. In other words, LoA denotes how sub-optimal the random distribution is, on average. The closer LoA is to 1, the closer the random and optimal approaches are.
4. By running the Hadoop “fsck” command with the parameters “-files -blocks -locations” for each file.
5. In the current version, we define the jumping threshold as 30 %: if 30 % or more of the data are being relocated, a new, higher DRAW launching frequency is generated.
6. Similar to the jumping threshold, we define the diving threshold as 10 %: if 10 % or less of the data are being relocated, we lower the frequency.
7. There is one exception for Chicken: the data is more evenly distributed in the 2-replica case than in the 3-replica case.
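The threshold rule in Notes 5 and 6 can be sketched as a small function. The 30 % and 10 % thresholds come from the notes; the function name `next_interval`, the use of an interval between runs to represent launching frequency, and the `step` factor of 2 are assumptions, since the notes do not say by how much the frequency changes.

```python
JUMPING_THRESHOLD = 0.30  # Note 5: 30 % or more relocated -> raise frequency
DIVING_THRESHOLD = 0.10   # Note 6: 10 % or less relocated -> lower frequency


def next_interval(current_interval, relocated_fraction, step=2.0):
    """Return the interval until the next reorganization run.

    current_interval:   time between runs, in any unit (e.g. hours).
    relocated_fraction: share of data relocated by the last run, in [0, 1].
    step:               assumed adjustment factor; not given in the notes.
    """
    if not 0.0 <= relocated_fraction <= 1.0:
        raise ValueError("relocated_fraction must be between 0 and 1")
    if relocated_fraction >= JUMPING_THRESHOLD:
        # Much data moved: access patterns are shifting, so run more often.
        return current_interval / step
    if relocated_fraction <= DIVING_THRESHOLD:
        # Little data moved: the layout is stable, so run less often.
        return current_interval * step
    # Between the two thresholds the frequency is left unchanged.
    return current_interval
```

For example, with a 24-hour interval, a run that relocates 35 % of the data would shorten the next interval to 12 hours under these assumptions, while a run that relocates 5 % would lengthen it to 48 hours.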

Acknowledgements

This work is supported in part by the US National Science Foundation Grants CNS-1115665 and CCF-1337244 and National Science Foundation Early Career Award 0953946.

Author information

Authors and Affiliations

EECS, University of Central Florida, Orlando, FL, 32826, USA
Jun Wang, Pengju Shang & Jiangling Yin

Corresponding author

Correspondence to Jun Wang.

Editor information

Editors and Affiliations

University of Florida, Gainesville, Florida, USA
Xiaolin Li
Indiana University, Bloomington, Indiana, USA
Judy Qiu

About this chapter

Cite this chapter

Wang, J., Shang, P., Yin, J. (2014). DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_7

DOI: https://doi.org/10.1007/978-1-4939-1905-5_7
Published: 15 November 2014
Publisher Name: Springer, New York, NY
Print ISBN: 978-1-4939-1904-8
Online ISBN: 978-1-4939-1905-5
eBook Packages: Computer Science, Computer Science (R0)
