DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality

Abstract

Recent years have seen an increasing number of scientists employ data parallel computing frameworks such as MapReduce and Hadoop to run data intensive applications and conduct analysis. In these co-located compute and storage frameworks, a wise data placement scheme can significantly improve performance. Existing data parallel frameworks, e.g., Hadoop or Hadoop-based clouds, distribute data using a random placement method for simplicity and load balance. However, we observe that many data intensive applications exhibit interest locality: they sweep only part of a big data set. Data that are often accessed together are related by their grouping semantics. Because it does not take data grouping into consideration, random placement does not perform well and falls far below the efficiency of an optimal data distribution. In this paper, we develop a new Data-gRouping-AWare (DRAW) data placement scheme to address this problem. DRAW dynamically scrutinizes data accesses recorded in system log files. It extracts optimal data groupings and re-organizes data layouts to achieve the maximum parallelism per group subject to load balance. By running two real-world MapReduce applications with different data placement schemes on a 40-node test bed, we conclude that DRAW increases the total number of local map tasks executed by up to 59.8%, reduces the completion latency of the map phase by up to 41.7%, and improves the overall performance by 36.4%, in comparison with Hadoop’s default random placement.
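The claim that random placement limits per-group parallelism can be illustrated with a small simulation. The sketch below is not the DRAW algorithm; it is a simplified model that assumes one replica per block and uniformly random node selection, and all function names are hypothetical. It compares how many distinct nodes hold the blocks of one data group under random placement with the number used when the group is spread as evenly as possible.

```python
import random


def nodes_used_random(group_size, num_nodes, rng):
    """Count distinct nodes holding a group's blocks when each block
    is placed on a uniformly random node (assumption: one replica)."""
    return len({rng.randrange(num_nodes) for _ in range(group_size)})


def nodes_used_even(group_size, num_nodes):
    """Count distinct nodes used when the group's blocks are spread
    round-robin, i.e. over as many nodes as possible."""
    return min(group_size, num_nodes)


def average_random_coverage(group_size, num_nodes, trials=1000, seed=0):
    """Average number of distinct nodes used by random placement
    over repeated trials, with a fixed seed for reproducibility."""
    rng = random.Random(seed)
    total = sum(
        nodes_used_random(group_size, num_nodes, rng) for _ in range(trials)
    )
    return total / trials


if __name__ == "__main__":
    # 40 nodes matches the test bed size mentioned in the abstract;
    # the group size of 40 blocks is an arbitrary illustrative choice.
    print("even spread:", nodes_used_even(40, 40))
    print("random (avg):", average_random_coverage(40, 40))
```

Under this model the expected number of distinct nodes for k blocks on N nodes is N(1 - (1 - 1/N)^k), so a 40-block group on 40 nodes reaches only about 25 nodes on average, while an even spread reaches all 40. Real HDFS placement also involves replication and rack awareness, which this sketch ignores.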

Notes

  1. These numbers can be affected by the number of launched reduce tasks, the required data size, etc.

  2. If the initial data distribution is not balanced, Hadoop users can start a balancer (a utility in Hadoop) to re-balance the data among the nodes.

  3. In other words, LoA denotes how sub-optimal the random distribution is on average. The closer LoA is to 1, the closer the random placement is to the optimal one.

  4. By running the Hadoop “fsck” utility with the parameters “-files -blocks -locations” on each file.

  5. In the current version, we define the jumping threshold as 30%: if 30% or more of the data is relocated, a new, higher DRAW launching frequency is generated.

  6. Similar to the jumping threshold, we define the diving threshold as 10%: if 10% or less of the data is relocated, we lower the frequency.

  7. There is one exception for Chicken: the data is more evenly distributed in the 2-replica case than in the 3-replica case.
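The jumping and diving thresholds in notes 5 and 6 can be read as a simple feedback rule on how often DRAW runs. The following is a minimal sketch under stated assumptions: the 30% and 10% thresholds come from the notes, but the notes do not say how the new frequency is computed, so the halving and doubling of the launch interval (the `factor` parameter) and the function name are hypothetical.

```python
JUMPING_THRESHOLD = 0.30  # from note 5: 30% or more relocated -> run more often
DIVING_THRESHOLD = 0.10   # from note 6: 10% or less relocated -> run less often


def adjust_launch_interval(interval_hours, relocated_fraction, factor=2.0):
    """Return the next interval between DRAW runs.

    relocated_fraction is the share of data moved by the last run (0.0 to 1.0).
    A higher launching frequency is modeled as a shorter interval.
    The scaling factor is an illustrative assumption, not from the source.
    """
    if not 0.0 <= relocated_fraction <= 1.0:
        raise ValueError("relocated_fraction must be between 0 and 1")
    if relocated_fraction >= JUMPING_THRESHOLD:
        return interval_hours / factor  # much data moved: launch more frequently
    if relocated_fraction <= DIVING_THRESHOLD:
        return interval_hours * factor  # little data moved: launch less frequently
    return interval_hours               # in between: keep the current frequency


if __name__ == "__main__":
    for fraction in (0.35, 0.20, 0.05):
        print(fraction, adjust_launch_interval(24.0, fraction))
```

With a 24-hour interval, a run that relocates 35% of the data would shorten the interval, one that relocates 5% would lengthen it, and one that relocates 20% would leave it unchanged.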

Acknowledgements

This work is supported in part by the US National Science Foundation Grants CNS-1115665 and CCF-1337244 and by National Science Foundation Early Career Award 0953946.

Author information

Correspondence to Jun Wang.

Copyright information

© 2014 Springer Science+Business Media New York

About this chapter

Cite this chapter

Wang, J., Shang, P., Yin, J. (2014). DRAW: A New Data-gRouping-AWare Data Placement Scheme for Data Intensive Applications with Interest Locality. In: Li, X., Qiu, J. (eds) Cloud Computing for Data-Intensive Applications. Springer, New York, NY. https://doi.org/10.1007/978-1-4939-1905-5_7

  • DOI: https://doi.org/10.1007/978-1-4939-1905-5_7

  • Publisher Name: Springer, New York, NY

  • Print ISBN: 978-1-4939-1904-8

  • Online ISBN: 978-1-4939-1905-5

  • eBook Packages: Computer Science, Computer Science (R0)
