DOI: 10.1145/2484762.2484785
research-article

Exploiting MapReduce and data compression for data-intensive applications

Published: 22 July 2013

Abstract

HPC platforms have proven successful for predominantly compute-intensive jobs. Data-intensive jobs, however, still struggle on them, because large amounts of concurrent data movement from I/O nodes to compute nodes can easily saturate the network links. MapReduce, the "moving computation to data" paradigm for many pleasingly parallel applications, assumes that data reside on local disks and that computation is scheduled where the data are located. On an HPC machine, however, data must first be staged from a broader file system (such as Lustre) to HDFS before it can be accessed, and this staging can add a substantial delay to processing. In this paper we examine how data compression reduces the bandwidth needed to get data to the application, as well as its impact on the overall performance of data-intensive applications. Our study examines two types of applications: a 3D-time series caries lesion assessment that works on a large-scale medical image dataset, and an HTRC word counting task for large-scale text analysis running on XSEDE resources. Our extensive experimental results demonstrate significant improvements in storage space, data stage-in time, and job execution time.
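As a rough illustration of the idea in the abstract, the following minimal sketch compresses text chunks before they would be staged, then runs a map/reduce-style word count directly on the compressed chunks. It is not the authors' implementation: it uses Python's standard `gzip` module and `collections.Counter` in place of Hadoop and HDFS, the choice of gzip as codec is an assumption, and the helper names (`compress_chunk`, `map_word_count`, `reduce_counts`, `compression_ratio`) are hypothetical.

```python
import gzip
from collections import Counter
from functools import reduce


def compress_chunk(text: str) -> bytes:
    """Compress one text chunk before staging (gzip is an assumed codec)."""
    return gzip.compress(text.encode("utf-8"))


def map_word_count(blob: bytes) -> Counter:
    """Map step: decompress a staged chunk and count its words locally."""
    text = gzip.decompress(blob).decode("utf-8")
    return Counter(text.lower().split())


def reduce_counts(partials) -> Counter:
    """Reduce step: merge the per-chunk counts into one total."""
    return reduce(lambda a, b: a + b, partials, Counter())


def compression_ratio(chunks, blobs) -> float:
    """Raw bytes divided by compressed bytes across all chunks."""
    raw = sum(len(c.encode("utf-8")) for c in chunks)
    packed = sum(len(b) for b in blobs)
    return raw / packed


if __name__ == "__main__":
    # Toy input standing in for a large text corpus (illustrative only).
    chunks = ["the quick brown fox " * 500, "the lazy dog " * 500]
    blobs = [compress_chunk(c) for c in chunks]
    totals = reduce_counts(map_word_count(b) for b in blobs)
    print(f"compression ratio: {compression_ratio(chunks, blobs):.1f}x")
    print(totals.most_common(3))
```

The fewer bytes that cross the network during stage-in, the less the links are loaded, at the price of extra CPU time for decompression in each map task. Whether that trade pays off depends on the data and the codec, which is the question the paper studies experimentally.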




Information & Contributors

Information

Published In

ACM Other conferences
XSEDE '13: Proceedings of the Conference on Extreme Science and Engineering Discovery Environment: Gateway to Discovery
July 2013
433 pages
ISBN:9781450321709
DOI:10.1145/2484762
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States



Author Tags

  1. 3D-time series analysis
  2. HPC
  3. HTRC
  4. MapReduce
  5. XSEDE
  6. caries lesion activity
  7. compression
  8. matlab

Qualifiers

  • Research-article


Conference

XSEDE '13

Acceptance Rates

Overall Acceptance Rate 129 of 190 submissions, 68%


Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 1
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 22 Feb 2025


Citations

Cited By

  • (2023) Hybrid Meta Heuristics Optimization of Hadoop Parameters for Increased Performance Gains. 2023 4th IEEE Global Conference for Advancement in Technology (GCAT), pages 1-6. DOI: 10.1109/GCAT59970.2023.10353504. Online publication date: 6-Oct-2023
  • (2021) Optimizing Hadoop parameter for speedup using Q-Learning Reinforcement Learning. 2021 Fourth International Conference on Electrical, Computer and Communication Technologies (ICECCT), pages 1-7. DOI: 10.1109/ICECCT52121.2021.9616965. Online publication date: 15-Sep-2021
  • (2021) The 16,384-node Parallelism of 3D-CNN Training on An Arm CPU based Supercomputer. 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC), pages 152-161. DOI: 10.1109/HiPC53243.2021.00029. Online publication date: Dec-2021
  • (2017) Large-scale 3D Reconstruction with an R-based Analysis Workflow. Proceedings of the Fourth IEEE/ACM International Conference on Big Data Computing, Applications and Technologies, pages 85-93. DOI: 10.1145/3148055.3148062. Online publication date: 5-Dec-2017
  • (2016) The Memory Challenge in Reduce Phase of MapReduce Applications. IEEE Transactions on Big Data, 2:4, pages 380-386. DOI: 10.1109/TBDATA.2016.2607756. Online publication date: 1-Dec-2016
  • (2016) Accelerating mathematical knot simulations with R on the web. 2016 IEEE International Conference on Big Data (Big Data), pages 2315-2321. DOI: 10.1109/BigData.2016.7840864. Online publication date: Dec-2016
  • (2015) Scalable dental computing on cyberinfrastructure. Proceedings of the 2015 IEEE International Conference on Big Data (Big Data), pages 2470-2478. DOI: 10.1109/BigData.2015.7364042. Online publication date: 29-Oct-2015
  • (2014) Exploiting Image Processing and Geometric Analysis in Carious Lesion Assessment. Proceedings of International Conference on Internet Multimedia Computing and Service, pages 163-166. DOI: 10.1145/2632856.2632901. Online publication date: 10-Jul-2014
  • (2014) Visual analysis of large dental imaging data in caries research. 2014 IEEE 4th Symposium on Large Data Analysis and Visualization (LDAV), pages 77-84. DOI: 10.1109/LDAV.2014.7013207. Online publication date: Nov-2014
