
Optimizing MapReduce for GPUs with effective shared memory usage

Authors:

Gagan Agrawal

HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

Pages 199 - 210

https://doi.org/10.1145/2287076.2287109

Published: 18 June 2012

Abstract

Accelerators and heterogeneous architectures in general, and GPUs in particular, have recently emerged as major players in high performance computing. For many classes of applications, MapReduce has emerged as the framework for easing parallel programming and improving programmer productivity. There have already been several efforts to implement MapReduce on GPUs.

In this paper, we propose a new implementation of MapReduce for GPUs, which is very effective in utilizing shared memory, a small programmable cache on modern GPUs. The main idea is to use a reduction-based method to execute a MapReduce application. The reduction-based method allows us to carry out reductions in shared memory. To keep the implementation general and efficient, we provide the following features: a memory hierarchy for maintaining the reduction object, a multi-group scheme in shared memory that trades off space requirements against locking overheads, a general and efficient data structure for the reduction object, and an efficient swapping mechanism.
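The reduction-based idea and the multi-group scheme described above can be illustrated with a minimal CPU-only sketch in Python. It is not the paper's implementation and runs no GPU code: each "group" is modelled as a plain dictionary standing in for a copy of the reduction object in shared memory, and all names (`run_block`, `merge`, `reduction_based_mapreduce`, `num_groups`, and so on) are hypothetical. The point it shows is that every key-value pair produced by the map function is folded into a reduction object immediately, so intermediate pairs are never stored, and that several private copies can be merged afterwards as long as the reduce function is associative and commutative.

```python
def map_word_count(record):
    """Hypothetical map function: emit (word, 1) for each word in a line."""
    for word in record.split():
        yield word, 1


def reduce_add(current, value):
    """Hypothetical reduce function; assumed associative and commutative."""
    return current + value


def run_block(records, map_fn, reduce_fn, num_threads=8, num_groups=2):
    """Simulate one thread block that keeps num_groups reduction objects.

    More groups means fewer threads contend for each copy (less locking on
    a real GPU) at the cost of more space; here that is only modelled, since
    the simulation is sequential.
    """
    groups = [dict() for _ in range(num_groups)]
    for i, record in enumerate(records):
        thread_id = i % num_threads              # round-robin records to threads
        group = groups[thread_id % num_groups]   # threads in a group share one copy
        for key, value in map_fn(record):
            # Reduce immediately instead of storing the intermediate pair.
            if key in group:
                group[key] = reduce_fn(group[key], value)
            else:
                group[key] = value
    return groups


def merge(objects, reduce_fn):
    """Combine several reduction objects into one."""
    result = {}
    for obj in objects:
        for key, value in obj.items():
            result[key] = reduce_fn(result[key], value) if key in result else value
    return result


def reduction_based_mapreduce(records, map_fn, reduce_fn, num_blocks=2, **block_args):
    """Run every simulated block, then merge the per-block results."""
    per_block = []
    for b in range(num_blocks):
        chunk = records[b::num_blocks]           # strided split of the input
        groups = run_block(chunk, map_fn, reduce_fn, **block_args)
        per_block.append(merge(groups, reduce_fn))
    return merge(per_block, reduce_fn)


lines = ["gpu map reduce", "map reduce gpu gpu", "shared memory", "reduce"]
counts = reduction_based_mapreduce(lines, map_word_count, reduce_add)
print(counts)
```

Changing `num_groups` or `num_blocks` in this sketch changes only how the work is partitioned, not the final counts, which is the property that lets a real implementation pick the number of groups based on available shared memory and contention.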

We have evaluated our framework with seven commonly used MapReduce applications and compared it with sequential implementations; with MapCG, a recent MapReduce implementation on GPUs; and with Ji et al.'s work, a recent MapReduce implementation that utilizes shared memory in a different way. The main observations from our experimental results are as follows. For the four of the seven applications that can be considered reduction-intensive, our framework achieves a speedup of between 5 and 200 over MapCG (for large datasets). Similarly, we achieve a speedup of between 2 and 60 over Ji et al.'s work.

References

[1]

David W. Aha, Dennis F. Kibler, and Marc K. Albert. Instance-based Learning Algorithms. Machine Learning, 6:37--66, 1991.


[2]

Muthu Manikandan Baskaran, Uday Bondhugula, Sriram Krishnamoorthy, J. Ramanujam, Atanas Rountev, and P. Sadayappan. Automatic Data Movement and Computation Mapping for Multi-level Parallel Architectures with Explicitly Managed Memories. In PPoPP '08, pages 1--10, New York, NY, USA, 2008. ACM.


[3]

Randal E. Bryant. Data-Intensive Supercomputing: The Case for DISC. Technical Report CMU-CS-07-128, School of Computer Science, Carnegie Mellon University, 2007.

[4]

Bryan Catanzaro, Narayanan Sundaram, and Kurt Keutzer. A Map Reduce Framework for Programming Graphics Processors. In Third Workshop on Software Tools for MultiCore Systems (STMCS), 2008.

[5]

Cheng-Tao Chu, Sang Kyun Kim, Yi-An Lin, YuanYuan Yu, Gary R. Bradski, Andrew Y. Ng, and Kunle Olukotun. Map-Reduce for Machine Learning on Multicore. In NIPS, pages 281--288, 2006.


[6]

Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI, pages 137--150, 2004.


[7]

Marwa Elteir, Heshan Lin, and Wu-chun Feng. StreamMR: An Optimized MapReduce Framework for AMD GPUs. In ICPADS '11, Tainan, Taiwan, December 2011.


[8]

Reza Farivar, Abhishek Verma, Ellick Chan, and Roy Campbell. MITHRA: Multiple Data Independent Tasks on a Heterogeneous Resource Architecture. In CLUSTER, pages 1--10. IEEE, 2009.

[9]

Dan Gillick, Arlo Faria, and John Denero. MapReduce: Distributed Computing for Machine Learning. 2008.

[10]

N. Govindaraju, J. Gray, R. Kumar, and D. Manocha. GPUTeraSort: High Performance Graphics Co-processor Sorting for Large Database Management. In SIGMOD '06, pages 325--336, 2006.


[11]

Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata. High Performance Computing for Computational Science - VECPAR 2008. pages 430--443. Springer-Verlag, 2008.

[12]

Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. Mars: A MapReduce Framework on Graphics Processors. In PACT, pages 260--269, 2008.


[13]

Chuntao Hong, Dehao Chen, Wenguang Chen, Weimin Zheng, and Haibo Lin. MapCG: Writing Parallel Program Portable between CPU and GPU. In PACT, pages 217--226, 2010.


[14]

Anil K. Jain and Richard C. Dubes. Algorithms for Clustering Data. Prentice-Hall, Inc., 1988.


[15]

F. Ji and X. Ma. Using Shared Memory to Accelerate MapReduce on Graphics Processing Units. In IPDPS '11, pages 805--816, 2011.


[16]

Lars Nyland, Mark Harris, and Jan Prins. Fast N-Body Simulation with CUDA. In GPU Gems 3, chapter 31, 2007.

[17]

Maryam Moazeni, Alex Bui, and Majid Sarrafzadeh. A Memory Optimization Technique for Software-managed Scratchpad Memory in GPUs. In Symposium on Application Specific Processors, pages 43--49, 2009.

[18]

M. Mustafa Rafique, Benjamin Rose, Ali Raza Butt, and Dimitrios S. Nikolopoulos. CellMR: A Framework for Supporting MapReduce on Asymmetric Cell-Based Clusters. In IPDPS, pages 1--12, 2009.


[19]

Colby Ranger, Ramanan Raghuraman, Arun Penmetsa, Gary R. Bradski, and Christos Kozyrakis. Evaluating MapReduce for Multi-core and Multiprocessor Systems. In Proceedings of 13th HPCA, pages 13--24, 2007.


[20]

Koichi Shirahata, Hitoshi Sato, and Satoshi Matsuoka. Hybrid Map Task Scheduling for GPU-Based Heterogeneous Clusters. In CloudCom '10, pages 733--740, 2010.


[21]

Jeff A. Stuart and John D. Owens. Multi-GPU MapReduce on GPU Clusters. In IPDPS, 2011.


[22]

Richard M. Yoo, Anthony Romano, and Christos Kozyrakis. Phoenix Rebirth: Scalable MapReduce on a Large-Scale Shared-Memory System. In IISWC, pages 198--207, 2009.



Index Terms

Optimizing MapReduce for GPUs with effective shared memory usage
1. Computing methodologies
  1. Parallel computing methodologies
    1. Parallel programming languages
2. Software and its engineering
  1. Software notations and tools
    1. General programming languages
      1. Language types
        Parallel programming languages



Published In


HPDC '12: Proceedings of the 21st international symposium on High-Performance Parallel and Distributed Computing

June 2012

308 pages

ISBN:9781450308052

DOI:10.1145/2287076

General Chair:
Dick Epema, Delft University of Technology and Eindhoven University of Technology, The Netherlands

Program Chairs:
Thilo Kielmann, Vrije Universiteit, The Netherlands
Matei Ripeanu, The University of British Columbia, Canada

Copyright © 2012 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

University of Arizona
SIGARCH: ACM Special Interest Group on Computer Architecture

Publisher

Association for Computing Machinery

New York, NY, United States



Qualifiers

Research-article

Conference

HPDC'12

HPDC'12: The 21st International Symposium on High-Performance Parallel and Distributed Computing

June 18 - 22, 2012

Delft, The Netherlands

Acceptance Rates

HPDC '12 Paper Acceptance Rate 23 of 143 submissions, 16%;

Overall Acceptance Rate 166 of 966 submissions, 17%


Bibliometrics & Citations

Bibliometrics

Article Metrics

Total Citations: 43
Total Downloads: 1,027
Downloads (Last 12 months): 22
Downloads (Last 6 weeks): 4

Reflects downloads up to 13 Feb 2025

Citations

Cited By

Lei Y, Liu Q, Liu H (2025). Developing a memory-efficient GPGPU-parallelized contact detection algorithm for 3D engineering-scale FDEM simulations. Computers and Geotechnics 179 (107031). Online publication date: Mar-2025.
https://doi.org/10.1016/j.compgeo.2024.107031
Lei Y, Yang X, Liu Q, Liu H, Chu Z, Wen J, Huang Y (2024). An enhanced polar-based GPGPU-parallelized contact detection algorithm for 3D FDEM and its application to cracking analysis of shield tunnel segmental linings. Tunnelling and Underground Space Technology 148 (105782). Online publication date: Jun-2024.
https://doi.org/10.1016/j.tust.2024.105782
Ghosh R, Ghosh H (2023). Distributed Shared Memory. Distributed Systems (337-369). Online publication date: 10-Feb-2023.
https://doi.org/10.1002/9781119825968.ch13
Papadimitriou M, Fumero J, Stratikopoulos A, Kotselidis C, Titzer B, Xu H, Zhang I (2021). Automatically exploiting the memory hierarchy of GPUs through just-in-time compilation. Proceedings of the 17th ACM SIGPLAN/SIGOPS International Conference on Virtual Execution Environments (57-70). Online publication date: 7-Apr-2021.
https://dl.acm.org/doi/10.1145/3453933.3454014
Awaysheh F, Alazab M, Garg S, Niyato D, Verikoukis C (2021). Big Data Resource Management & Networks: Taxonomy, Survey, and Future Directions. IEEE Communications Surveys & Tutorials 23:4 (2098-2130). Online publication date: Dec-2022.
https://doi.org/10.1109/COMST.2021.3094993
Kim H, Hong S, Park J, Han H (2019). Static code transformations for thread-dense memory accesses in GPU computing. Concurrency and Computation: Practice and Experience 32:5. Online publication date: 18-Oct-2019.
https://doi.org/10.1002/cpe.5512
Chen C, Li K, Ouyang A, Li K (2018). FlinkCL: An OpenCL-Based In-Memory Computing Architecture on Heterogeneous CPU-GPU Clusters for Big Data. IEEE Transactions on Computers 67:12 (1765-1779). Online publication date: 1-Dec-2018.
https://doi.org/10.1109/TC.2018.2839719
Chen Z, Xu J, Tang J, Kwiat K, Kamhoua C, Wang C (2018). GPU-Accelerated High-Throughput Online Stream Data Processing. IEEE Transactions on Big Data 4:2 (191-202). Online publication date: 1-Jun-2018.
https://doi.org/10.1109/TBDATA.2016.2616116
Mei S, Guan H, Wang Q (2018). An Overview on the Convergence of High Performance Computing and Big Data Processing. 2018 IEEE 24th International Conference on Parallel and Distributed Systems (ICPADS) (1046-1051). Online publication date: Dec-2018.
https://doi.org/10.1109/PADSW.2018.8644997
Chen S, Wei C, Chiu Y, Lai B (2017). A Hadoop-based Principle Component Analysis on embedded heterogeneous platform. 2017 International Symposium on VLSI Design, Automation and Test (VLSI-DAT) (1-4). Online publication date: Apr-2017.
https://doi.org/10.1109/VLSI-DAT.2017.7939667
