poster

Acceleration of bulk memory operations in a heterogeneous multicore architecture

Authors:
JongHyuk Lee, University of Houston, Houston, TX, USA
Ziyi Liu, University of Houston, Houston, TX, USA
Xiaonan Tian, University of Houston, Houston, TX, USA
Dong Hyuk Woo, Intel Labs, Santa Clara, CA, USA
Weidong Shi, University of Houston, Houston, TX, USA
Dainis Boumber, University of Houston, Houston, TX, USA
Yonghong Yan, University of Houston, Houston, TX, USA
Kyeong-An Kwon, University of Houston, Houston, TX, USA

PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques, September 2012, Pages 423–424
https://doi.org/10.1145/2370816.2370877

Published: 19 September 2012

ABSTRACT

In this paper, we present a novel approach that uses the integrated GPU to accelerate bulk memory operations, such as memcpy or memset, which are conventionally performed by the CPU. Offloading bulk memory operations to the GPU has several advantages: i) the throughput-driven GPU outperforms the CPU on bulk memory operations; ii) for an on-die GPU that shares a unified cache with the CPU, the GPU's private caches can be leveraged by the CPU to store moved data, reducing the CPU cache bottleneck; iii) with additional lightweight hardware, asynchronous offload can be supported as well; and iv) unlike prior art that relies on dedicated hardware copy engines (e.g., DMA), our approach leverages the existing GPU hardware resources as much as possible. Performance results for our solution show that offloaded bulk memory operations outperform the CPU by up to 4.3 times in microbenchmarks while using fewer resources. Using eight real-world applications and a cycle-based full-system simulation environment, the results show a 30% speedup for five and a speedup of more than 20% for two of the eight applications.
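The asynchronous offload idea in point iii) can be illustrated with a minimal software sketch. This is not the authors' hardware mechanism: a single worker thread stands in for the integrated GPU, and the names `bulk_memcpy`, `_offload_engine`, and `OFFLOAD_THRESHOLD`, as well as the 4096-byte cutoff, are assumptions made only for illustration.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical cutoff: copies at or above this size are treated as "bulk".
OFFLOAD_THRESHOLD = 4096

# A single worker thread stands in for the integrated GPU in this sketch.
_offload_engine = ThreadPoolExecutor(max_workers=1)


def _copy(dst: bytearray, src: bytes, n: int) -> int:
    """Copy the first n bytes of src into dst and return n."""
    dst[:n] = src[:n]
    return n


def bulk_memcpy(dst: bytearray, src: bytes, n: int) -> Future:
    """Copy n bytes from src to dst and return a Future for the byte count.

    Small copies run inline on the caller (the "CPU" path) and return an
    already-completed Future. Bulk copies are handed to the stand-in engine
    so the caller can keep working and wait on the Future later.
    """
    if n < 0 or n > len(src) or n > len(dst):
        raise ValueError("copy length exceeds a buffer")
    if n < OFFLOAD_THRESHOLD:
        done: Future = Future()
        done.set_result(_copy(dst, src, n))
        return done
    return _offload_engine.submit(_copy, dst, src, n)
```

A caller would issue `pending = bulk_memcpy(dst, src, len(src))`, continue with unrelated work, and call `pending.result()` only when the copied data is needed, which mirrors the asynchronous offload pattern described above.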

Index Terms

Acceleration of bulk memory operations in a heterogeneous multicore architecture
1. Computer systems organization
  1. Architectures
    1. Other architectures
      1. Heterogeneous (hybrid) systems

Published in
PACT '12: Proceedings of the 21st international conference on Parallel architectures and compilation techniques
September 2012
512 pages
ISBN: 9781450311823
DOI: 10.1145/2370816
General Chairs: Pen-Chung Yew (University of Minnesota), Sangyeun Cho (University of Pittsburgh)
Program Chairs: Luiz DeRose (Cray, Inc.), David J. Lilja (University of Minnesota)
Copyright © 2012 Authors
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
bulk memory operation
gpu
heterogeneous multicore architecture
simd
Qualifiers
- poster
Acceptance Rates
Overall Acceptance Rate: 121 of 471 submissions, 26%