Chapter Five - Manual Parallelization Versus State-of-the-Art Parallelization Techniques: The SPEC CPU2006 as a Case Study
Introduction
The quest for ever greater power and speed of computer systems never ends. Parallel systems have long been regarded as a promising solution for both throughput-oriented and speedup-oriented computing. For years, parallel processing was used predominantly for demanding scientific applications and large-scale systems, but architecture, technology, and application trends have been pushing it rapidly toward commercial computing in medium-scale and even small-scale systems [1].
The importance of parallel processing has been boosted in the last decade by current trends in processor architecture and technology. Until recently, processor performance grew steadily following Moore's law, primarily as a consequence of an ever-increasing number of progressively faster transistors. However, the inability to extract further benefits from instruction-level parallelism (ILP), problems with power dissipation, technology constraints, and the design and verification difficulties of complex superscalars gave rise to chip multiprocessors (CMPs) [2]. Since a CMP places multiple simple superscalar cores on a single chip, a multicore processor retains the benefits of ILP while also issuing instructions from multiple instruction streams (threads) in the same cycle [3]. Parallel processing is thus brought down to the laptop and embedded-system level. Whereas parallelism in superscalars is extracted at the instruction level, transparently to the software designer, the growing level of parallelism in multicores (thread-level parallelism) requires much more intensive involvement of the programmer.
The first part of this chapter presents a survey of parallelization issues and approaches. It reviews some important issues in resolving data dependences through loop transformations in order to improve parallelizability. A broad spectrum of parallelization approaches and contemporary tools is then explored to capture the state of the art in the field. Special attention is devoted to typical parallelization patterns found in sequential applications. The speedups obtained by applying these simple yet powerful patterns are significant, as we show later in our case study.
Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge, however, is to fully utilize the resources of multicore processors to improve the performance of the considerable body of existing general-purpose sequential applications. Since the nature of such applications is best reflected in benchmark suites, the second part of this chapter is a case study on parallelizing the SPEC CPU2006 benchmark suite [4], [5], one of the most representative and widely used suites for uniprocessors. It examines the potential for autoparallelization and vectorization of the SPEC CPU2006 benchmarks in state-of-the-art compilers, as well as efforts to parallelize its applications outside the suite. To achieve an additional level of parallelism, the study concentrates on manual source-to-source code modifications guided by profiling information. To this end, the SPEC CPU2006 benchmarks are carefully examined for places where typical parallelization patterns can be applied efficiently. Finally, the resulting speedups are evaluated on two large parallel machines. The evaluation environment and methodology are also described to illustrate the entire process in detail. Based on this experience, some general indications of where to find parallelization potential are discussed.
Section snippets
Parallelization Theory
The main goal of parallelization is to decrease application execution time as much as possible. Success in parallelization is mainly reflected in a performance indicator called speedup, which shows how much faster a parallel program runs than the corresponding sequential program. Amdahl's law determines the maximum speedup when only part of the program can be parallelized: the speedup of a program is limited by the time spent in its nonparallelized section.
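Amdahl's bound can be sketched numerically. The small helper below is purely illustrative (the function name and parameters are ours, not from the chapter); it assumes a fraction p of the sequential runtime is perfectly parallelizable across n processors:

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the sequential
    runtime is parallelized perfectly across n processors.
    The serial fraction (1 - p) runs at its original speed and
    therefore caps the achievable speedup at 1 / (1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)

# A 90%-parallel program gains only about 3x on 4 cores, and even
# with a million cores it can never exceed the 10x serial ceiling.
four_cores = amdahl_speedup(0.9, 4)
many_cores = amdahl_speedup(0.9, 10**6)
```

Note how quickly the serial fraction dominates: going from 4 cores to a million buys less than a 7x further improvement here, which is why shrinking the nonparallelized section matters as much as adding processors.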
Parallelization Techniques and Tools
Some sequential programs can easily be transformed into parallel counterparts consisting of tasks that run independently. Such parallel programs, known as embarrassingly parallel [1], execute very efficiently and scale perfectly, since there is no communication among the tasks. Examples include genetic algorithms and brute-force searches in cryptography. For such programs, parallel paradigms, models, and languages are presented. For sequential programs that cannot be
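The task independence described above can be sketched as follows; the toy "cipher", the function names, and the worker count are our own illustrative assumptions, not taken from the chapter. Each candidate key is checked without touching shared state, so the key space can be partitioned freely among workers:

```python
from concurrent.futures import ThreadPoolExecutor

def check_key(key):
    """Independently test one candidate key. There is no shared
    state, so any number of these checks may run concurrently."""
    # Toy 'cipher' (illustrative only): the secret key is the one
    # whose cube ends in the digits 867.
    return key if (key ** 3) % 1000 == 867 else None

def brute_force(key_space, workers=4):
    # Embarrassingly parallel: the key space is simply split among
    # workers and results collected; no task ever waits on another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [k for k in pool.map(check_key, key_space) if k is not None]
```

With processes instead of threads, the same structure spreads across cores with essentially no coordination cost, which is why such workloads scale almost linearly.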
About Manual Parallelization
Automatic parallelization is quite attractive since it relieves the programmer of a great burden; however, the approach has its disadvantages. It is limited in detecting data dependences and is thus inappropriate for complex code. As we will see, it can even induce performance degradation. Under these conditions, a manual parallelization approach, in which the programmer is directly involved in identifying and implementing parallelism, is preferred. This section discusses the issues in manual
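One transformation a programmer commonly applies by hand is sketched below, with our own illustrative names: a sum reduction whose shared accumulator creates a loop-carried dependence is rewritten into per-worker private partial sums that are combined afterwards. This is, in spirit, the same rewriting an OpenMP `reduction(+:sum)` clause performs:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each worker reduces its own chunk into a private accumulator,
    removing the loop-carried dependence on one shared sum."""
    acc = 0
    for x in chunk:
        acc += x
    return acc

def parallel_sum(data, workers=4):
    # Manually split the iteration space into contiguous chunks...
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # ...then combine the private partial results at the end.
        return sum(pool.map(partial_sum, chunks))
```

An autoparallelizer that cannot prove the accumulator updates commute must leave such a loop sequential; the programmer, knowing that addition is associative, can perform this privatization safely by hand.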
Case Study: Parallelization of SPEC CPU2006
Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge, however, is to fully utilize the resources of multicore processors to improve the performance of the large number of existing general-purpose sequential applications. Since the nature of such applications is best reflected in uniprocessor benchmark suites, the focus of this case study is the parallelization of SPEC CPU2006 [4], [5] as one of the most
Conclusion
The rapidly growing acceptance of parallel systems and multicore processors emphasizes the importance of using their resources more efficiently when executing sequential programs. The problem of parallelizing such programs is therefore imminent. This chapter first explores the theoretical background of the field and surveys various parallelization approaches and tools. As a case study, it also examines the manual parallelization of the standard SPEC CPU2006 benchmark suite. Automatic
Acknowledgments
A. V. would like to thank Prof. Lawrence Rauchwerger, who accepted him for an internship, for many helpful suggestions, and the entire Parasol Laboratory faculty and staff at Texas A&M University, who were of great help in configuring machines and starting jobs on them.
References (91)
- et al., Parallelization of DOALL and DOACROSS loops—a survey
- et al., Parallel Computer Architecture: A Hardware/Software Approach (1998)
- et al., Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency (2007)
- The Landscape of Parallel Computing Research: A View from Berkeley, Technical report UCB/EECS-2006-183 (2006)
- SPEC CPU2006. http://www.spec.org/cpu2006/ (accessed 23 July...
- SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News (2006)
- et al., Parallel Programming in OpenMP (2001)
- et al., Optimizing Compilers for Modern Architectures: A Dependence-Based Approach (2001)
- et al., Compilers: Principles, Techniques, and Tools (1986)
- et al., The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization, IEEE Trans. Parallel Distrib. Syst. (1999)
- MapReduce: simplified data processing on large clusters
- The implementation of the Cilk-5 multithreaded language
- The Cilk++ concurrency platform
- The Sisal model of functional programming and its implementation
- Introduction to UPC and Language Specification, Technical report CCS-TR-99-157
- Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism
- STAPL: an adaptive, generic parallel C++ library
- Algorithmic Skeletons: Structured Management of Parallel Computation
- A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers, Softw. Pract. Exp.
- Skandium: multi-core programming with algorithmic skeletons
- Two fundamental concepts in skeletal parallel programming
- Consistency analysis in Bloom: a CALM and collected approach
- Dryad: distributed data-parallel programs from sequential building blocks
- Efficient code generation for automatic parallelization and optimization
- CHiLL: A Framework for Composing High-Level Loop Transformations, Technical report 08-897
- Kremlin: rethinking and rebooting gprof for the multicore age
- Maximizing multiprocessor performance with the SUIF compiler, Computer
- Polaris: An Optimizing Compiler for Parallel Workstations and Scalable Multiprocessors, Technical report 1475
- The value evolution graph and its use in memory reference analysis
- Cetus: a source-to-source compiler infrastructure for multicores, Computer
- Towards automatic translation of OpenMP to MPI
- Nanos Mercurium: a research compiler for OpenMP
- Merge: a programming model for heterogeneous multi-core systems
- Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
- A compile/run-time environment for the automatic transformation of linked list data structures, Int. J. Parallel Prog.
- Defining and Implementing Commutativity Conditions for Parallel Execution, Technical report
- Introduction to Algorithms
Cited by (5)
- Distributing and Parallelizing Non-canonical Loops. 2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
- Influence of loop transformations on performance and energy consumption of the multithreaded WZ factorization. 2022, Proceedings of the 17th Conference on Computer Science and Intelligence Systems, FedCSIS 2022
- Modelling flood events with a cumulant CO lattice Boltzmann shallow water model. 2021, Natural Hazards
About the Authors
Aleksandar Vitorović received a Bachelor degree in Computer Science in 2008 and a Master degree in Computer Science in 2010 from the School of Electrical Engineering, University of Belgrade. In 2010, he started a Ph.D. program at EPFL, Lausanne, Switzerland. His main research interests are program parallelization and distributed systems.
Milo Tomašević was born in Nikšić, Montenegro. He received his B.Sc. in Electrical Engineering and his M.Sc. and Ph.D. in Computer Engineering from the University of Belgrade, Serbia, in 1980, 1984, and 1992, respectively. He is currently an Associate Professor and Head of the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade, Serbia. He was previously with the Pupin Institute, Belgrade, for over a decade, where he was involved in many research and development projects. His current research interests are mainly in computer architecture (especially multiprocessor systems), parallel programming, cryptography, and algorithms and data structures. In these areas, he has published almost 100 papers in international scientific journals, books, and proceedings of international and domestic conferences. He has served as a reviewer for several journals and conferences and has delivered tutorials at major computer architecture conferences and at companies.
Veljko Milutinović received his Ph.D. in Electrical Engineering from the University of Belgrade in 1982. During the 1980s, for about a decade, he was on the faculty of Purdue University, West Lafayette, Indiana, USA, where he coauthored the architecture and design of the world's first DARPA GaAs microprocessor. Since the 1990s, after returning to Serbia, he has been on the faculty of the School of Electrical Engineering, University of Belgrade, where he teaches courses related to computer engineering, sensor networks, and data mining. During the 1990s, he also took part in teaching at Purdue University, Stanford, and MIT. After the year 2000, he participated in several FP6 and FP7 projects through collaboration with leading universities and industries in the EU/US, including Microsoft, Intel, IBM, Ericsson, and especially Maxeler. He has lectured by invitation at over 100 European universities. He has published about 50 papers in SCI journals and about 20 books with major publishers in the United States. Professor Milutinović is a Fellow of the IEEE and a Member of Academia Europaea.