Elsevier

Advances in Computers

Volume 92, 2014, Pages 203-251

Chapter Five - Manual Parallelization Versus State-of-the-Art Parallelization Techniques: The SPEC CPU2006 as a Case Study

https://doi.org/10.1016/B978-0-12-420232-0.00005-2

Abstract

As multiprocessors (on-chip, off-chip, or both), modern computer systems can readily exploit the benefits of parallel programs, yet their resources remain underutilized when executing the still-prevailing sequential applications. An obvious solution is the parallelization of such applications. The first part of this chapter overviews the broad issues in parallelization: various parallelization approaches and contemporary software and hardware tools for extracting parallelism from sequential applications are studied, and typical code patterns amenable to parallelization are identified. The second part is a case study in which the SPEC CPU2006 suite is taken as a representative collection of typical sequential applications. It discusses the possibilities and potential of automatic parallelization and vectorization of the sequential C++ applications from the CPU2006 suite. Since this potential is generally limited, it then explores the issues in manual parallelization of these applications. After the previously identified patterns are applied through source-to-source code modifications, the effects of parallelization are evaluated by profiling and by execution on two representative parallel machines. Finally, the presented results are discussed in detail.

Introduction

There is an everlasting quest for greater power and speed in computer systems. Parallel systems have long been considered a promising solution for both throughput-oriented and speedup-oriented computing. Over the years, parallel processing was used predominantly for demanding scientific applications and large-scale systems, but architecture, technology, and application trends have been rapidly pushing it toward commercial computing in medium-scale and even small-scale systems [1].

The importance of parallel processing has been boosted in the last decade by current trends in processor architecture and technology. Until recently, processor performance grew steadily, following Moore's law, primarily as a consequence of an ever-increasing number of progressively faster transistors. However, the inability to extract further benefits from instruction-level parallelism (ILP), the problems with power dissipation, technology constraints, and the design and verification difficulties of complex superscalars gave rise to chip multiprocessors (CMPs) [2]. Since a CMP includes multiple simple superscalar cores on a chip, it retains the benefits of ILP while also issuing multiple instructions per cycle from multiple instruction streams (threads) [3]. Hence, parallel processing has been brought down to the laptop and embedded-system level. While parallelism in superscalars is extracted at the instruction level, transparently to the software designer, exploiting the thread-level parallelism of multicores requires much more intensive involvement of the programmer.

The first part of this chapter is a survey of parallelization issues and approaches. It reviews some important issues in resolving data dependences through loop transformations in order to improve parallelizability. Then, a broad spectrum of parallelization approaches and contemporary tools is explored to establish the state of the art in the field. Special attention is devoted to typical parallelization patterns found in sequential applications. The speedups obtained by applying these simple yet powerful patterns are significant, as we show later in our case study.

Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge nowadays, however, is to fully utilize the resources of multicore processors to improve the performance of the considerable number of existing general-purpose sequential applications. Since the nature of such applications is best reflected in benchmark suites, the second part of this chapter is a case study on the parallelization of the SPEC CPU2006 benchmark suite [4], [5], one of the most representative and widely used suites for uniprocessors. It examines the potential of autoparallelization and vectorization of the SPEC CPU2006 benchmarks in state-of-the-art compilers, as well as prior efforts to parallelize these applications outside the suite. To achieve an additional level of parallelism, the study makes manual source-to-source code modifications guided by profiling information. To this end, the SPEC CPU2006 benchmarks are carefully examined for places where typical parallelization patterns can be applied efficiently. Finally, the resulting speedups are evaluated on two large parallel machines. The evaluation environment and methodology are also described to illustrate the details of the entire process. Based on this experience, some general indications of where to find parallelization potential are discussed.
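
For illustration, consider the kind of loop that state-of-the-art compilers can typically vectorize on their own (a minimal example of ours, not code from the suite; the flags shown are one common way to request a vectorization report):

    // saxpy.cpp -- a loop shape that mainstream compilers can usually
    // auto-vectorize at high optimization levels
    // (e.g., g++ -O3 -fopt-info-vec saxpy.cpp reports vectorized loops).
    // Iterations are independent and access memory with unit stride,
    // so the compiler can emit SIMD instructions for the loop body.
    #include <cstddef>
    #include <vector>

    void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
        for (std::size_t i = 0; i < x.size(); ++i)
            y[i] = a * x[i] + y[i];
    }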

Section snippets

Parallelization Theory

The main goal of parallelization is to decrease the application execution time as much as possible. Success in parallelization is mainly reflected in a performance indicator referred to as speedup, which shows how much faster a parallel program is than the corresponding sequential program. Amdahl's law determines the maximum speedup when only part of the program can be parallelized: the speedup of a program is limited by the time spent in its nonparallelized section.
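
In its standard formulation (stated here for completeness rather than quoted from the chapter), Amdahl's law gives the speedup on n processors when a fraction p of the sequential execution time can be parallelized:

    \[
      S(n) = \frac{1}{(1 - p) + \dfrac{p}{n}},
      \qquad \lim_{n \to \infty} S(n) = \frac{1}{1 - p}.
    \]

For example, if 90% of the execution time is parallelizable (p = 0.9), the speedup can never exceed 10, no matter how many processors are used.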

Parallelization Techniques and Tools

Some sequential programs can easily be transformed into parallel counterparts consisting of tasks that run independently. Such parallel programs, known as embarrassingly parallel [1], can be executed very efficiently and scale perfectly, since there is no communication among the tasks; a sketch of this case follows below. Some examples are genetic algorithms and brute-force searches in cryptography. For such programs, parallel paradigms, models, and languages are presented. For sequential programs that cannot be
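
As a minimal sketch of the embarrassingly parallel case described above (our own illustration, assuming OpenMP; the predicate is a toy stand-in for an expensive per-key test such as a trial decryption):

    // brute_force.cpp -- compile with: g++ -O2 -fopenmp brute_force.cpp
    // Every iteration tests one candidate key independently of all others,
    // so the loop is embarrassingly parallel and scales with core count.
    #include <cstdint>
    #include <cstdio>

    static bool try_key(std::uint64_t key) {
        // Toy stand-in for an expensive test of one key.
        return key * 2654435761ULL % 1000003ULL == 42;
    }

    int main() {
        const std::int64_t n = 10000000;
        std::int64_t found = -1;
        #pragma omp parallel for
        for (std::int64_t k = 0; k < n; ++k) {
            if (try_key(static_cast<std::uint64_t>(k))) {
                #pragma omp critical
                found = k;             // rare event; record any match
            }
        }
        std::printf("match at key %lld\n", static_cast<long long>(found));
        return 0;
    }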

About Manual Parallelization

Automatic parallelization is quite attractive since it relieves the programmer of a great burden; however, the approach has its disadvantages. It is limited in detecting data dependences and thus inappropriate for complex code. As we will see, it can even induce performance degradation. Under such conditions, manual parallelization, where the programmer is directly involved in identifying and implementing the parallelism, is preferred. This section discusses the issues in manual
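
A typical case (our example, not taken from the chapter) is a reduction loop: an automatic parallelizer that cannot prove the floating-point accumulation may be reordered, or that the pointers do not alias, will conservatively leave the loop sequential, whereas the programmer can assert the pattern directly:

    // reduction.cpp -- compile with: g++ -O2 -fopenmp reduction.cpp
    // The loop-carried dependence on `sum` blocks naive automatic
    // parallelization; stating the reduction explicitly lets each
    // thread keep a private partial sum that is combined at the end.
    #include <cstdio>

    double dot(const double* a, const double* b, int n) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < n; ++i)
            sum += a[i] * b[i];
        return sum;
    }

    int main() {
        double a[4] = {1, 2, 3, 4}, b[4] = {4, 3, 2, 1};
        std::printf("%f\n", dot(a, b, 4));  // prints 20.000000
        return 0;
    }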

Case Study: Parallelization of SPEC CPU2006

Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge nowadays, however, is to fully utilize the resources of multicore processors to improve the performance of the large body of existing general-purpose sequential applications. Since the nature of such applications is best reflected in uniprocessor benchmark suites, the focus of this case study is the parallelization of SPEC CPU2006 [4], [5] as one of the most
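
To make the source-to-source approach concrete, the following sketch (our illustration; the loop and all names are hypothetical, not code from the suite) shows the typical shape of such a modification: a hot loop found by profiling whose iterations write disjoint elements is parallelized by inserting a work-sharing pragma, with the temporary declared inside the loop body so that it is private to each thread:

    // A hot loop reported by the profiler; each iteration writes a
    // distinct out[i], and `t` is private to each thread, so adding
    // the pragma is the entire source-to-source modification.
    void smooth(const float* in, float* out, int n) {
        #pragma omp parallel for
        for (int i = 1; i < n - 1; ++i) {
            float t = in[i - 1] + in[i] + in[i + 1];
            out[i] = t / 3.0f;
        }
    }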

Conclusion

The rapidly growing acceptance of parallel systems and multicore processors emphasizes the importance of using their resources more efficiently when executing sequential programs. Therefore, the problem of parallelizing such programs is a pressing one. The chapter first explores the theoretical background of the field and overviews various parallelization approaches and tools. As a case study, the chapter also examines the manual parallelization of the standard SPEC CPU2006 benchmark suite. Automatic

Acknowledgments

A. V. would like to thank Prof. Lawrence Rauchwerger, who accepted him for an internship, for his many helpful suggestions, and also the entire Parasol Laboratory faculty and staff at Texas A&M University, who helped him greatly in configuring the machines and running jobs on them.


References (91)

  • A.R. Hurson et al., Parallelization of DOALL and DOACROSS loops—a survey.
  • D. Culler et al., Parallel Computer Architecture: A Hardware/Software Approach (1998).
  • K. Olukotun et al., Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency (2007).
  • K. Asanovic, The Landscape of Parallel Computing Research: A View from Berkeley, Technical report UCB/EECS-2006-183 (2006).
  • SPEC CPU2006. http://www.spec.org/cpu2006/ (accessed 23 July...
  • J.L. Henning, SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News (2006).
  • R. Chandra et al., Parallel Programming in OpenMP (2001).
  • R. Allen et al., Optimizing Compilers for Modern Architectures: A Dependence-Based Approach (2001).
  • A.V. Aho et al., Compilers: Principles, Techniques, and Tools (1986).
  • L. Rauchwerger et al., The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization, IEEE Trans. Parallel Distrib. Syst. (1999).
  • J. Dean et al., MapReduce: simplified data processing on large clusters.
  • M. Wolfe, Understanding the CUDA data parallel threading model. http://www.pgroup.com/lit/articles/insider/v2n1a5.htm...
  • Message Passing Interface (MPI) tutorial. https://computing.llnl.gov/tutorials/mpi/ (accessed 23 July...
  • OpenMP tutorial. https://computing.llnl.gov/tutorials/openMP/ (accessed 23 July...
  • M. Frigo et al., The implementation of the Cilk-5 multithreaded language.
  • C.E. Leiserson, The Cilk++ concurrency platform.
  • J.-L. Gaudiot et al., The Sisal model of functional programming and its implementation.
  • W.W. Carlson et al., Introduction to UPC and Language Specification, Technical report CCS-TR-99-157 (1999).
  • J. Reinders, Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism (2007).
  • P. An et al., STAPL: an adaptive, generic parallel C++ library.
  • M. Cole, Algorithmic Skeletons: Structured Management of Parallel Computation (1991).
  • H. González‐Vélez et al., A survey of algorithmic skeleton frameworks: high‐level structured parallel programming enablers, Softw. Pract. Exp. (2010).
  • M. Leyton et al., Skandium: multi-core programming with algorithmic skeletons.
  • A. Benoit et al., Two fundamental concepts in skeletal parallel programming.
  • P. Alvaro et al., Consistency analysis in Bloom: a CALM and collected approach.
  • M. Isard et al., Dryad: distributed data-parallel programs from sequential building blocks.
  • Apache™ Hadoop®. http://hadoop.apache.org/ (accessed 23 July...
  • MathWorks MATLAB. http://www.mathworks.com/products/matlab/ (accessed 23 July...
  • M. Griebl, Automatic parallelization of loop programs for distributed memory architectures, habilitation thesis,...
  • C. Bastoul, Efficient code generation for automatic parallelization and optimization.
  • C. Chen et al., CHiLL: A Framework for Composing High-Level Loop Transformations, Technical report 08-897 (2008).
  • S. Garcia et al., Kremlin: rethinking and rebooting gprof for the multicore age.
  • M.W. Hall et al., Maximizing multiprocessor performance with the SUIF compiler, Computer (1996).
  • D. Padua, Polaris: An Optimizing Compiler for Parallel Workstations and Scalable Multiprocessors, Technical report 1475 (1996).
  • S. Rus et al., The value evolution graph and its use in memory reference analysis.
  • C. Dave et al., Cetus: a source-to-source compiler infrastructure for multicores, Computer (2009).
  • D. Quinlan et al., ROSE user manual: a tool for building source-to-source translators. http://rosecompiler.org/...
  • A. Basumallik et al., Towards automatic translation of OpenMP to MPI.
  • DMS Software Reengineering Toolkit. http://www.semdesigns.com/products/DMS/DMSToolkit.html (accessed 23 July...
  • M. Gonzàlez, Nanos mercurium: a research compiler for OpenMP.
  • M.D. Linderman et al., Merge: a programming model for heterogeneous multi-core systems.
  • V.T. Ravi et al., Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations.
  • H.L.A. van der Spek et al., A compile/run-time environment for the automatic transformation of linked list data structures, Int. J. Parallel Prog. (2008).
  • M. Kulkarni et al., Defining and Implementing Commutativity Conditions for Parallel Execution, Technical report (2009).
  • T.H. Cormen et al., Introduction to Algorithms (2003).

About the Authors

Aleksandar Vitorović received a Bachelor degree in Computer Science in 2008 and a Master degree in Computer Science in 2010 from the School of Electrical Engineering, University of Belgrade. In 2010, he started a Ph.D. program at EPFL, Lausanne, Switzerland. His main research interests are program parallelization and distributed systems.

Milo Tomašević was born in Nikšić, Montenegro. He received his B.Sc. in Electrical Engineering and his M.Sc. and Ph.D. in Computer Engineering from the University of Belgrade, Serbia, in 1980, 1984, and 1992, respectively. He is currently an Associate Professor and Head of the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade, Serbia. He was previously with the Pupin Institute, Belgrade, for over a decade, where he was involved in many research and development projects. His current research interests are mainly in computer architecture (especially multiprocessor systems), parallel programming, cryptography, and algorithms and data structures. In these areas, he has published almost 100 papers in international scientific journals, books, and the proceedings of international and domestic conferences. He has served as a reviewer for several journals and conferences and has delivered tutorials at major computer architecture conferences and to companies.

Veljko Milutinović received his Ph.D. in Electrical Engineering from the University of Belgrade in 1982. During the 1980s, for about a decade, he was on the faculty of Purdue University, West Lafayette, Indiana, USA, where he coauthored the architecture and design of the world's first DARPA GaAs microprocessor. Since the 1990s, after returning to Serbia, he has been on the faculty of the School of Electrical Engineering, University of Belgrade, where he teaches courses in computer engineering, sensor networks, and data mining. During the 1990s, he also took part in teaching at Purdue, Stanford, and MIT. After 2000, he participated in several FP6 and FP7 projects through collaboration with leading universities and industries in the EU/US, including Microsoft, Intel, IBM, Ericsson, and especially Maxeler. He has lectured by invitation at over 100 European universities. He has published about 50 papers in SCI journals and about 20 books with major publishers in the United States. Professor Milutinović is a Fellow of the IEEE and a Member of Academia Europaea.
