Chapter Five - Manual Parallelization Versus State-of-the-Art Parallelization Techniques: The SPEC CPU2006 as a Case Study
Introduction
The quest for ever greater power and speed of computer systems never ends. Parallel systems have long been regarded as a promising solution for both throughput-oriented and speedup-oriented computing. For years, parallel processing was used predominantly for demanding scientific applications and large-scale systems, but architecture, technology, and application trends have been pushing it rapidly toward commercial computing in medium-scale and even small-scale systems [1].
The importance of parallel processing has been boosted in the last decade by current trends in processor architecture and technology. Until recently, processor performance grew steadily following Moore's law, primarily as a consequence of an ever-increasing number of progressively faster transistors. However, the inability to extract further benefits from instruction-level parallelism (ILP), problems with power dissipation, technology constraints, and the design and verification difficulties of complex superscalars gave rise to chip multiprocessors (CMPs) [2]. Since a CMP places multiple simple superscalar cores on a single chip, a multicore processor retains the benefits of ILP while also issuing instructions from multiple instruction streams (threads) in the same cycle [3]. Parallel processing is thus brought down to the laptop and embedded-system level. Whereas parallelism in superscalars is extracted at the instruction level, transparently to the software designer, the growing level of parallelism in multicores (thread-level parallelism) requires much more intensive involvement of the programmer.
The first part of this chapter presents a survey of parallelization issues and approaches. It reviews some important issues in resolving data dependences through loop transformations in order to improve parallelizability. A broad spectrum of parallelization approaches and contemporary tools is then explored to capture the state of the art in the field. Special attention is devoted to typical parallelization patterns found in sequential applications. The speedups obtained by applying these simple yet powerful patterns are significant, as we show later in our case study.
Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge, however, is to fully utilize the resources of multicore processors to improve the performance of the considerable body of existing general-purpose sequential applications. Since the nature of such applications is best reflected in benchmark suites, the second part of this chapter is a case study on parallelizing the SPEC CPU2006 benchmark suite [4], [5], one of the most representative and widely used suites for uniprocessors. It examines the potential for autoparallelization and vectorization of the SPEC CPU2006 benchmarks in state-of-the-art compilers, as well as efforts to parallelize its applications outside the suite. To achieve an additional level of parallelism, the study concentrates on manual source-to-source code modifications guided by profiling information. To this end, the SPEC CPU2006 benchmarks are carefully examined for places where typical parallelization patterns can be applied efficiently. Finally, the resulting speedups are evaluated on two large parallel machines. The evaluation environment and methodology are also described to illustrate the entire process in detail. Based on this experience, some general indications of where to find parallelization potential are discussed.
Section snippets
Parallelization Theory
The main goal of parallelization is to decrease application execution time as much as possible. Success in parallelization is mainly reflected in a performance indicator called speedup, which shows how much faster a parallel program runs than the corresponding sequential program. Amdahl's law determines the maximum speedup when only part of the program can be parallelized: the speedup of a program is limited by the time spent in its nonparallelized section.
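Amdahl's bound can be sketched numerically. The small helper below is purely illustrative (the function name and parameters are ours, not from the chapter); it assumes a fraction p of the sequential runtime is perfectly parallelizable across n processors:

```python
def amdahl_speedup(p, n):
    """Upper bound on speedup when a fraction p of the sequential
    runtime is parallelized perfectly across n processors.
    The serial fraction (1 - p) runs at its original speed and
    therefore caps the achievable speedup at 1 / (1 - p)."""
    return 1.0 / ((1.0 - p) + p / n)

# A 90%-parallel program gains only about 3x on 4 cores, and even
# with a million cores it can never exceed the 10x serial ceiling.
four_cores = amdahl_speedup(0.9, 4)
many_cores = amdahl_speedup(0.9, 10**6)
```

Note how quickly the serial fraction dominates: going from 4 cores to a million buys less than a 7x further improvement here, which is why shrinking the nonparallelized section matters as much as adding processors.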
Parallelization Techniques and Tools
Some sequential programs can easily be transformed into parallel counterparts consisting of tasks that run independently. Such parallel programs, known as embarrassingly parallel [1], execute very efficiently and scale perfectly, since there is no communication among the tasks. Examples include genetic algorithms and brute-force searches in cryptography. For such programs, parallel paradigms, models, and languages are presented. For sequential programs that cannot be
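The task independence described above can be sketched as follows; the toy "cipher", the function names, and the worker count are our own illustrative assumptions, not taken from the chapter. Each candidate key is checked without touching shared state, so the key space can be partitioned freely among workers:

```python
from concurrent.futures import ThreadPoolExecutor

def check_key(key):
    """Independently test one candidate key. There is no shared
    state, so any number of these checks may run concurrently."""
    # Toy 'cipher' (illustrative only): the secret key is the one
    # whose cube ends in the digits 867.
    return key if (key ** 3) % 1000 == 867 else None

def brute_force(key_space, workers=4):
    # Embarrassingly parallel: the key space is simply split among
    # workers and results collected; no task ever waits on another.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return [k for k in pool.map(check_key, key_space) if k is not None]
```

With processes instead of threads, the same structure spreads across cores with essentially no coordination cost, which is why such workloads scale almost linearly.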
About Manual Parallelization
Automatic parallelization is quite attractive since it relieves the programmer of a great burden; however, the approach has its disadvantages. It is limited in detecting data dependences and is thus inappropriate for complex code. As we will see, it can even induce performance degradation. Under these conditions, a manual parallelization approach, in which the programmer is directly involved in identifying and implementing parallelism, is preferred. This section discusses the issues in manual
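One transformation a programmer commonly applies by hand is sketched below, with our own illustrative names: a sum reduction whose shared accumulator creates a loop-carried dependence is rewritten into per-worker private partial sums that are combined afterwards. This is, in spirit, the same rewriting an OpenMP `reduction(+:sum)` clause performs:

```python
from concurrent.futures import ThreadPoolExecutor

def partial_sum(chunk):
    """Each worker reduces its own chunk into a private accumulator,
    removing the loop-carried dependence on one shared sum."""
    acc = 0
    for x in chunk:
        acc += x
    return acc

def parallel_sum(data, workers=4):
    # Manually split the iteration space into contiguous chunks...
    size = (len(data) + workers - 1) // workers
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # ...then combine the private partial results at the end.
        return sum(pool.map(partial_sum, chunks))
```

An autoparallelizer that cannot prove the accumulator updates commute must leave such a loop sequential; the programmer, knowing that addition is associative, can perform this privatization safely by hand.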
Case Study: Parallelization of SPEC CPU2006
Now that virtually all contemporary processors are multicores, the benefits of multithreaded, parallel workloads are easily exploited. The real challenge, however, is to fully utilize the resources of multicore processors to improve the performance of the large number of existing general-purpose sequential applications. Since the nature of such applications is best reflected in uniprocessor benchmark suites, the focus of this case study is the parallelization of SPEC CPU2006 [4], [5] as one of the most
Conclusion
The rapidly growing acceptance of parallel systems and multicore processors emphasizes the importance of using their resources more efficiently when executing sequential programs. The problem of parallelizing such programs is therefore imminent. This chapter first explores the theoretical background of the field and surveys various parallelization approaches and tools. As a case study, it also examines the manual parallelization of the standard SPEC CPU2006 benchmark suite. Automatic
Acknowledgments
A. V. would like to thank Prof. Lawrence Rauchwerger, who accepted him for an internship, for many helpful suggestions, and the entire Parasol Laboratory faculty and staff at Texas A&M University, who were of great help in configuring machines and starting jobs on them.
References (91)
- et al., Parallelization of DOALL and DOACROSS loops—a survey
- et al., Parallel Computer Architecture: A Hardware/Software Approach (1998)
- et al., Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency (2007)
- The Landscape of Parallel Computing Research: A View from Berkeley, Technical report UCB/EECS-2006-183 (2006)
- SPEC CPU2006. http://www.spec.org/cpu2006/ (accessed 23 July...
- SPEC CPU2006 benchmark descriptions, SIGARCH Comput. Archit. News (2006)
- et al., Parallel Programming in OpenMP (2001)
- et al., Optimizing Compilers for Modern Architectures: A Dependence-Based Approach (2001)
- et al., Compilers: Principles, Techniques, and Tools (1986)
- et al., The LRPD test: speculative run-time parallelization of loops with privatization and reduction parallelization, IEEE Trans. Parallel Distrib. Syst. (1999)
- MapReduce: simplified data processing on large clusters
- The implementation of the Cilk-5 multithreaded language
- The Cilk++ concurrency platform
- The Sisal model of functional programming and its implementation
- Introduction to UPC and Language Specification, Technical report CCS-TR-99-157
- Intel Threading Building Blocks: Outfitting C++ for Multicore Processor Parallelism
- STAPL: an adaptive, generic parallel C++ library
- Algorithmic Skeletons: Structured Management of Parallel Computation
- A survey of algorithmic skeleton frameworks: high-level structured parallel programming enablers, Softw. Pract. Exp.
- Skandium: multi-core programming with algorithmic skeletons
- Two fundamental concepts in skeletal parallel programming
- Consistency analysis in Bloom: a CALM and collected approach
- Dryad: distributed data-parallel programs from sequential building blocks
- Efficient code generation for automatic parallelization and optimization
- CHiLL: A Framework for Composing High-Level Loop Transformations, Technical report 08-897
- Kremlin: rethinking and rebooting gprof for the multicore age
- Maximizing multiprocessor performance with the SUIF compiler, Computer
- Polaris: An Optimizing Compiler for Parallel Workstations and Scalable Multiprocessors, Technical report 1475
- The value evolution graph and its use in memory reference analysis
- Cetus: a source-to-source compiler infrastructure for multicores, Computer
- Towards automatic translation of OpenMP to MPI
- Nanos Mercurium: a research compiler for OpenMP
- Merge: a programming model for heterogeneous multi-core systems
- Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations
- A compile/run-time environment for the automatic transformation of linked list data structures, Int. J. Parallel Prog.
- Defining and Implementing Commutativity Conditions for Parallel Execution, Technical report
- Introduction to Algorithms
Cited by (5)
- Distributing and Parallelizing Non-canonical Loops. 2023, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
- Influence of loop transformations on performance and energy consumption of the multithreaded WZ factorization. 2022, Proceedings of the 17th Conference on Computer Science and Intelligence Systems, FedCSIS 2022
- Modelling flood events with a cumulant CO lattice Boltzmann shallow water model. 2021, Natural Hazards
About the Authors
Aleksandar Vitorović received a Bachelor degree in Computer Science in 2008 and a Master degree in Computer Science in 2010 from the School of Electrical Engineering, University of Belgrade. In 2010, he started a Ph.D. program at EPFL, Lausanne, Switzerland. His main research interests are program parallelization and distributed systems.
Milo Tomašević was born in Nikšić, Montenegro. He received his B.Sc. in Electrical Engineering and his M.Sc. and Ph.D. in Computer Engineering from the University of Belgrade, Serbia, in 1980, 1984, and 1992, respectively. He is currently an Associate Professor and Head of the Department of Computer Engineering, School of Electrical Engineering, University of Belgrade, Serbia. He was previously with the Pupin Institute, Belgrade, for over a decade, where he was involved in many research and development projects. His current research interests are mainly in computer architecture (especially multiprocessor systems), parallel programming, cryptography, and algorithms and data structures. In these areas, he has published almost 100 papers in international scientific journals, books, and proceedings of international and domestic conferences. He has served as a reviewer for several journals and conferences and has delivered tutorials at major computer architecture conferences and at companies.
Veljko Milutinović received his Ph.D. in Electrical Engineering from the University of Belgrade in 1982. During the 1980s, for about a decade, he was on the faculty of Purdue University, West Lafayette, Indiana, USA, where he coauthored the architecture and design of the world's first DARPA GaAs microprocessor. Since the 1990s, after returning to Serbia, he has been on the faculty of the School of Electrical Engineering, University of Belgrade, where he teaches courses related to computer engineering, sensor networks, and data mining. During the 1990s, he also took part in teaching at Purdue University, Stanford, and MIT. After the year 2000, he participated in several FP6 and FP7 projects through collaboration with leading universities and industries in the EU/US, including Microsoft, Intel, IBM, Ericsson, and especially Maxeler. He has lectured by invitation at over 100 European universities. He has published about 50 papers in SCI journals and about 20 books with major publishers in the United States. Professor Milutinović is a Fellow of the IEEE and a Member of Academia Europaea.