DOI: 10.1145/2048066.2048108 · Research article · SPLASH Conference Proceedings

Kismet: parallel speedup estimates for serial programs

Published: 22 October 2011

ABSTRACT

Software engineers now face the difficult task of refactoring serial programs for parallel execution on multicore processors. Currently, they are offered little guidance as to how much benefit may come from this task, or how close they are to the best possible parallelization. This paper presents Kismet, a tool that creates parallel speedup estimates for unparallelized serial programs. Kismet differs from previous approaches in that it does not require any manual analysis or modification of the program. This difference allows quick analysis of many programs, avoiding wasted engineering effort on those that are fundamentally limited. To accomplish this task, Kismet builds upon the hierarchical critical path analysis (HCPA) technique, a recently developed dynamic analysis that localizes parallelism to each of the potentially nested regions in the target program. It then uses a parallel execution time model to compute an approximate upper bound for performance, modeling constraints that stem from both hardware parameters and internal program structure.
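The upper-bound idea above can be illustrated with a minimal critical-path-style sketch (this is not Kismet's actual model, which accounts for nested regions and hardware constraints): for a program region, the parallelism is the total work divided by the critical path length, and the achievable speedup on n cores cannot exceed either that parallelism or n.

```python
# Illustrative sketch, not Kismet's implementation: a simple
# critical-path-based upper bound on parallel speedup.
# work          = total dynamic work in a region (e.g., instruction count)
# critical_path = length of the longest dependence chain in that region

def speedup_bound(work: float, critical_path: float, cores: int) -> float:
    """Upper bound on speedup: limited by both parallelism and core count."""
    parallelism = work / critical_path
    return min(cores, parallelism)

# A region with 1,000,000 units of work and a 50,000-unit critical path
# has parallelism 20, so even on 32 cores the speedup cannot exceed 20.
print(speedup_bound(1_000_000, 50_000, 32))  # 20.0
print(speedup_bound(1_000_000, 50_000, 8))   # 8
```

This captures why whole-program critical path analysis alone overestimates speedup: a single flat bound ignores the per-region structure that HCPA localizes.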

Our evaluation applies Kismet to eight high-parallelism NAS Parallel Benchmarks running on a 32-core AMD multicore system, five low-parallelism SpecInt benchmarks, and six medium-parallelism benchmarks running on the fine-grained MIT Raw processor. The results are compelling. Kismet is able to significantly improve the accuracy of parallel speedup estimates relative to prior work based on critical path analysis.

