ABSTRACT
We present cache-efficient chip multiprocessor (CMP) algorithms with good speed-up for some widely used dynamic programming algorithms. We consider three types of caching systems for CMPs: D-CMP, with a private cache for each core; S-CMP, with a single cache shared by all cores; and Multicore, which has private L1 caches and a shared L2 cache. We derive results for three classes of problems: local dependency dynamic programming (LDDP), the Gaussian Elimination Paradigm (GEP), and the parenthesis problem.
For each class of problems, we develop a generic CMP algorithm with an associated tiling sequence. We then tailor this tiling sequence to each caching model and provide a parallel schedule that results in a cache-efficient parallel execution up to the critical path length of the underlying dynamic programming algorithm.
We present experimental results on an 8-core Opteron for two sequence alignment problems that are important examples of LDDP. Our experimental results show good speed-ups for simple versions of our algorithms.
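As a rough illustration of what tiling an LDDP computation can look like, the sketch below computes the length of a longest common subsequence (a simple sequence alignment problem) by splitting the DP table into square tiles and visiting the tiles in anti-diagonal waves. It is not the algorithm or tiling sequence of the paper; the function name `lcs_length_tiled` and the `tile` parameter are assumptions made for this example. The point it shows is that each cell depends only on its left, upper, and upper-left neighbours, so all tiles in one wave are independent of each other and could in principle be assigned to different cores.

```python
def lcs_length_tiled(x, y, tile=4):
    """Return the LCS length of x and y, filling the DP table tile by tile.

    Illustrative sketch only: `tile` is a hypothetical tuning parameter,
    and the waves are executed sequentially here.
    """
    n, m = len(x), len(y)
    # c[i][j] = LCS length of x[:i] and y[:j]; row 0 and column 0 are base cases.
    c = [[0] * (m + 1) for _ in range(n + 1)]
    tiles_i = (n + tile - 1) // tile  # number of tile rows
    tiles_j = (m + tile - 1) // tile  # number of tile columns

    def fill_tile(ti, tj):
        # Fill one tile in row-major order using the standard LCS recurrence.
        for i in range(ti * tile + 1, min((ti + 1) * tile, n) + 1):
            for j in range(tj * tile + 1, min((tj + 1) * tile, m) + 1):
                if x[i - 1] == y[j - 1]:
                    c[i][j] = c[i - 1][j - 1] + 1
                else:
                    c[i][j] = max(c[i - 1][j], c[i][j - 1])

    # Tiles with ti + tj == wave read only from tiles in earlier waves,
    # so the tiles inside one wave do not depend on each other.
    for wave in range(tiles_i + tiles_j - 1):
        first = max(0, wave - tiles_j + 1)
        last = min(wave, tiles_i - 1)
        for ti in range(first, last + 1):
            fill_tile(ti, wave - ti)

    return c[n][m]
```

In this sketch the tile size only changes the order in which cells are visited, never the result; in a cache-aware setting one would presumably choose it so that a tile's working set fits in the cache level being targeted.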