Abstract
Cache performance has been studied extensively in recent years. Compiler optimizations exist, and optimized implementations have been developed for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd--Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and show that they yield only limited performance improvements. We then discuss the unidirectional space-time representation (USTR). We show analytically that, for a large class of algorithms, the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic that narrows the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd--Warshall algorithm. We show experimentally that this implementation minimizes level-1 and level-2 cache misses and TLB misses and, therefore, exhibits the best overall performance. With this implementation, we show a 2x improvement in performance over the best compiler-optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd--Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We present experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices, and we demonstrate the improved cache performance using the SimpleScalar simulator.
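To make the idea of a tiled Floyd--Warshall concrete, the following is a minimal sketch of the widely used three-phase blocked formulation, shown next to the plain triple loop. It is an illustration under stated assumptions, not the USTR-derived implementation described in the abstract, whose details are not given here. The function names, the tile-size parameter `b`, and the `random_graph` helper are hypothetical, and edge weights are assumed to be non-negative. As a rough rule of thumb (also an assumption, not the paper's tile size selection heuristic), `b` would be chosen so that the three tiles touched by one update fit in cache, i.e. about 3b² matrix elements ≤ C.

```python
import random

INF = float("inf")


def floyd_warshall(dist):
    """Baseline triple loop; updates the n x n matrix `dist` in place."""
    n = len(dist)
    for k in range(n):
        for i in range(n):
            d_ik = dist[i][k]
            for j in range(n):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]
    return dist


def _update_tile(dist, bi, bj, bk, b, n):
    """Relax the tile starting at (bi, bj) through vertices bk .. bk+b-1."""
    for k in range(bk, min(bk + b, n)):
        for i in range(bi, min(bi + b, n)):
            d_ik = dist[i][k]
            for j in range(bj, min(bj + b, n)):
                if d_ik + dist[k][j] < dist[i][j]:
                    dist[i][j] = d_ik + dist[k][j]


def floyd_warshall_tiled(dist, b):
    """Three-phase blocked Floyd--Warshall with b x b tiles (hypothetical sketch)."""
    n = len(dist)
    starts = range(0, n, b)
    for bk in starts:
        # Phase 1: the diagonal tile depends only on itself.
        _update_tile(dist, bk, bk, bk, b, n)
        # Phase 2: tiles sharing a row or column with the diagonal tile.
        for bx in starts:
            if bx != bk:
                _update_tile(dist, bk, bx, bk, b, n)  # same tile row
                _update_tile(dist, bx, bk, bk, b, n)  # same tile column
        # Phase 3: all remaining tiles, using the phase-2 results.
        for bi in starts:
            if bi == bk:
                continue
            for bj in starts:
                if bj != bk:
                    _update_tile(dist, bi, bj, bk, b, n)
    return dist


def random_graph(n, seed=0, density=0.3):
    """Hypothetical helper: random directed graph with integer weights 1..20."""
    rng = random.Random(seed)
    g = [[INF] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0
        for j in range(n):
            if i != j and rng.random() < density:
                g[i][j] = rng.randint(1, 20)
    return g


if __name__ == "__main__":
    g = random_graph(13, seed=1)
    plain = floyd_warshall([row[:] for row in g])
    tiled = floyd_warshall_tiled([row[:] for row in g], 4)
    print("results match:", plain == tiled)
```

Both functions perform the same O(N³) relaxations; the tiled version only reorders them so that each call to `_update_tile` works on at most three b x b tiles. In pure Python the cache effect will not be visible, so the sketch is meant to show the loop structure and to check that the reordering preserves the result, not to reproduce the reported speedups.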