Cache-Friendly implementations of transitive closure

Published: 09 February 2007

Abstract

Cache performance has been well studied in recent years. Compiler optimizations exist, and optimizations have been developed for many problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd--Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd--Warshall algorithm and show that they yield only limited performance improvements. We then discuss the unidirectional space-time representation (USTR). We show analytically that, for a large class of algorithms, the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic that narrows the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd--Warshall algorithm. We show experimentally that this implementation minimizes level-1 cache misses, level-2 cache misses, and TLB misses and, therefore, exhibits the best overall performance. Using this implementation, we show a 2x improvement in performance over the best compiler-optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd--Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We show experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices. We demonstrate improved cache performance using the SimpleScalar simulator.
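
The abstract does not spell out the USTR construction, so the following is only a minimal sketch of what a tiled transitive closure can look like. It uses the widely known three-phase blocked variant of the Floyd--Warshall (Warshall) recurrence, which is an assumption on our part and may differ from the implementation developed in the paper. The names `warshall`, `tiled_warshall`, `_update_tile`, and the tile-size parameter `b` are hypothetical. Pure Python will not exhibit the cache effects discussed above; the sketch is meant to show the loop structure only.

```python
import random


def warshall(reach):
    """Baseline: untiled triple loop over a square boolean adjacency matrix."""
    n = len(reach)
    r = [row[:] for row in reach]  # work on a copy
    for k in range(n):
        for i in range(n):
            if r[i][k]:
                for j in range(n):
                    if r[k][j]:
                        r[i][j] = True
    return r


def _update_tile(r, i0, j0, k0, b, n):
    """Update tile (i0, j0) using intermediate vertices k0 .. k0 + b - 1."""
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            if r[i][k]:
                for j in range(j0, min(j0 + b, n)):
                    if r[k][j]:
                        r[i][j] = True


def tiled_warshall(reach, b):
    """Blocked closure with b x b tiles (assumed three-phase scheme, not USTR)."""
    n = len(reach)
    r = [row[:] for row in reach]
    starts = range(0, n, b)
    for k0 in starts:
        # Phase 1: the diagonal tile depends only on itself.
        _update_tile(r, k0, k0, k0, b, n)
        # Phase 2: tiles sharing a row or column with the diagonal tile.
        for t in starts:
            if t != k0:
                _update_tile(r, k0, t, k0, b, n)  # same tile row
                _update_tile(r, t, k0, k0, b, n)  # same tile column
        # Phase 3: all remaining tiles read one row tile and one column tile.
        for i0 in starts:
            if i0 == k0:
                continue
            for j0 in starts:
                if j0 != k0:
                    _update_tile(r, i0, j0, k0, b, n)
    return r


def random_graph(n, p, seed):
    """Hypothetical helper: random directed graph as a boolean matrix."""
    rng = random.Random(seed)
    return [[rng.random() < p for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    g = random_graph(37, 0.05, seed=0)
    # Tile sizes that do not divide n exercise the boundary handling.
    print(all(tiled_warshall(g, b) == warshall(g) for b in (1, 4, 8, 16, 64)))
```

In a compiled language, `b` would typically be chosen so that the three tiles touched by one `_update_tile` call fit in cache together (roughly 3b² elements at most C). Each tile update then performs on the order of b³ operations on b² data, which is the intuition behind a traffic reduction proportional to √C.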
