Cache-Friendly implementations of transitive closure

Authors:
Michael Penner, University of Southern California, Los Angeles, California
Viktor K. Prasanna, University of Southern California, Los Angeles, California

ACM Journal of Experimental Algorithmics, Volume 11, pp. 1.3–es, https://doi.org/10.1145/1187436.1210586

Published: 09 February 2007

Abstract

Cache performance has been well studied in recent years: compiler optimizations exist, and optimizations have been developed for many individual problems. Much of this work has focused on dense linear algebra problems. At first glance, the Floyd–Warshall algorithm appears to fall into this category. In this paper, we begin by applying two standard cache-friendly optimizations to the Floyd–Warshall algorithm and show that they yield only limited performance improvements. We then discuss the unidirectional space-time representation (USTR). We show analytically that, for a large class of algorithms, the USTR can be used to reduce the amount of processor-memory traffic by a factor of O(√C), where C is the cache size. Since the USTR leads to a tiled implementation, we develop a tile size selection heuristic that narrows the search space for the tile size that minimizes total execution time. Using the USTR, we develop a cache-friendly implementation of the Floyd–Warshall algorithm. We show experimentally that this implementation minimizes level-1 cache misses, level-2 cache misses, and TLB misses and, therefore, exhibits the best overall performance. With this implementation, we show a 2x improvement in performance over the best compiler-optimized implementation on three different architectures. Finally, we show analytically that our implementation of the Floyd–Warshall algorithm is asymptotically optimal with respect to processor-memory traffic. We report experimental results for the Pentium III, Alpha, and MIPS R12000 machines using problem sizes between 1024 and 2048 vertices, and we demonstrate improved cache performance using the SimpleScalar simulator.
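To make the idea of a tiled Floyd–Warshall implementation concrete, the following is a minimal Python sketch of a commonly used three-phase blocked formulation, checked against the plain triple loop. It is an illustration only: the abstract does not give the authors' USTR-derived code, so the function names (`floyd_warshall`, `tiled_floyd_warshall`, `_update_tile`), the tile parameter `b`, and the phase ordering are assumptions of this sketch. It also assumes a distance matrix with zeros on the diagonal and no negative cycles.

```python
def floyd_warshall(dist):
    """Plain triple-loop Floyd-Warshall; returns a new distance matrix."""
    n = len(dist)
    d = [row[:] for row in dist]
    for k in range(n):
        for i in range(n):
            dik = d[i][k]
            for j in range(n):
                via = dik + d[k][j]
                if via < d[i][j]:
                    d[i][j] = via
    return d


def _update_tile(d, i0, j0, k0, b, n):
    """Relax the tile with top-left corner (i0, j0) through the
    intermediate vertices k0 .. k0+b-1 (clipped to the matrix size)."""
    for k in range(k0, min(k0 + b, n)):
        for i in range(i0, min(i0 + b, n)):
            dik = d[i][k]
            for j in range(j0, min(j0 + b, n)):
                via = dik + d[k][j]
                if via < d[i][j]:
                    d[i][j] = via


def tiled_floyd_warshall(dist, b):
    """Blocked Floyd-Warshall with b x b tiles (hypothetical sketch).

    For each block of intermediate vertices:
      phase 1: update the diagonal tile,
      phase 2: update the tiles in the same tile row and tile column,
      phase 3: update all remaining tiles.
    """
    n = len(dist)
    d = [row[:] for row in dist]
    starts = range(0, n, b)
    for k0 in starts:
        _update_tile(d, k0, k0, k0, b, n)          # phase 1
        for t in starts:                           # phase 2
            if t != k0:
                _update_tile(d, k0, t, k0, b, n)   # tile row k0
                _update_tile(d, t, k0, k0, b, n)   # tile column k0
        for i0 in starts:                          # phase 3
            if i0 == k0:
                continue
            for j0 in starts:
                if j0 != k0:
                    _update_tile(d, i0, j0, k0, b, n)
    return d
```

In a compiled implementation, `b` would typically be chosen so that the tiles touched by one `_update_tile` call fit in cache together; that rule of thumb is an assumption here and is not the tile size selection heuristic developed in the paper.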

Index Terms

Cache-Friendly implementations of transitive closure
1. Mathematics of computing
  1. Mathematical analysis
    1. Numerical analysis
2. Theory of computation
  1. Design and analysis of algorithms
  2. Randomness, geometry and discrete structures


Published in
ACM Journal of Experimental Algorithmics, Volume 11 (2006), 355 pages
ISSN: 1084-6654
EISSN: 1084-6654
DOI: 10.1145/1187436

Copyright © 2007 ACM
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
Data structures
Floyd–Warshall algorithm
systolic array algorithms
Qualifiers
- article
