Exploiting Network Locality for CC-NUMA Multiprocessors

The Journal of Supercomputing

Abstract

Rapid advances in multiprocessor interconnection networks are closing the gap between computation and communication. Given this trend, how can we utilize fast interconnects? This study proposes an enhanced CC-NUMA architecture, called Depot-NUMA, which views the congregation of the private caches in all nodes as a large remote access cache. Fast interconnects allow a missing block to be fetched from the private caches of other sharing nodes rather than from the home node. Issues involved in designing Depot-NUMA are also discussed, and a novel routing scheme, called multi-hop, is proposed to communicate between the potential sharers and fetch a missing block from their private caches. The sharers are specified based on a stride function to exploit network locality in the system. The proposed Depot-NUMA design requires only modest modifications to the node controller and coherence protocol. Additionally, the interconnect fabric can be constructed using existing, unmodified commodity interconnects. Furthermore, the application-driven study reveals that Depot-NUMA can reduce the read stall time by up to 41% and is competitive with a CC-NUMA with a large local cache.

Cite this article

Hsiao, HC., King, CT. Exploiting Network Locality for CC-NUMA Multiprocessors. The Journal of Supercomputing 18, 63–87 (2001). https://doi.org/10.1023/A:1008115125409

  • DOI: https://doi.org/10.1023/A:1008115125409
