Abstract
In distributed shared-memory (DSM) multiprocessors, a write operation requires multiple messages to invalidate the copies held by the nodes that share and cache the memory block being written. The resulting write stall time impedes the performance of such systems. An effective means of achieving efficient invalidation is to employ multicast messages to reach the sharing nodes. This study evaluates two multicast-based invalidation schemes, dual-path and pruning, through application-driven simulation. Under the experimental settings used herein, multicasts improve invalidation traffic for four of the six real applications evaluated. The remaining two applications are computationally intensive, and multicast-based invalidation is less effective for them. However, since multicasts encourage bursty communication, our results indicate that they help relieve network congestion during these bursty periods. Dual-path performs slightly better than pruning because it is less sensitive to routing delay in the routers. Our results further demonstrate that cache size is an important design parameter for multicast-based invalidation, and that multicast-based invalidation is highly effective for DSM multiprocessors with larger caches.
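As a rough illustration of why multicast can cut invalidation messages, the sketch below partitions a block's sharers into the two destination lists that a dual-path multicast would carry. It assumes a 2-D mesh with the snake-order Hamiltonian-path labeling commonly used for dual-path routing; the function names, the mesh width, and the message-count model (one message per sharer for unicast, at most two multidestination messages for dual-path, acknowledgments ignored) are illustrative assumptions, not details taken from this study.

```python
def snake_label(x, y, width):
    """Hamiltonian-path label of node (x, y): rows alternate direction (assumed labeling)."""
    return y * width + (x if y % 2 == 0 else width - 1 - x)


def dual_path_partition(home, sharers, width):
    """Split sharers into the two ordered destination lists of a dual-path multicast.

    Sharers labeled above the home node are visited in ascending label order;
    sharers labeled below it are visited in descending label order.
    """
    home_label = snake_label(home[0], home[1], width)

    def label(node):
        return snake_label(node[0], node[1], width)

    high = sorted((s for s in sharers if label(s) > home_label), key=label)
    low = sorted((s for s in sharers if label(s) < home_label), key=label, reverse=True)
    return high, low


def invalidation_message_counts(home, sharers, width):
    """Count invalidation messages injected by the home node (acks not modeled)."""
    high, low = dual_path_partition(home, sharers, width)
    return {
        "unicast": len(sharers),                     # one message per sharer
        "dual_path": int(bool(high)) + int(bool(low)),  # at most two messages
    }


if __name__ == "__main__":
    home = (1, 1)
    sharers = [(0, 0), (3, 1), (0, 2), (2, 3)]
    print(dual_path_partition(home, sharers, width=4))
    print(invalidation_message_counts(home, sharers, width=4))
```

In this toy case four unicast invalidations collapse into two multidestination messages; whether that translates into shorter write stalls depends on path length and router delay, which is what the simulation study measures.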
Cite this article
Hsiao, HC., King, CT. An Application-Driven Study of Multicast Communication for Write Invalidation. The Journal of Supercomputing 18, 279–304 (2001). https://doi.org/10.1023/A:1008161716113