ABSTRACT
Latencies associated with memory accesses and interprocess communication are among the most difficult obstacles to building a practical massively parallel system. Two approaches to hiding latency have been proposed so far: prefetching and multi-threading. An instruction-level data-driven computer is an ideal test-bed for evaluating these latency-hiding methods, because prefetching and multi-threading are naturally implemented in such a machine as unfolding and as concurrent execution of multiple contexts. This paper evaluates latency-hiding methods on SIGMA-1, a dataflow supercomputer developed at the Electrotechnical Laboratory. The evaluation shows that these methods are effective at hiding static latencies but not dynamic latencies, and that concurrent execution of multiple contexts is more effective than prefetching.
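The trade-off the abstract describes, where compute from other contexts fills the cycles a processor would otherwise spend waiting on memory, can be sketched with a toy cycle-count model. All names and numbers below are illustrative assumptions, not figures from the SIGMA-1 evaluation:

```python
# Toy model of latency hiding via concurrent execution of multiple
# contexts, assuming a fixed memory latency and round-robin
# interleaving. Parameters are hypothetical, not from the paper.

def utilization(mem_latency, num_contexts, work_cycles=1):
    """Fraction of cycles spent on useful work when each context
    alternates `work_cycles` of compute with one memory access of
    `mem_latency` cycles, and contexts are interleaved."""
    # Time for one context's iteration if it ran alone:
    iteration = work_cycles + mem_latency
    # With enough contexts, compute from the other contexts fills
    # every memory-wait slot; utilization saturates at 1.0.
    busy = min(num_contexts * work_cycles, iteration)
    return busy / iteration

# One context: the processor idles during every 9-cycle access.
print(utilization(mem_latency=9, num_contexts=1))   # 0.1
# Ten contexts: the static latency is fully hidden.
print(utilization(mem_latency=9, num_contexts=10))  # 1.0
```

This model only captures *static* (fixed, predictable) latency; a dynamic latency that varies per access would leave unfillable stall slots, which is consistent with the paper's finding that these methods hide static but not dynamic latencies.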
Index Terms
- Empirical study of latency hiding on a fine-grain parallel processor