ABSTRACT
Latencies associated with memory accesses and interprocess communication are among the most difficult obstacles to building a practical massively parallel system. Two approaches to hiding latency have been proposed so far: prefetching and multi-threading. An instruction-level data-driven computer is an ideal test-bed for evaluating these latency-hiding methods, because prefetching and multi-threading are naturally implemented in such a machine as unfolding and as concurrent execution of multiple contexts. This paper evaluates latency-hiding methods on SIGMA-1, a dataflow supercomputer developed at the Electrotechnical Laboratory. The evaluation shows that these methods are effective at hiding static latencies but not dynamic latencies, and that concurrent execution of multiple contexts is more effective than prefetching.
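The trade-off the abstract describes, where compute from other contexts fills the cycles a processor would otherwise spend waiting on memory, can be sketched with a toy cycle-count model. All names and numbers below are illustrative assumptions, not figures from the SIGMA-1 evaluation:

```python
# Toy model of latency hiding via concurrent execution of multiple
# contexts, assuming a fixed memory latency and round-robin
# interleaving. Parameters are hypothetical, not from the paper.

def utilization(mem_latency, num_contexts, work_cycles=1):
    """Fraction of cycles spent on useful work when each context
    alternates `work_cycles` of compute with one memory access of
    `mem_latency` cycles, and contexts are interleaved."""
    # Time for one context's iteration if it ran alone:
    iteration = work_cycles + mem_latency
    # With enough contexts, compute from the other contexts fills
    # every memory-wait slot; utilization saturates at 1.0.
    busy = min(num_contexts * work_cycles, iteration)
    return busy / iteration

# One context: the processor idles during every 9-cycle access.
print(utilization(mem_latency=9, num_contexts=1))   # 0.1
# Ten contexts: the static latency is fully hidden.
print(utilization(mem_latency=9, num_contexts=10))  # 1.0
```

This model only captures *static* (fixed, predictable) latency; a dynamic latency that varies per access would leave unfillable stall slots, which is consistent with the paper's finding that these methods hide static but not dynamic latencies.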
Index Terms
- Empirical study of latency hiding on a fine-grain parallel processor