research-article

MiSAR: minimalistic synchronization accelerator with resource overflow management

Authors:
Ching-Kai Liang, Georgia Institute of Technology
Milos Prvulovic, Georgia Institute of Technology
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, Pages 414–426, https://doi.org/10.1145/2749469.2750396

Published: 13 June 2015

ABSTRACT

While numerous hardware synchronization mechanisms have been proposed, they either no longer function or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals focus on one type (barrier or lock) of synchronization, so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables.

This paper proposes MiSAR, a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables), and a novel overflow management unit (OMU) that dynamically manages its (very) limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between using hardware (MSA) and software synchronization implementations. This allows the MSA's resources to be used only for currently-active synchronization operations, providing significant performance benefits even when the number of synchronization variables used in the program is much larger than the MSA's resources. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization, and achieves a speedup of 1.43X over the software (pthreads) implementation.
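
As a rough illustration of the idea in the abstract above (hardware entries held only by currently active synchronization variables, with a software fallback once the entries run out), the following Python toy model may help. It is a sketch under stated assumptions, not the paper's MSA or OMU design: the class `ToySyncAccelerator`, its method names, and the "hw"/"sw"/"busy" return values are all hypothetical, and waiting threads, barriers, condition variables, and coherence are left out entirely.

```python
class ToySyncAccelerator:
    """Toy model (hypothetical): a tiny 'hardware' lock table with software fallback."""

    def __init__(self, capacity=2):
        # Number of hardware entries; 2 mirrors the per-tile figure in the abstract.
        self.capacity = capacity
        self.hw_owner = {}  # lock_id -> thread, for locks currently held via "hardware"
        self.sw_owner = {}  # lock_id -> thread, for locks currently held via "software"

    def acquire(self, lock_id, thread):
        """Return 'hw' or 'sw' on success, or 'busy' if the lock is already held."""
        if lock_id in self.hw_owner or lock_id in self.sw_owner:
            return "busy"
        if len(self.hw_owner) < self.capacity:
            # A free entry exists, so this lock is handled in "hardware".
            self.hw_owner[lock_id] = thread
            return "hw"
        # All entries are occupied by other active locks: overflow to "software".
        self.sw_owner[lock_id] = thread
        return "sw"

    def release(self, lock_id, thread):
        """Release the lock and free its entry so another lock can reuse it."""
        for table in (self.hw_owner, self.sw_owner):
            if table.get(lock_id) == thread:
                del table[lock_id]
                return
        raise ValueError("lock not held by this thread")
```

In this model, a program that touches 100 different locks one at a time stays on the "hw" path with only 2 entries, because an entry is occupied only while its lock is held. A third lock held at the same time as two others falls back to "sw" until an entry is released.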

Published in
ISCA '15: Proceedings of the 42nd Annual International Symposium on Computer Architecture, June 2015, 768 pages
ISBN: 9781450334020
DOI: 10.1145/2749469
General Chair: Debbie Marr (Intel)
Program Chair: David Albonesi (Cornell)

ACM SIGARCH Computer Architecture News, Volume 43, Issue 3S (ISCA'15), June 2015, 745 pages
ISSN: 0163-5964
DOI: 10.1145/2872887
Editor: Doug DeGroot
Copyright © 2015 ACM
Publisher: Association for Computing Machinery, New York, NY, United States