ABSTRACT
While numerous hardware synchronization mechanisms have been proposed, they either stop functioning or suffer great performance loss when their hardware resources are exceeded, or they add significant complexity and cost to handle such resource overflows. Additionally, prior hardware synchronization proposals each focus on one type of synchronization (barrier or lock), so several mechanisms are likely to be needed to support real applications, many of which use locks, barriers, and/or condition variables.
This paper proposes MiSAR, which combines a minimalistic synchronization accelerator (MSA) that supports all three commonly used types of synchronization (locks, barriers, and condition variables) with a novel overflow management unit (OMU) that dynamically manages the MSA's very limited hardware synchronization resources. The OMU allows safe and efficient dynamic transitions between hardware (MSA) and software synchronization implementations. As a result, the MSA's resources are used only for currently active synchronization operations, which provides significant performance benefits even when the number of synchronization variables used in the program is much larger than the number of MSA entries. Because it allows a safe transition between hardware and software synchronization, the OMU also facilitates thread suspend/resume, migration, and other thread-management activities. Finally, the MSA/OMU combination decouples the instruction set support (how the program invokes hardware-supported synchronization) from the actual implementation of the accelerator, allowing different accelerators (or even wholesale removal of the accelerator) in the future without changes to OMU-compatible application or system code. We show that, even with only 2 MSA entries in each tile, the MSA/OMU combination on average performs within 3% of ideal (zero-latency) synchronization and achieves a speedup of 1.43x over the software (pthreads) implementation.
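The idea of falling back to software when hardware entries run out can be illustrated with a toy model. The sketch below is not the paper's design: the class `ToyOverflowManager`, its counters, and the rule that a variable with in-flight software operations stays in software are all assumptions made for illustration. It only shows how a small table (here 2 entries) might be reserved for currently active variables while the rest overflow to software.

```python
from dataclasses import dataclass, field


@dataclass
class ToyOverflowManager:
    """Hypothetical model: picks hardware or software per synchronization variable."""
    capacity: int = 2                              # assumed number of hardware entries
    hw_active: dict = field(default_factory=dict)  # addr -> operations in flight in hardware
    sw_active: dict = field(default_factory=dict)  # addr -> operations in flight in software

    def begin(self, addr):
        """Start an operation on addr and return 'hw' or 'sw'."""
        if addr in self.hw_active:
            self.hw_active[addr] += 1
            return "hw"
        # Assumed safety rule: a variable never mixes hardware and software
        # operations, so it may claim an entry only when idle in software.
        if addr not in self.sw_active and len(self.hw_active) < self.capacity:
            self.hw_active[addr] = 1
            return "hw"
        self.sw_active[addr] = self.sw_active.get(addr, 0) + 1
        return "sw"

    def end(self, addr, mode):
        """Finish an operation; an idle variable releases its slot."""
        table = self.hw_active if mode == "hw" else self.sw_active
        table[addr] -= 1
        if table[addr] == 0:
            del table[addr]


omu = ToyOverflowManager(capacity=2)
modes = [omu.begin(a) for a in ("A", "B", "C")]  # "C" finds no free entry
omu.end("A", modes[0])                           # frees one hardware entry
later = omu.begin("C")                           # "C" is still busy in software
omu.end("C", modes[2])
omu.end("C", later)
again = omu.begin("C")                           # "C" is idle, so it can use hardware
print(modes, later, again)
```

In this model the third variable overflows to software, stays there while any of its software operations are in flight, and moves to hardware only once it is idle and an entry is free.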
MiSAR: minimalistic synchronization accelerator with resource overflow management (ISCA '15)