ABSTRACT
Improvements in parallel computing hardware usually mean increasing the resources available to an application, such as the number of computing cores and the amount of memory. In shared-memory computers, this growth is constrained by the cache-coherency protocol, whose overhead rises with system size and limits the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available to an application by leveraging free memory in other computers of the cluster.
Our proposal is based on the observation that many applications benefit from having more memory but do not require more computing cores, which reduces the demands placed on cache coherency and allows a simpler implementation with better scalability.
Simulation results show that, when additional mechanisms are used to hide remote-memory latency, the execution time of applications using our proposal is similar to their execution time on a computer populated with enough local memory, validating the feasibility of our approach. We are currently building a prototype that implements these ideas.
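As an illustrative aside (not part of the paper itself), the importance of the latency-hiding mechanisms mentioned above can be sketched with a simple average-memory-access-time model. All latencies, remote-access fractions, and prefetch hit rates below are hypothetical placeholders, not measurements from the paper:

```python
# Back-of-envelope model: average memory access time (AMAT) when part of an
# application's working set resides in another node's memory.
# All numbers are illustrative assumptions.

def amat_ns(local_ns, remote_ns, remote_fraction, prefetch_hit_rate):
    """Average access time when a fraction of accesses go to remote memory.

    Latency hiding (e.g. prefetching) masks the remote latency for
    prefetch_hit_rate of the remote accesses, which then cost roughly
    the same as a local access.
    """
    hidden = remote_fraction * prefetch_hit_rate           # remote, latency hidden
    exposed = remote_fraction * (1.0 - prefetch_hit_rate)  # remote, latency exposed
    local = 1.0 - remote_fraction
    return (local + hidden) * local_ns + exposed * remote_ns

# Without latency hiding, 30% remote accesses dominate the access time...
no_hiding = amat_ns(local_ns=60.0, remote_ns=1000.0,
                    remote_fraction=0.3, prefetch_hit_rate=0.0)
# ...while effective latency hiding brings AMAT close to the all-local case.
with_hiding = amat_ns(local_ns=60.0, remote_ns=1000.0,
                      remote_fraction=0.3, prefetch_hit_rate=0.9)

print(round(no_hiding, 1), round(with_hiding, 1))  # 342.0 vs. 88.2
```

The gap between the two figures is why the abstract conditions its result on latency-hiding mechanisms: raw remote accesses are an order of magnitude slower than local ones, so only when most of that latency is overlapped does performance approach the enough-local-memory baseline.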
Title: A practical way to extend shared memory support beyond a motherboard at low cost