ABSTRACT
Improvements in parallel computing hardware usually mean increasing the resources available to an application, such as the number of computing cores and the amount of memory. In shared-memory computers, this growth is constrained by the cache-coherency protocol, whose overhead rises with system size and limits the scalability of the final system. In this paper we propose an efficient and cost-effective way to increase the memory available to an application by leveraging free memory in other computers of the cluster.
Our proposal is based on the observation that many applications benefit from having more memory but do not require more computing cores, which reduces the demands placed on cache coherency and allows a simpler implementation with better scalability.
Simulation results show that, when additional mechanisms are used to hide remote-memory latency, the execution time of applications using our proposal is similar to their execution time on a computer populated with enough local memory, validating the feasibility of our approach. We are currently building a prototype that implements these ideas.
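As an illustrative aside (not part of the paper itself), the importance of the latency-hiding mechanisms mentioned above can be sketched with a simple average-memory-access-time model. All latencies, remote-access fractions, and prefetch hit rates below are hypothetical placeholders, not measurements from the paper:

```python
# Back-of-envelope model: average memory access time (AMAT) when part of an
# application's working set resides in another node's memory.
# All numbers are illustrative assumptions.

def amat_ns(local_ns, remote_ns, remote_fraction, prefetch_hit_rate):
    """Average access time when a fraction of accesses go to remote memory.

    Latency hiding (e.g. prefetching) masks the remote latency for
    prefetch_hit_rate of the remote accesses, which then cost roughly
    the same as a local access.
    """
    hidden = remote_fraction * prefetch_hit_rate           # remote, latency hidden
    exposed = remote_fraction * (1.0 - prefetch_hit_rate)  # remote, latency exposed
    local = 1.0 - remote_fraction
    return (local + hidden) * local_ns + exposed * remote_ns

# Without latency hiding, 30% remote accesses dominate the access time...
no_hiding = amat_ns(local_ns=60.0, remote_ns=1000.0,
                    remote_fraction=0.3, prefetch_hit_rate=0.0)
# ...while effective latency hiding brings AMAT close to the all-local case.
with_hiding = amat_ns(local_ns=60.0, remote_ns=1000.0,
                      remote_fraction=0.3, prefetch_hit_rate=0.9)

print(round(no_hiding, 1), round(with_hiding, 1))  # 342.0 vs. 88.2
```

The gap between the two figures is why the abstract conditions its result on latency-hiding mechanisms: raw remote accesses are an order of magnitude slower than local ones, so only when most of that latency is overlapped does performance approach the enough-local-memory baseline.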
Title: A practical way to extend shared memory support beyond a motherboard at low cost