ABSTRACT
Recently, UPMEM has introduced the first commercially available processing in memory (PIM) platform. Its key feature is DRAM chips with built-in RISC CPUs for in-memory data processing. Naturally, this has sparked interest in the research community, which was previously limited to PIM simulators and custom FPGA prototypes. One result of this is the PrIM benchmark suite, which combines an in-depth analysis of PIM performance with benchmarks that measure the speedup of PIM over processing on conventional CPUs and GPUs [10]. However, the current generation of UPMEM PIM faces limitations such as memory interleaving, and as such does not provide true in-memory computing. Applications must store data in DRAM and transfer it to and from UPMEM modules for processing; from this perspective, the modules behave just like computational offloading engines. This paper examines the ramifications of treating them as such in comparative performance benchmarks. By extending the PrIM suite to address the challenges that computational offloading benchmarks face, we show that such a full-system perspective can drastically alter offloading recommendations, with 9 of 11 previously UPMEM-friendly benchmarks now performing best on a conventional server CPU.
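To illustrate the distinction the abstract draws, the following Python sketch contrasts a kernel-only speedup with a full-system speedup that also charges the accelerator for moving data to and from it. It is a minimal sketch under stated assumptions: the class, the function names, and the timing values are hypothetical and are not taken from the PrIM suite, the UPMEM SDK, or the paper's measurements.

```python
from dataclasses import dataclass


@dataclass
class OffloadTiming:
    """Hypothetical timings in seconds for one benchmark run (not measured data)."""
    cpu_total: float       # complete run on the conventional CPU
    copy_to_device: float  # host DRAM -> accelerator memory
    kernel: float          # computation on the accelerator
    copy_to_host: float    # accelerator memory -> host DRAM


def kernel_only_speedup(t: OffloadTiming) -> float:
    """Speedup when only the accelerator's compute phase is counted."""
    return t.cpu_total / t.kernel


def full_system_speedup(t: OffloadTiming) -> float:
    """Speedup when data transfers are counted as part of the offloaded run."""
    offload_total = t.copy_to_device + t.kernel + t.copy_to_host
    return t.cpu_total / offload_total


def recommend(t: OffloadTiming) -> str:
    """Recommend offloading only if it wins end to end."""
    return "offload" if full_system_speedup(t) > 1.0 else "cpu"


# Illustrative numbers only: the kernel is 4x faster than the CPU,
# but the transfers make the offloaded run slower overall.
example = OffloadTiming(cpu_total=4.0, copy_to_device=3.0,
                        kernel=1.0, copy_to_host=2.0)
```

With these assumed values, the kernel-only view favors offloading while the end-to-end view recommends the CPU, which is the kind of reversal a full-system perspective can produce.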
- D. H. Bailey, E. Barszcz, J. T. Barton, D. S. Browning, R. L. Carter, L. Dagum, R. A. Fatoohi, P. O. Frederickson, T. A. Lasinski, R. S. Schreiber, H. D. Simon, V. Venkatakrishnan, and S. K. Weeratunga. 1991. The NAS Parallel Benchmarks---summary and Preliminary Results. In Proceedings of the 1991 ACM/IEEE Conference on Supercomputing (Albuquerque, New Mexico, USA) (Supercomputing '91). Association for Computing Machinery, New York, NY, USA, 158--165.
- Alexander Baumstark, Muhammad Attahir Jibril, and Kai-Uwe Sattler. 2023. Accelerating Large Table Scan using Processing-In-Memory Technology. In BTW 2023. Gesellschaft für Informatik e.V., Bonn, 797--814.
- Stefano Corda, Madhurya Kumaraswamy, Ahsan Javed Awan, Roel Jordans, Akash Kumar, and Henk Corporaal. 2021. NMPO: Near-Memory Computing Profiling and Offloading. In 2021 24th Euromicro Conference on Digital System Design (DSD). 259--267.
- Stefano Corda, Gagandeep Singh, Ahsan Javed Awan, Roel Jordans, and Henk Corporaal. 2019. Platform Independent Software Analysis for Near Memory Computing. In 2019 22nd Euromicro Conference on Digital System Design (DSD). 606--609.
- Andrew Davison. 1995. Twelve Ways to Fool the Masses When Giving Performance Results on Parallel Computers. Supercomputing Review (August 1995), 54--55.
- Fabrice Devaux. 2019. The true Processing In Memory accelerator. In 2019 IEEE Hot Chips 31 Symposium (HCS). 1--24.
- François Duhem, Fabrice Muller, and Philippe Lorenzini. 2011. FaRM: Fast Reconfiguration Manager for Reducing Reconfiguration Time Overhead on FPGA. In Reconfigurable Computing: Architectures, Tools and Applications, Andreas Koch, Ram Krishnamurthy, John McAllister, Roger Woods, and Tarek El-Ghazawi (Eds.). Springer Berlin Heidelberg, Berlin, Heidelberg, 253--260.
- Scott Grauer-Gray, Lifan Xu, Robert Searles, Sudhee Ayalasomayajula, and John Cavazos. 2012. Auto-tuning a high-level language targeted to GPU codes. In 2012 Innovative Parallel Computing (InPar). 1--10.
- Khronos OpenCL Working Group. 2023. The OpenCL specification version 3.0.14. (2023). https://registry.khronos.org/OpenCL/specs/3.0-unified/pdf/OpenCL_API.pdf
- Juan Gómez-Luna, Izzat El Hajj, Ivan Fernandez, Christina Giannoula, Geraldo F. Oliveira, and Onur Mutlu. 2022. Benchmarking a New Paradigm: Experimental Analysis and Characterization of a Real Processing-in-Memory System. IEEE Access 10 (2022), 52565--52608.
- Torsten Hoefler and Roberto Belli. 2015. Scientific Benchmarking of Parallel Computing Systems: Twelve Ways to Tell the Masses When Reporting Performance Results. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (Austin, Texas) (SC '15). Association for Computing Machinery, New York, NY, USA, Article 73, 12 pages.
- Cheol-Ho Hong, Ivor Spence, and Dimitrios S. Nikolopoulos. 2017. GPU Virtualization and Scheduling Methods: A Comprehensive Survey. ACM Comput. Surv. 50, 3, Article 35 (June 2017), 37 pages.
- Nina Ihde, Paula Marten, Ahmed Eleliemy, Gabrielle Poerwawinata, Pedro Silva, Ilin Tolovski, Florina M. Ciorba, and Tilmann Rabl. 2022. A Survey of Big Data, High Performance Computing, and Machine Learning Benchmarks. In Performance Evaluation and Benchmarking, Raghunath Nambiar and Meikel Poess (Eds.). Springer International Publishing, Cham, 98--118.
- Donghun Lee, Andrew Chang, Minseon Ahn, Jongmin Gim, Jungmin Kim, Jaemin Jung, Kang-Woo Choi, Vincent Pham, Oliver Rebholz, Krishna T. Malladi, and Yang-Seok Ki. 2020. Optimizing Data Movement with Near-Memory Acceleration of In-memory DBMS. In Proceedings of the 23rd International Conference on Extending Database Technology, EDBT 2020, Copenhagen, Denmark, March 30 - April 02, 2020, Angela Bonifati, Yongluan Zhou, Marcos Antonio Vaz Salles, Alexander Böhm, Dan Olteanu, George H. L. Fletcher, Arijit Khan, and Bin Yang (Eds.). OpenProceedings.org, 371--374.
- Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal, and Pradeep Dubey. 2010. Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU. SIGARCH Comput. Archit. News 38, 3 (June 2010), 451--460.
- Kyprianos Papadimitriou, Apostolos Dollas, and Scott Hauck. 2011. Performance of Partial Reconfiguration in FPGA Systems: A Survey and a Cost Model. ACM Trans. Reconfigurable Technol. Syst. 4, 4, Article 36 (December 2011), 24 pages.
- Albert Reuther, Peter Michaleas, Michael Jones, Vijay Gadepally, Siddharth Samsi, and Jeremy Kepner. 2019. Survey and Benchmarking of Machine Learning Accelerators. In 2019 IEEE High Performance Extreme Computing Conference (HPEC). 1--9.
- Robert Schmid, Max Plauth, Lukas Wenzel, Felix Eberhardt, and Andreas Polze. 2020. Accessible Near-Storage Computing with FPGAs. In Proceedings of the Fifteenth European Conference on Computer Systems (Heraklion, Greece) (EuroSys '20). Association for Computing Machinery, New York, NY, USA, Article 28, 12 pages.
- Janet Tseng, Ren Wang, James Tsai, Yipeng Wang, and Tsung-Yuan Charlie Tai. 2017. Accelerating Open VSwitch with Integrated GPU. In Proceedings of the Workshop on Kernel-Bypass Networks (Los Angeles, CA, USA) (KBNets '17). Association for Computing Machinery, New York, NY, USA, 7--12.
- Yash Ukidave, Fanny Nina Paravecino, Leiming Yu, Charu Kalra, Amir Momeni, Zhongliang Chen, Nick Materise, Brett Daley, Perhaad Mistry, and David Kaeli. 2015. NUPAR: A Benchmark Suite for Modern GPU Architectures. In Proceedings of the 6th ACM/SPEC International Conference on Performance Engineering (Austin, Texas, USA) (ICPE '15). Association for Computing Machinery, New York, NY, USA, 253--264.
- UPMEM. 2023. UPMEM SDK. https://sdk.upmem.com/ version 2023.1.0.
Index Terms
- A Full-System Perspective on UPMEM Performance