DOI: 10.1145/3437801.3441595

ShadowVM: accelerating data plane for data analytics with bare metal CPUs and GPUs

Published: 17 February 2021

Abstract

With the development of the big data ecosystem, large-scale data analytics has become increasingly prevalent in recent years. Systems such as Apache Spark provide a flexible approach to scalable processing over massive data sets. However, they are not designed to handle compute-intensive workloads, owing to the restrictions of the JVM runtime. In contrast, the GPU has become the de facto accelerator for graphics rendering and deep learning. Nevertheless, the current architecture of big data systems makes it difficult to take advantage of GPUs and other accelerators.
It is time to break down this obstacle by changing the fundamental architecture. To integrate accelerators efficiently, we decouple the control plane and the data plane within big data systems via action shadowing. The control plane keeps logical information so that it fits well with host systems such as Spark, while the data plane holds the data and performs execution on bare-metal CPUs and GPUs. Under this decoupled architecture, both planes can adopt the approaches best suited to them without breaking existing mechanisms. Based on this idea, we implement an accelerated data plane named ShadowVM. In our experiments on the SSB benchmark, ShadowVM accelerates JVM-based Spark by up to 14.7×. Furthermore, ShadowVM also outperforms the GPU-only approach by adopting mixed CPU-GPU execution.
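As a minimal illustration of this decoupling, the sketch below (not ShadowVM's actual API) shows a control plane that records lightweight "shadow" actions describing only the logical plan, while a separate data plane holds the data and executes that plan on a selectable backend. All names here (ShadowAction, ControlPlane, DataPlane, the backend table) are hypothetical; the real system ties the control plane into Spark and runs a native CPU-GPU data plane.

```python
# Illustrative sketch only: a toy "action shadowing" layout, not ShadowVM's API.
# The control plane builds a logical plan of shadow actions (no data attached);
# the data plane owns the data and executes the plan on a selected backend.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ShadowAction:
    """Logical description of one operator; carries no data."""
    name: str      # e.g. "filter", "sum"
    params: dict   # operator parameters, e.g. {"predicate": ...}


class ControlPlane:
    """Keeps only logical information, as a host system like Spark would."""
    def __init__(self) -> None:
        self.plan: List[ShadowAction] = []

    def record(self, name: str, **params) -> "ControlPlane":
        """Append a shadow action to the logical plan; no data is touched."""
        self.plan.append(ShadowAction(name, params))
        return self  # allow chaining, Spark-style


class DataPlane:
    """Holds the data and executes shadow actions on a chosen backend."""
    def __init__(self, data: List[float], backend: str = "cpu") -> None:
        self.data = data
        # Hypothetical backend table: a real data plane would dispatch to
        # native CPU kernels or CUDA kernels instead of Python callables.
        backends: Dict[str, Dict[str, Callable]] = {
            "cpu": {
                "filter": lambda d, predicate: [x for x in d if predicate(x)],
                "sum": lambda d: sum(d),
            },
            # A "gpu" entry could mirror the same operators with GPU kernels.
        }
        self.ops = backends[backend]

    def execute(self, plan: List[ShadowAction]):
        """Run the logical plan over the locally held data."""
        result = self.data
        for action in plan:
            result = self.ops[action.name](result, **action.params)
        return result


if __name__ == "__main__":
    # Control plane: build the logical plan only (no data attached).
    plan = ControlPlane().record("filter", predicate=lambda x: x > 2.0).record("sum").plan
    # Data plane: owns the data and runs the plan on the CPU backend.
    dp = DataPlane([1.0, 2.0, 3.0, 4.0], backend="cpu")
    print(dp.execute(plan))  # -> 7.0
```

In this toy split, switching the data plane from the CPU operator table to a GPU-backed one never touches the control plane, which mirrors the paper's claim that mixed CPU-GPU execution can be adopted without breaking the host system's existing mechanisms.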


Cited By

  • Neos: A NVMe-GPUs Direct Vector Service Buffer in User Space. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3767-3781. DOI: 10.1109/ICDE60146.2024.00289. Online publication date: 13 May 2024.
  • Optimizing Real-Time Data Processing in Resource-Constrained Environments: A Spark and GPU-Driven Workflow for Large Language Models. 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), 116-121. DOI: 10.1109/FMLDS63805.2024.00032. Online publication date: 20 November 2024.
  • Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware. Proceedings of the VLDB Endowment, 15(13), 3869-3882. DOI: 10.14778/3565838.3565842. Online publication date: 1 September 2022.


Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN: 9781450382946
DOI: 10.1145/3437801


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU
  2. big data processing
  3. heterogeneous system

Qualifiers

  • Research-article


Conference

PPoPP '21

Acceptance Rates

PPoPP '21 paper acceptance rate: 31 of 150 submissions, 21%.
Overall acceptance rate: 230 of 1,014 submissions, 23%.


