DOI: 10.1145/3437801.3441595

ShadowVM: accelerating data plane for data analytics with bare metal CPUs and GPUs

Published: 17 February 2021

Abstract

With the development of the big data ecosystem, large-scale data analytics has become increasingly prevalent in recent years. Systems such as Apache Spark provide a flexible approach to scalable processing over massive data sets. However, they are not designed to handle compute-intensive workloads, owing to the restrictions of the JVM runtime. In contrast, the GPU has become the de facto accelerator for graphics rendering and deep learning. Nevertheless, the current architecture of big data systems makes it difficult to take advantage of GPUs and other accelerators.
It is time to break down this obstacle by changing the fundamental architecture. To integrate accelerators efficiently, we decouple the control plane and the data plane within big data systems via action shadowing. The control plane keeps logical information so that it fits well with host systems such as Spark, while the data plane holds the data and performs execution on bare-metal CPUs and GPUs. Under this decoupled architecture, both planes can adopt the approaches best suited to them without breaking existing mechanisms. Based on this idea, we implement an accelerated data plane named ShadowVM. In our experiments on the SSB benchmark, ShadowVM accelerates JVM-based Spark by up to 14.7×. Furthermore, ShadowVM also outperforms the GPU-only approach by adopting mixed CPU-GPU execution.
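As a minimal illustration of this decoupling, the sketch below (not ShadowVM's actual API) shows a control plane that records lightweight "shadow" actions describing only the logical plan, while a separate data plane holds the data and executes that plan on a selectable backend. All names here (ShadowAction, ControlPlane, DataPlane, the backend table) are hypothetical; the real system ties the control plane into Spark and runs a native CPU-GPU data plane.

```python
# Illustrative sketch only: a toy "action shadowing" layout, not ShadowVM's API.
# The control plane builds a logical plan of shadow actions (no data attached);
# the data plane owns the data and executes the plan on a selected backend.

from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ShadowAction:
    """Logical description of one operator; carries no data."""
    name: str      # e.g. "filter", "sum"
    params: dict   # operator parameters, e.g. {"predicate": ...}


class ControlPlane:
    """Keeps only logical information, as a host system like Spark would."""
    def __init__(self) -> None:
        self.plan: List[ShadowAction] = []

    def record(self, name: str, **params) -> "ControlPlane":
        """Append a shadow action to the logical plan; no data is touched."""
        self.plan.append(ShadowAction(name, params))
        return self  # allow chaining, Spark-style


class DataPlane:
    """Holds the data and executes shadow actions on a chosen backend."""
    def __init__(self, data: List[float], backend: str = "cpu") -> None:
        self.data = data
        # Hypothetical backend table: a real data plane would dispatch to
        # native CPU kernels or CUDA kernels instead of Python callables.
        backends: Dict[str, Dict[str, Callable]] = {
            "cpu": {
                "filter": lambda d, predicate: [x for x in d if predicate(x)],
                "sum": lambda d: sum(d),
            },
            # A "gpu" entry could mirror the same operators with GPU kernels.
        }
        self.ops = backends[backend]

    def execute(self, plan: List[ShadowAction]):
        """Run the logical plan over the locally held data."""
        result = self.data
        for action in plan:
            result = self.ops[action.name](result, **action.params)
        return result


if __name__ == "__main__":
    # Control plane: build the logical plan only (no data attached).
    plan = ControlPlane().record("filter", predicate=lambda x: x > 2.0).record("sum").plan
    # Data plane: owns the data and runs the plan on the CPU backend.
    dp = DataPlane([1.0, 2.0, 3.0, 4.0], backend="cpu")
    print(dp.execute(plan))  # -> 7.0
```

In this toy split, switching the data plane from the CPU operator table to a GPU-backed one never touches the control plane, which mirrors the paper's claim that mixed CPU-GPU execution can be adopted without breaking the host system's existing mechanisms.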


Cited By

  • Neos: A NVMe-GPUs Direct Vector Service Buffer in User Space. 2024 IEEE 40th International Conference on Data Engineering (ICDE), 3767-3781. DOI: 10.1109/ICDE60146.2024.00289. Online publication date: 13 May 2024.
  • Optimizing Real-Time Data Processing in Resource-Constrained Environments: A Spark and GPU-Driven Workflow for Large Language Models. 2024 IEEE International Conference on Future Machine Learning and Data Science (FMLDS), 116-121. DOI: 10.1109/FMLDS63805.2024.00032. Online publication date: 20 November 2024.
  • Enabling Transparent Acceleration of Big Data Frameworks Using Heterogeneous Hardware. Proceedings of the VLDB Endowment, 15(13), 3869-3882. DOI: 10.14778/3565838.3565842. Online publication date: 1 September 2022.


Published In

PPoPP '21: Proceedings of the 26th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2021
507 pages
ISBN: 9781450382946
DOI: 10.1145/3437801


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. GPU
  2. big data processing
  3. heterogeneous system

Qualifiers

  • Research-article


Conference

PPoPP '21

Acceptance Rates

PPoPP '21 paper acceptance rate: 31 of 150 submissions, 21%.
Overall acceptance rate: 230 of 1,014 submissions, 23%.


