PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors

Published in: Journal of Signal Processing Systems

Abstract

Transformer-based models such as BERT have achieved state-of-the-art accuracy on natural language processing (NLP) tasks. Nevertheless, these models are extremely large and exhibit low inference throughput, which is especially challenging for edge inference given the limited memory and computational power of edge devices. We therefore aim to improve the edge-inference throughput of transformer-based models, which is critical for real-life applications that process multiple independent tasks concurrently on resource-constrained devices to provide a better user experience. Pipelining a deep neural network (DNN) across heterogeneous processing elements has been shown to significantly improve throughput. However, existing deep learning (DL) frameworks do not support pipelined inference, and previous works dedicated to pipelining lack full support for BERT models. In this work, we propose PipeBERT, a heterogeneous pipelining framework built on TVM that allows BERT models to utilize all heterogeneous resources in the ARM big.LITTLE architecture, which is common in modern edge devices. PipeBERT is the first pipelining framework that fully supports BERT operations, and it improves overall throughput by employing the heterogeneous ARM CPU clusters concurrently. PipeBERT splits a BERT model into subgraphs and maps each subgraph onto either the big or the LITTLE cluster. To efficiently find pipeline configurations that balance the workload between the heterogeneous clusters, we propose an improved binary search algorithm that uses hardware performance-metric feedback to find the best split configuration faster. Our search algorithm finds the best split point on average 1.2x and 165x faster than baseline binary search and exhaustive search, respectively. On the HiKey970 embedded platform, PipeBERT achieves on average 48.6% higher inference throughput for BERT models than running on the four big cores (i.e., the ARM big CPU cluster), and an average 61% lower energy-delay product (EDP) than the best homogeneous configuration.
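The core idea behind the split search can be illustrated with a small sketch. In a two-stage pipeline, throughput is limited by the slower stage, so the goal is a split index k at which the big-cluster stage (layers 0..k) and the LITTLE-cluster stage (layers k..n) take roughly equal time; since one stage's latency grows with k while the other's shrinks, the bottleneck max(t_big, t_little) is unimodal in k and binary search applies. The Python sketch below is a minimal illustration under these assumptions, not the authors' implementation (it omits PipeBERT's improvements to the baseline binary search); measure_latency is a hypothetical profiling hook standing in for the hardware performance-metric feedback described in the abstract.

    # A minimal sketch (not the authors' implementation) of searching for a
    # balanced split of a BERT model across the big and LITTLE clusters.
    # `measure_latency` is a hypothetical profiling hook; in PipeBERT this
    # feedback comes from hardware performance metrics on the device.
    from typing import Callable

    def find_split(n_layers: int,
                   measure_latency: Callable[[str, range], float]) -> int:
        """Binary-search the layer index k that balances the two stages.

        The big cluster runs layers [0, k) and the LITTLE cluster runs
        layers [k, n_layers). Stage-1 latency grows with k while stage-2
        latency shrinks, so the bottleneck max(t_big, t_little) is
        unimodal in k and binary search converges on the balance point.
        """
        lo, hi = 1, n_layers - 1
        best_k, best_bottleneck = lo, float("inf")
        while lo <= hi:
            k = (lo + hi) // 2
            t_big = measure_latency("big", range(0, k))               # stage 1
            t_little = measure_latency("little", range(k, n_layers))  # stage 2
            bottleneck = max(t_big, t_little)
            if bottleneck < best_bottleneck:
                best_k, best_bottleneck = k, bottleneck
            if t_big < t_little:
                lo = k + 1  # big cluster is underloaded: give it more layers
            else:
                hi = k - 1  # big cluster is the bottleneck: shed layers
        return best_k

For BERT-base, for instance, this would search over its 12 encoder layers, e.g. find_split(12, profiler), where the profiler deploys each candidate subgraph pair (e.g., via TVM) and reports the measured stage latencies. For context, the energy-delay product reported above is the standard metric EDP = energy × execution time, so a lower value indicates a better energy-performance trade-off.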


Notes

  1. We compare with Pipe-all because it is the only open-source pipelining framework.


Funding

The author(s) received funding from Huawei Canada.

Author information


Contributions

Hung-Yang, Dr. Seyyed Hasan, and Prof. Brett developed the methodology and planned the experiments. Hung-Yang performed the experimental implementation, and Cheng designed the search algorithm. Hung-Yang and Dr. Seyyed Hasan wrote the manuscript, and all authors provided critical feedback that shaped the research, analysis, and manuscript.

Corresponding author

Correspondence to Hung-Yang Chang.

Ethics declarations

Conflict of Interest

We declare that we do not have any potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chang, HY., Mozafari, S.H., Chen, C. et al. PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors. J Sign Process Syst 95, 877–894 (2023). https://doi.org/10.1007/s11265-022-01814-y

