PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors

Published in: Journal of Signal Processing Systems

Abstract

Transformer-based models such as BERT have achieved state-of-the-art accuracy on natural language processing (NLP) tasks. Nevertheless, these models are extremely large and exhibit low inference throughput, which is especially challenging for edge inference given the limited memory and computational power of edge devices. We therefore aim to improve the edge-inference throughput of transformer-based models, which is critical for real-life applications that process multiple independent tasks concurrently on resource-constrained devices to provide a better user experience. Pipelining a deep neural network (DNN) across heterogeneous processing elements has been shown to significantly improve throughput. However, existing deep learning (DL) frameworks do not support pipelined inference, and previous works dedicated to pipelining lack full support for BERT models. In this work, we propose PipeBERT, a heterogeneous pipelining framework built on TVM that allows BERT models to utilize all heterogeneous resources in the ARM big.LITTLE architecture, which is common in modern edge devices. PipeBERT is the first pipelining framework that fully supports BERT operations, and it improves overall throughput by employing the heterogeneous ARM CPU clusters concurrently. PipeBERT splits a BERT model into subgraphs and maps each subgraph onto either the big or the LITTLE cluster. To efficiently find pipeline configurations that balance the workload between the heterogeneous clusters, we propose an improved binary search algorithm that uses hardware performance-metric feedback to find the best split configuration faster. Our search algorithm finds the best split point on average 1.2x and 165x faster than baseline binary search and exhaustive search, respectively. On the HiKey970 embedded platform, PipeBERT achieves on average 48.6% higher inference throughput for BERT models than running on the four big cores (i.e., the ARM big CPU cluster), and an average 61% lower energy-delay product (EDP) than the best homogeneous configuration.
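The core idea behind the split search can be illustrated with a small sketch. In a two-stage pipeline, throughput is limited by the slower stage, so the goal is a split index k at which the big-cluster stage (layers 0..k) and the LITTLE-cluster stage (layers k..n) take roughly equal time; since one stage's latency grows with k while the other's shrinks, the bottleneck max(t_big, t_little) is unimodal in k and binary search applies. The Python sketch below is a minimal illustration under these assumptions, not the authors' implementation (it omits PipeBERT's improvements to the baseline binary search); measure_latency is a hypothetical profiling hook standing in for the hardware performance-metric feedback described in the abstract.

    # A minimal sketch (not the authors' implementation) of searching for a
    # balanced split of a BERT model across the big and LITTLE clusters.
    # `measure_latency` is a hypothetical profiling hook; in PipeBERT this
    # feedback comes from hardware performance metrics on the device.
    from typing import Callable

    def find_split(n_layers: int,
                   measure_latency: Callable[[str, range], float]) -> int:
        """Binary-search the layer index k that balances the two stages.

        The big cluster runs layers [0, k) and the LITTLE cluster runs
        layers [k, n_layers). Stage-1 latency grows with k while stage-2
        latency shrinks, so the bottleneck max(t_big, t_little) is
        unimodal in k and binary search converges on the balance point.
        """
        lo, hi = 1, n_layers - 1
        best_k, best_bottleneck = lo, float("inf")
        while lo <= hi:
            k = (lo + hi) // 2
            t_big = measure_latency("big", range(0, k))               # stage 1
            t_little = measure_latency("little", range(k, n_layers))  # stage 2
            bottleneck = max(t_big, t_little)
            if bottleneck < best_bottleneck:
                best_k, best_bottleneck = k, bottleneck
            if t_big < t_little:
                lo = k + 1  # big cluster is underloaded: give it more layers
            else:
                hi = k - 1  # big cluster is the bottleneck: shed layers
        return best_k

For BERT-base, for instance, this would search over its 12 encoder layers, e.g. find_split(12, profiler), where the profiler deploys each candidate subgraph pair (e.g., via TVM) and reports the measured stage latencies. For context, the energy-delay product reported above is the standard metric EDP = energy × execution time, so a lower value indicates a better energy-performance trade-off.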


Notes

  1. We compare with Pipe-all because it is the only open-source pipelining framework.


Funding

The author(s) received funding from Huawei Canada.

Author information


Contributions

Hung-Yang, Dr. Seyyed Hasan, and Prof. Brett developed the methodology and planned the experiments. Hung-Yang performed the experimental implementation, and Cheng designed the search algorithm. Hung-Yang and Dr. Seyyed Hasan wrote the manuscript, and all authors provided critical feedback that shaped the research, analysis, and manuscript.

Corresponding author

Correspondence to Hung-Yang Chang.

Ethics declarations

Conflict of Interest

We declare that we do not have any potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chang, HY., Mozafari, S.H., Chen, C. et al. PipeBERT: High-throughput BERT Inference for ARM Big.LITTLE Multi-core Processors. J Sign Process Syst 95, 877–894 (2023). https://doi.org/10.1007/s11265-022-01814-y

