Abstract
The transformer is one of the most important algorithms in Natural Language Processing (NLP) and has recently been widely adopted in computer vision. Because of its huge computational requirements, current work on transformer acceleration focuses mainly on data-center GPUs, far away from signal sources such as voice and video. The Digital Signal Processor (DSP) is a traditional signal-processing device that is typically deployed at the edge, so deploying deep learning models on edge devices such as DSPs can effectively reduce the processing time of the whole task. However, deploying transformer models efficiently on a DSP poses several challenges. First, the transformer is highly compute-intensive for a DSP. Second, DSPs lack efficient transformer operator libraries. In addition, the variable length of input sequences prevents optimizations such as batching from working well. To address these challenges, we propose a DSP-accelerated transformer inference engine consisting of three components: an efficient transformer operator library built on a very long vector and Very Long Instruction Word (VLIW) architecture; a memory optimization strategy that manages the large volume of intermediate results and alleviates the data traffic caused by insufficient on-chip memory; and a sequence warp method that packs variable-length sequences into one long sequence using a sliding window and a greedy algorithm. Experimental results show that the proposed DSP transformer engine achieves performance comparable to a mainstream NVIDIA GPU, even though the DSP's memory bandwidth is only 1/20 of the GPU's.
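To make the sequence warp idea concrete, the sketch below shows one way to greedily pack variable-length sequences into fixed-capacity windows so that batched execution wastes less compute on padding tokens. This is a minimal, hypothetical Python illustration assuming a first-fit-decreasing policy; the names `pack_sequences` and `max_len` are our own, and the paper's actual sliding-window greedy algorithm may differ in detail.

```python
# Hypothetical illustration of greedy sequence packing; not the paper's
# actual algorithm. Variable-length sequences are packed into bins of a
# fixed token capacity so a batch carries fewer padding tokens.

def pack_sequences(lengths, max_len):
    """Greedily pack sequence lengths into bins of capacity max_len.

    Returns a list of bins, each a list of sequence indices whose
    total length fits within max_len.
    """
    # First-fit decreasing: handle the longest sequences first so they
    # claim bins early and short ones fill the remaining gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []  # free[b] = remaining capacity of bins[b]
    for idx in order:
        seq_len = lengths[idx]
        for b in range(len(bins)):
            if seq_len <= free[b]:       # first bin with enough room
                bins[b].append(idx)
                free[b] -= seq_len
                break
        else:                            # no bin fits: open a new one
            bins.append([idx])
            free.append(max_len - seq_len)
    return bins

# Example: six sequences packed into 512-token windows.
print(pack_sequences([120, 480, 64, 300, 200, 90], max_len=512))
# -> [[1], [3, 4], [0, 5, 2]]
```

Packing this way lets one fixed-shape kernel launch cover several real sequences, which is the property the abstract attributes to the sequence warp method.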
K. Chen and H. Su—These authors contributed equally to this work.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61872377 and the Fund of PDL.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, K., Su, H., Liu, C., Gong, X. (2023). An Efficient Transformer Inference Engine on DSP. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_29
DOI: https://doi.org/10.1007/978-3-031-22677-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9