Abstract
The transformer is one of the most important algorithms in Natural Language Processing (NLP) and has recently been widely adopted in computer vision. Because of its huge computational requirements, current work on transformer acceleration focuses mainly on data-center GPUs, far away from signal sources such as voice and video. The Digital Signal Processor (DSP) is a traditional signal-processing device that is typically deployed at the edge, so deploying deep learning models on edge devices such as DSPs can effectively reduce the processing time of the whole task. However, deploying transformer models efficiently on a DSP poses several challenges. First, the transformer is highly compute-intensive for a DSP. Second, DSPs lack efficient transformer operator libraries. In addition, the variable length of input sequences prevents optimizations such as batching from working well. To address these challenges, we propose a DSP-accelerated transformer inference engine consisting of three components: an efficient transformer operator library built on a very long vector and Very Long Instruction Word (VLIW) architecture; a memory optimization strategy that manages the large volume of intermediate results and alleviates the data traffic caused by insufficient on-chip memory; and a sequence warp method that packs variable-length sequences into one long sequence using a sliding window and a greedy algorithm. Experimental results show that the proposed DSP transformer engine achieves performance comparable to a mainstream NVIDIA GPU, even though the DSP's memory bandwidth is only 1/20 of the GPU's.
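To make the sequence warp idea concrete, the sketch below shows one way to greedily pack variable-length sequences into fixed-capacity windows so that batched execution wastes less compute on padding tokens. This is a minimal, hypothetical Python illustration assuming a first-fit-decreasing policy; the names `pack_sequences` and `max_len` are our own, and the paper's actual sliding-window greedy algorithm may differ in detail.

```python
# Hypothetical illustration of greedy sequence packing; not the paper's
# actual algorithm. Variable-length sequences are packed into bins of a
# fixed token capacity so a batch carries fewer padding tokens.

def pack_sequences(lengths, max_len):
    """Greedily pack sequence lengths into bins of capacity max_len.

    Returns a list of bins, each a list of sequence indices whose
    total length fits within max_len.
    """
    # First-fit decreasing: handle the longest sequences first so they
    # claim bins early and short ones fill the remaining gaps.
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    bins, free = [], []  # free[b] = remaining capacity of bins[b]
    for idx in order:
        seq_len = lengths[idx]
        for b in range(len(bins)):
            if seq_len <= free[b]:       # first bin with enough room
                bins[b].append(idx)
                free[b] -= seq_len
                break
        else:                            # no bin fits: open a new one
            bins.append([idx])
            free.append(max_len - seq_len)
    return bins

# Example: six sequences packed into 512-token windows.
print(pack_sequences([120, 480, 64, 300, 200, 90], max_len=512))
# -> [[1], [3, 4], [0, 5, 2]]
```

Packing this way lets one fixed-shape kernel launch cover several real sequences, which is the property the abstract attributes to the sequence warp method.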
K. Chen and H. Su—These authors contributed equally to this work.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grant No. 61872377 and the Fund of PDL.
Copyright information
© 2023 Springer Nature Switzerland AG
About this paper
Cite this paper
Chen, K., Su, H., Liu, C., Gong, X. (2023). An Efficient Transformer Inference Engine on DSP. In: Meng, W., Lu, R., Min, G., Vaidya, J. (eds) Algorithms and Architectures for Parallel Processing. ICA3PP 2022. Lecture Notes in Computer Science, vol 13777. Springer, Cham. https://doi.org/10.1007/978-3-031-22677-9_29
DOI: https://doi.org/10.1007/978-3-031-22677-9_29
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-22676-2
Online ISBN: 978-3-031-22677-9