
Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks

  • Conference paper in Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2024)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 14945)


Abstract

Transformers have achieved state-of-the-art performance in processing text, images, audio and video. However, they present large computational requirements for both training and inference, and are prone to overfitting on small datasets. To address these challenges, we present Input Compression with Positional Consistency (ICPC), a new data augmentation method that simultaneously improves both generalization and training efficiency. The key insight behind ICPC is that input compression can be used as a data augmentation technique. ICPC applies varying levels of compression to each sample in each epoch. This leads to smaller input sequences being processed by the Transformer, and hence faster training, while also alleviating overfitting by presenting each input with different compression levels. We introduce a consistency-aware position selection method in ICPC that enables accurate processing of compressed inputs without any changes to the underlying Transformer architecture. We detail compression-based augmentation methods for four different modalities: insignificant word pruning for text, resolution modulation for images, spatio-temporal resolution modulation for videos, and spectrogram modulation for audio. In addition to faster training with reduced overfitting, we find that ICPC enhances resilience to input compression during inference. Therefore, we introduce variable-effort inference schemes for accurate and efficient inference. On 9 diverse tasks spanning 4 different modalities, ICPC improves accuracy by up to 1%, while also accelerating training and inference by up to 2.9× and 2.6×, respectively. Code is available at https://github.com/amrnag/ICPC.
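The two ideas in the abstract can be illustrated with a minimal, hypothetical sketch (not the authors' implementation; see the linked repository for the real code): each sample gets a randomly drawn compression level per epoch, and the patches that survive compression keep the position indices they would have on the full-resolution grid, so the same positional embeddings are reused at every compression level. All names below are illustrative.

```python
import random

def sample_compression_level(levels=(1.0, 0.75, 0.5)):
    # ICPC applies a different compression level to each sample
    # in each epoch; here we simply draw one at random.
    return random.choice(levels)

def consistent_patch_positions(grid_side, factor):
    # Consistency-aware position selection (sketch): instead of
    # renumbering the compressed input's patches 0..k-1, keep the
    # indices the selected patches would have on the uncompressed
    # grid_side x grid_side grid, so positional embeddings stay
    # consistent across compression levels.
    side = max(1, round(grid_side * factor))
    stride = grid_side / side
    return [int(r * stride) * grid_side + int(c * stride)
            for r in range(side) for c in range(side)]
```

For example, a ViT with a 14×14 patch grid (196 positions) compressed by a factor of 0.5 processes only 49 patches, whose position indices are a subset of the original 0..195 rather than a fresh 0..48 numbering.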


Notes

  1. CoLA is excluded because the task tests the grammatical correctness of inputs, so stopwords cannot be pruned.
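As an illustration of the text-modality augmentation named in the abstract (insignificant word pruning), the sketch below drops stopwords while recording each surviving token's original position, which is what keeps positional embeddings consistent with the uncompressed input. The stopword set is a tiny hardcoded stand-in for a full list such as NLTK's, and the function name is hypothetical.

```python
# Tiny stand-in stopword list; in practice a full list (e.g. NLTK's)
# would be used.
STOPWORDS = {"the", "a", "an", "is", "are", "of", "to", "in", "on", "and"}

def prune_insignificant_words(sentence):
    # Drop stopwords, but keep each surviving token's original
    # position index so the model sees consistent positions.
    tokens = sentence.split()
    return [(i, t) for i, t in enumerate(tokens)
            if t.lower() not in STOPWORDS]
```

For instance, `prune_insignificant_words("the cat sat on a mat")` keeps `cat`, `sat`, and `mat` at positions 1, 2, and 5, shortening the sequence from six tokens to three.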



Acknowledgement

This work was supported in part by the Center for the Co-Design of Cognitive Systems (CoCoSys), a JUMP2.0 center sponsored by the Semiconductor Research Corporation (SRC) and DARPA, and in part by the National Science Foundation under Award No. 2318101.

Author information

Correspondence to Amrit Nagarajan.



Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Nagarajan, A., Raghunathan, A. (2024). Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science(), vol 14945. Springer, Cham. https://doi.org/10.1007/978-3-031-70362-1_5


  • DOI: https://doi.org/10.1007/978-3-031-70362-1_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-70361-4

  • Online ISBN: 978-3-031-70362-1

  • eBook Packages: Computer Science; Computer Science (R0)
