Abstract
Transformers have achieved state-of-the-art performance in processing text, images, audio, and video. However, they present large computational requirements for both training and inference, and are prone to overfitting on small datasets. To address these challenges, we present Input Compression with Positional Consistency (ICPC), a new data augmentation method that simultaneously improves both generalization and training efficiency. The key insight behind ICPC is that input compression can be used as a data augmentation technique. ICPC applies varying levels of compression to each sample in each epoch. This leads to smaller input sequences being processed by the Transformer, and hence faster training, while also alleviating overfitting by presenting each input at different compression levels. We introduce a consistency-aware position selection method in ICPC that enables accurate processing of compressed inputs without any changes to the underlying Transformer architecture. We detail compression-based augmentation methods for four different modalities: insignificant word pruning for text, resolution modulation for images, spatio-temporal resolution modulation for videos, and spectrogram modulation for audio. In addition to faster training with reduced overfitting, we find that ICPC enhances resilience to input compression during inference. Therefore, we introduce variable-effort inference schemes for accurate and efficient inference. On 9 diverse tasks spanning 4 different modalities, ICPC improves accuracy by up to 1%, while also accelerating training and inference by up to 2.9× and 2.6×, respectively. Code is available at https://github.com/amrnag/ICPC.
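To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of one of the four augmentations named above, resolution modulation for images: each sample is downsampled by a compression factor drawn at random per epoch, so a smaller image yields fewer patches and hence a shorter token sequence for the Transformer. The function name and factor set are illustrative assumptions.

```python
import numpy as np

def resolution_modulate(image: np.ndarray, rng: np.random.Generator,
                        factors=(1, 2, 4)) -> np.ndarray:
    """Downsample an HxWxC image by a randomly chosen integer factor
    via average pooling (hypothetical helper, for illustration only)."""
    f = int(rng.choice(factors))
    if f == 1:
        return image
    h, w, c = image.shape
    h, w = h - h % f, w - w % f  # crop so dimensions divide the factor
    # Average-pool non-overlapping f x f blocks.
    return image[:h, :w].reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
aug = resolution_modulate(img, rng, factors=(2, 4))
print(aug.shape)  # e.g. (112, 112, 3) or (56, 56, 3)
```

In a training loop, the factor would be re-sampled for every sample in every epoch, so the model sees the same input at different compression levels across epochs.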
Notes
- 1.
CoLA is excluded because the task tests the grammatical correctness of inputs, so stopwords cannot be pruned.
Acknowledgement
This work was supported in part by the Center for the Co-Design of Cognitive Systems (CoCoSys), a JUMP2.0 center sponsored by the Semiconductor Research Corporation (SRC) and DARPA, and in part by the National Science Foundation under Award No. 2318101.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nagarajan, A., Raghunathan, A. (2024). Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science(), vol 14945. Springer, Cham. https://doi.org/10.1007/978-3-031-70362-1_5
Print ISBN: 978-3-031-70361-4
Online ISBN: 978-3-031-70362-1