Abstract
Transformers have achieved state-of-the-art performance in processing text, images, audio, and video. However, they present large computational requirements for both training and inference, and are prone to overfitting on small datasets. To address these challenges, we present Input Compression with Positional Consistency (ICPC), a new data augmentation method that simultaneously improves both generalization and training efficiency. The key insight behind ICPC is that input compression can be used as a data augmentation technique. ICPC applies varying levels of compression to each sample in each epoch. This leads to smaller input sequences being processed by the Transformer, and hence faster training, while also alleviating overfitting by presenting each input at different compression levels. We introduce a consistency-aware position selection method in ICPC that enables accurate processing of compressed inputs without any changes to the underlying Transformer architecture. We detail compression-based augmentation methods for four different modalities: insignificant word pruning for text, resolution modulation for images, spatio-temporal resolution modulation for videos, and spectrogram modulation for audio. In addition to faster training with reduced overfitting, we find that ICPC enhances resilience to input compression during inference. Therefore, we introduce variable-effort inference schemes for accurate and efficient inference. On 9 diverse tasks spanning 4 different modalities, ICPC improves accuracy by up to 1%, while also accelerating training and inference by up to 2.9× and 2.6×, respectively. Code is available at https://github.com/amrnag/ICPC.
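To make the core idea concrete, the following is a minimal sketch (not the authors' implementation) of one of the four augmentations named above, resolution modulation for images: each sample is downsampled by a compression factor drawn at random per epoch, so a smaller image yields fewer patches and hence a shorter token sequence for the Transformer. The function name and factor set are illustrative assumptions.

```python
import numpy as np

def resolution_modulate(image: np.ndarray, rng: np.random.Generator,
                        factors=(1, 2, 4)) -> np.ndarray:
    """Downsample an HxWxC image by a randomly chosen integer factor
    via average pooling (hypothetical helper, for illustration only)."""
    f = int(rng.choice(factors))
    if f == 1:
        return image
    h, w, c = image.shape
    h, w = h - h % f, w - w % f  # crop so dimensions divide the factor
    # Average-pool non-overlapping f x f blocks.
    return image[:h, :w].reshape(h // f, f, w // f, f, c).mean(axis=(1, 3))

rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
aug = resolution_modulate(img, rng, factors=(2, 4))
print(aug.shape)  # e.g. (112, 112, 3) or (56, 56, 3)
```

In a training loop, the factor would be re-sampled for every sample in every epoch, so the model sees the same input at different compression levels across epochs.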
Notes
- 1.
CoLA is excluded because the task tests the grammatical correctness of inputs, so stopwords cannot be pruned.
Acknowledgement
This work was supported in part by the Center for the Co-Design of Cognitive Systems (CoCoSys), a JUMP2.0 center sponsored by the Semiconductor Research Corporation (SRC) and DARPA, and in part by the National Science Foundation under Award No. 2318101.
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Nagarajan, A., Raghunathan, A. (2024). Input Compression with Positional Consistency for Efficient Training and Inference of Transformer Neural Networks. In: Bifet, A., Davis, J., Krilavičius, T., Kull, M., Ntoutsi, E., Žliobaitė, I. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2024. Lecture Notes in Computer Science(), vol 14945. Springer, Cham. https://doi.org/10.1007/978-3-031-70362-1_5
Print ISBN: 978-3-031-70361-4
Online ISBN: 978-3-031-70362-1