Abstract
Transformers are now widely used in NLP applications. However, because the cost of self-attention grows quadratically with sequence length, applying standard Transformers to very long documents is intractable. In this paper, we propose EP-Transformer, a hierarchical Transformer focused on efficient context propagation across an entire long document. It introduces two components: a similarity-sensitive fusion block and an unsupervised contrastive learning strategy; the former performs multi-level aggregation of representations across segments, while the latter produces less biased global features. EP-Transformer not only captures local and global contextual information of long sequences effectively, but also reduces computational complexity. To verify its effectiveness on long sequences, we evaluate EP-Transformer on three commonly used classification datasets (IMDB, Hyperpartisan News, and ArXiv) and two QA datasets (WikiHop and TriviaQA). Experimental results demonstrate that EP-Transformer significantly outperforms other baseline models.
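Since the abstract only sketches the overall pipeline (encode each segment locally, propagate context between segments, fuse the global information back into token representations, and train a contrastive objective on the document-level feature), the PyTorch sketch below is purely illustrative; the segment length, mean pooling, cosine-similarity gating, and in-batch InfoNCE loss are our own assumptions, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class HierarchicalEncoderSketch(nn.Module):
    """Segment-then-propagate encoding for long documents (illustrative only)."""

    def __init__(self, d_model=256, seg_len=128, n_heads=4):
        super().__init__()
        self.seg_len = seg_len
        # Shared segment-level encoder: attention is quadratic in seg_len,
        # not in the full document length.
        self.local = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # Lightweight document-level encoder over one vector per segment.
        self.global_enc = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=1)

    def forward(self, x):                               # x: (batch, doc_len, d_model)
        b, n, d = x.shape
        pad = (-n) % self.seg_len                       # pad doc_len to a multiple of seg_len
        x = F.pad(x, (0, 0, 0, pad))
        segs = x.reshape(b, -1, self.seg_len, d)        # (batch, n_seg, seg_len, d)
        n_seg = segs.size(1)

        # 1) Encode every segment independently (local context).
        local = self.local(segs.reshape(b * n_seg, self.seg_len, d))
        local = local.reshape(b, n_seg, self.seg_len, d)

        # 2) Propagate context across segments via pooled segment vectors.
        seg_vec = self.global_enc(local.mean(dim=2))    # (batch, n_seg, d)

        # 3) Similarity-weighted fusion: each token mixes in its segment's
        #    globally informed vector in proportion to their cosine similarity.
        sim = F.cosine_similarity(local, seg_vec.unsqueeze(2), dim=-1)
        fused = local + sim.unsqueeze(-1) * seg_vec.unsqueeze(2)

        # Global document feature (padding positions ignored here for brevity).
        doc_vec = fused.mean(dim=(1, 2))
        tokens = fused.reshape(b, n_seg * self.seg_len, d)[:, :n]
        return tokens, doc_vec

def info_nce(z1, z2, tau=0.1):
    """In-batch contrastive loss between two views of the same documents."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)
    logits = z1 @ z2.t() / tau                          # (batch, batch)
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

The hierarchy is where the complexity saving comes from: with segments of length s, attention costs roughly O(n·s) inside segments plus O((n/s)²) across segment vectors, instead of O(n²) over the full document.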
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Xie, C., et al. (2023). EP-Transformer: Efficient Context Propagation for Long Document. In: Liu, F., Duan, N., Xu, Q., Hong, Y. (eds.) Natural Language Processing and Chinese Computing. NLPCC 2023. Lecture Notes in Computer Science, vol. 14303. Springer, Cham. https://doi.org/10.1007/978-3-031-44696-2_45
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-44695-5
Online ISBN: 978-3-031-44696-2