Knowledge-Based Systems

Volume 238, 28 February 2022, 107872

Hierarchical BERT with an adaptive fine-tuning strategy for document classification

https://doi.org/10.1016/j.knosys.2021.107872

Highlights

  • The hierarchical BERT model consists of a local encoder and a global encoder.

  • An adaptive fine-tuning strategy improves the performance of the PLMs.

  • An attention-based gated memory network models global information.

Abstract

Pretrained language models (PLMs) have achieved impressive results and have become vital tools for various natural language processing (NLP) tasks. However, applying these PLMs to document classification is problematic when the document length exceeds the maximum acceptable input length of the PLM, since the excess portion is truncated. If the keywords lie in the truncated part, the performance of the model declines. To address this problem, this paper proposes a hierarchical BERT with an adaptive fine-tuning strategy (HAdaBERT). It consists of a BERT-based model as the local encoder and an attention-based gated memory network as the global encoder. In contrast to existing PLMs that directly truncate documents, the proposed model treats a part of the document as a region, dividing the input document into several containers. The useful information in each container is extracted by the local encoder and composed by the global encoder according to its contribution to the classification. To further improve the performance of the model, this paper proposes an adaptive fine-tuning strategy, which dynamically decides which layers of BERT to fine-tune for each input text instead of fine-tuning all layers. Experimental results on different corpora indicate that this method outperforms existing neural networks for document classification.

Introduction

With the rapid development of the Internet, the volume of social media text is growing explosively. Classifying such a large volume of documents is essential to make them more manageable and, ultimately, to obtain valuable insights related to politics, business and manufacturing. Because human agents cannot efficiently manage the incoming volume of text, document classification techniques are used to automatically assign each document to one or more categories based on its content and features. Document classification has a wide range of applications, including news topic classification [1], sentiment analysis [2], [3], subject labeling of academic papers [4], and text summarization [5].

Practically, document classification has internal structural properties that distinguish it from sentence classification. Documents are often composed of multiple sentences and contain more words than conventional sentences. Due to the use of rhetoric, important information in a document may be distributed across different local positions. There are also more complex and ambiguous semantic relationships between sentences, making document classification a challenging task.

Deep learning models have made impressive progress in various NLP tasks. Based on distributed word representations, several neural networks have been proposed for document classification, including convolutional neural networks (CNNs) [6], [7], recurrent neural networks (RNNs) [8], [9], gated recurrent unit (GRU) networks [10] and long short-term memory (LSTM) networks [11]. CNNs have been successful in computer vision and have also been applied to document classification. Assuming that each token in a document does not contribute equally to the classification, self-attention [12] and dynamic routing [13], [14], processes that automatically align text and emphasize important tokens, have been proposed, further improving the performance of CNNs and of GRU and LSTM networks. Moreover, hierarchical attention networks [15] have been proposed for document classification; they perform semantic modeling with GRUs at the sentence level and the document level, respectively. However, this can lead to the loss of syntactic dependencies between tokens in the context of a sentence.
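To make the attention idea concrete, the following minimal sketch (an illustrative additive-attention pooling layer in PyTorch, not the cited models) shows how token representations produced by any sequence encoder can be weighted by their estimated contribution before classification.

```python
# Illustrative additive-attention pooling (a sketch, not the cited models):
# each token's hidden state is weighted by its estimated contribution and the
# weighted sum serves as the sequence representation for classification.
import torch
import torch.nn as nn

class AttentionPooling(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from any CNN/GRU/LSTM encoder
        weights = torch.softmax(self.score(hidden_states), dim=1)  # (batch, seq_len, 1)
        return (weights * hidden_states).sum(dim=1)                # (batch, hidden_dim)
```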

Recently, pretrained language models, including BERT [16], ALBERT [17] and RoBERTa [18], have been successfully applied to various NLP tasks. These PLMs eliminate the need to build a model from scratch for downstream tasks and adopt transformers [19] and self-attention mechanisms [12] to learn high-quality contextual representations of text via transfer learning. Typically, PLMs are first fed a large amount of unannotated data and are trained with a masked language model or next sentence prediction objective to learn how words are used and how the language is written in general. Then, the models are transferred to another task, where they are fed a smaller task-specific dataset. For document classification, the input sequence can be very long, but the maximum input length of the BERT model is limited in most cases [20]. Once a document exceeds the preset maximum input length, the excess part is directly truncated. Fig. 1 shows an example of the DocBERT model, which truncates the input documents to 512 tokens to fine-tune BERT and predict document categories [20]. As shown in Fig. 1, the strongly subjective phrase "amongst the best and highly recommended" is truncated, which may lead to an incorrect classification.
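The truncation behavior can be reproduced with the HuggingFace `transformers` tokenizer; the snippet below illustrates the problem and the container-style alternative described later, and is not the paper's exact preprocessing code. The container size of 128 tokens is a hypothetical choice.

```python
# Reproducing the truncation issue with the HuggingFace tokenizer: only the
# first 512 tokens survive, so a decisive phrase at the end of a long review
# (as in Fig. 1) is silently discarded.
from transformers import BertTokenizerFast

tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
document = "plot summary and minor details ... " * 200 + "amongst the best and highly recommended."

truncated = tokenizer(document, truncation=True, max_length=512)
print(len(truncated["input_ids"]))  # at most 512: the closing phrase is lost

# Container-style alternative (a sketch of the idea, not the paper's splitter):
# keep every token by cutting the document into fixed-size regions.
token_ids = tokenizer(document, add_special_tokens=False)["input_ids"]
container_size = 128  # hypothetical region length
containers = [token_ids[i:i + container_size]
              for i in range(0, len(token_ids), container_size)]
print(len(containers))  # every region, including the end of the document, is kept
```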

Another obvious issue in applying BERT to document classification is computational cost. As the length of the input sequence increases, the demand for computational resources increases dramatically, since the time complexity of each transformer layer grows quadratically with the input length. Recent studies have indicated that fine-tuning all layers (usually 12 layers in BERT) does not necessarily lead to the best performance but does increase the training cost [21]. Based on this, one intuitive method is to select the layers to be fine-tuned in either the training or inference phase. However, existing works apply a single manually specified strategy to every sample [22] and fail to dynamically change the layer selection for different inputs.
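A minimal sketch of selective fine-tuning is shown below, assuming the HuggingFace `transformers` implementation of BERT: all encoder layers are frozen except a chosen subset, so only those layers (plus any task head added on top) receive gradient updates. Here the subset is fixed by hand for illustration, whereas the strategy proposed in this paper chooses it per input with a policy network.

```python
# Selective fine-tuning sketch (assumes the HuggingFace `transformers` BERT):
# freeze every encoder layer except a chosen subset so that only those layers
# receive gradient updates during training.
from transformers import BertModel

bert = BertModel.from_pretrained("bert-base-uncased")
layers_to_tune = {10, 11}  # hypothetical fixed choice: only the top two layers

for param in bert.parameters():
    param.requires_grad = False          # freeze embeddings and all layers
for idx, layer in enumerate(bert.encoder.layer):
    if idx in layers_to_tune:
        for param in layer.parameters():
            param.requires_grad = True   # unfreeze only the selected layers
```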

In this paper, a hierarchical BERT model with an adaptive fine-tuning strategy (HAdaBERT) is proposed to address the aforementioned problems. As shown in Fig. 2, the HAdaBERT model consists of two main parts that model the document representation hierarchically: a local encoder and a global encoder. Considering that a document has a natural hierarchical structure, i.e., a document contains multiple sentences and each sentence contains multiple words, the local encoder learns sentence-level features while the global encoder composes these features into the final representation. In contrast to existing PLMs that directly truncate the document to the maximum input length, the proposed HAdaBERT uses part of the document as a region. Instead of placing a single sentence in each container for the local encoder, a container division strategy is introduced to obtain effective local information such that syntactic dependencies are preserved. The key information in each container is extracted by the local encoder, and the global encoder then sequentially composes the syntactic relationships between containers with an attention-based gated memory network. By using both encoders in a hierarchical architecture, the model effectively captures local information and long-term dependencies in long documents. Furthermore, an adaptive fine-tuning strategy is proposed that applies a policy network to adaptively select the optimal layers of BERT to fine-tune for each input sample, improving model performance during both training and inference. This strategy is automatic and general, and it can be extended to other pretrained language models.
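The following simplified sketch illustrates the hierarchical idea under stated assumptions: a GRU with attention pooling stands in for the paper's attention-based gated memory network (whose exact formulation is given in Section 3), and the container size and class names are illustrative.

```python
# Simplified hierarchical sketch: a BERT local encoder produces one vector per
# container, and a GRU with attention pooling (standing in for the paper's
# attention-based gated memory network) composes them into a document vector.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class HierarchicalDocumentClassifier(nn.Module):
    def __init__(self, num_classes: int, container_size: int = 128):
        super().__init__()
        self.tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
        self.local_encoder = BertModel.from_pretrained("bert-base-uncased")
        hidden = self.local_encoder.config.hidden_size
        self.global_encoder = nn.GRU(hidden, hidden, batch_first=True)
        self.attention = nn.Linear(hidden, 1)
        self.classifier = nn.Linear(hidden, num_classes)
        self.container_size = container_size

    def forward(self, document: str) -> torch.Tensor:
        ids = self.tokenizer(document, add_special_tokens=False)["input_ids"]
        containers = [ids[i:i + self.container_size]
                      for i in range(0, len(ids), self.container_size)]
        cls_id, sep_id = self.tokenizer.cls_token_id, self.tokenizer.sep_token_id
        # Local encoding: one [CLS] vector per container.
        local = [self.local_encoder(torch.tensor([[cls_id] + c + [sep_id]])).last_hidden_state[:, 0]
                 for c in containers]
        local = torch.stack(local, dim=1)                 # (1, n_containers, hidden)
        # Global composition: weight containers by their contribution.
        states, _ = self.global_encoder(local)
        weights = torch.softmax(self.attention(states), dim=1)
        doc_vector = (weights * states).sum(dim=1)        # (1, hidden)
        return self.classifier(doc_vector)                # (1, num_classes)
```

For example, `HierarchicalDocumentClassifier(num_classes=2)(review_text)` returns class logits for one document; batching and the adaptive fine-tuning policy are omitted here for brevity.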

Empirical experiments were conducted on several corpora, including the AAPD, Reuters, IMDB, and Yelp-2013 datasets. The comparative results show that the proposed HAdaBERT model outperforms several existing neural networks for document classification and effectively addresses the limitations of previous methods. It is also competitive with existing models in learning representations of short sentences. Another observation is that the adaptive fine-tuning strategy yields the largest performance improvement by dynamically selecting the layers to fine-tune.

The remainder of this paper is organized as follows. Section 2 presents and reviews the related works on document classification. Section 3 describes the proposed hierarchical BERT model and the strategy of adaptive fine-tuning. Comparative experiments are conducted in Section 4. Finally, conclusions are drawn in Section 5.

Section snippets

Related works

Document-level classification is a fundamental and challenging task in natural language processing. This section presents a brief review of existing methods for document-level classification, including conventional, hierarchical and pretrained neural networks.

Hierarchical BERT with adaptive fine-tuning

In this section, the hierarchical BERT with an adaptive fine-tuning strategy (HAdaBERT) model is described in detail. Fig. 2 shows the overall architecture of the proposed model, which consists of two main parts that model the document representation hierarchically: a local encoder and a global encoder. The input document is first divided into several containers. In each container, a BERT-based model is fine-tuned to extract high-quality local features. Taking the sequential local features as input, the global encoder composes them into the final document representation according to their contribution to the classification.
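As a generic illustration of gated composition (not the paper's exact equations, which appear in the full Section 3), the cell below uses a learned gate to decide how much of each new container feature is written into a running memory vector.

```python
# Generic gated-memory illustration (not the paper's exact equations): a gate
# decides how much of each new container feature overwrites the running memory.
import torch
import torch.nn as nn

class GatedMemoryCell(nn.Module):
    def __init__(self, hidden_dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * hidden_dim, hidden_dim)
        self.candidate = nn.Linear(2 * hidden_dim, hidden_dim)

    def forward(self, memory: torch.Tensor, container_feature: torch.Tensor) -> torch.Tensor:
        joint = torch.cat([memory, container_feature], dim=-1)
        g = torch.sigmoid(self.gate(joint))    # update gate
        c = torch.tanh(self.candidate(joint))  # candidate memory content
        return g * c + (1.0 - g) * memory      # gated memory update
```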

Experiments

In this section, comparative experiments were conducted to evaluate the performance of hierarchical BERT with an adaptive fine-tuning strategy compared to several existing methods used for document classification.

Conclusion

In this paper, a hierarchical BERT with an adaptive fine-tuning strategy was proposed for document classification. It consists of two parts, a local encoder and a global encoder, which effectively capture both the local and global information of the document. To address the limitations of existing fine-tuning strategies, an adaptive fine-tuning strategy was proposed to customize a specific fine-tuning strategy for each input sample and dynamically select the layers of BERT to be fine-tuned.

CRediT authorship contribution statement

Jun Kong: Investigation, Methodology, Software, Formal analysis, Validation, Writing – original draft. Jin Wang: Conceptualization, Software, Formal analysis, Writing – review & editing, Resources, Funding acquisition. Xuejie Zhang: Project administration, Resources, Supervision, Funding acquisition.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (NSFC) under Grants Nos. 61702443, 61966038 and 61762091. The authors would like to thank the anonymous reviewers for their constructive comments.

References (53)

  • Irsoy, O., et al., Opinion mining with deep recurrent neural networks.
  • Yogatama, D., et al., Generative and discriminative text classification with recurrent neural networks (2017).
  • Chung, J., et al., Empirical evaluation of gated recurrent neural networks on sequence modeling (2014).
  • Tai, K.S., et al., Improved semantic representations from tree-structured long short-term memory networks.
  • Bahdanau, D., et al., Neural machine translation by jointly learning to align and translate (2014).
  • Sabour, S., Frosst, N., Hinton, G.E., Dynamic routing between capsules, in: Proceedings of Advances in Neural Information...
  • Gong, J., Qiu, X., Wang, S., Huang, X., Information aggregation via dynamic routing for sequence encoding, in: Proceedings...
  • Yang, Z., Yang, D., Dyer, C., He, X., Smola, A., Hovy, E., Hierarchical attention networks for document classification, in:...
  • Devlin, J., Chang, M.W., Lee, K., Toutanova, K., BERT: Pre-training of deep bidirectional transformers for language...
  • Lan, Z., et al., ALBERT: A lite BERT for self-supervised learning of language representations (2019).
  • Liu, Y., et al., RoBERTa: A robustly optimized BERT pretraining approach (2019).
  • Vaswani, A., et al., Attention is all you need.
  • Adhikari, A., et al., DocBERT: BERT for document classification (2019).
  • Jawahar, G., et al., What does BERT learn about the structure of language?
  • Sun, C., Qiu, X., Xu, Y., Huang, X., How to fine-tune BERT for text classification? in: China National Conference on...
  • Johnson, R., Deep pyramid convolutional neural networks for text categorization, in: Proceedings of the 55th Annual...

    Jun Kong is currently pursuing his Master’s Degree in the School of Information Science and Engineering, Yunnan University, China. He received a Bachelor’s Degree in Information Engineering from Southwest Forestry University, China. His research interests include natural language processing, text mining, and machine learning.

    Jin Wang is an associate professor in the School of Information Science and Engineering, Yunnan University, China. He holds a Ph.D. in Computer Science and Engineering from Yuan Ze University, Taoyuan, Taiwan, and another Ph.D. in Communication and Information Systems from Yunnan University, Kunming, China. His research interests include natural language processing, text mining, and machine learning.

    Xuejie Zhang is a professor in the School of Information Science and Engineering, and Director of High-Performance Computing Center, Yunnan University, China. He received his Ph.D. in Computer Science and Engineering from the Chinese University of Hong Kong in 1998. His research interests include high performance computing, cloud computing, and big data analytics.

    The code for this paper is available at: https://github.com/JunKong5/HAdaBERT.
