Hierarchical BERT with an adaptive fine-tuning strategy for document classification☆
Introduction
With the rapid development of the Internet, the volume of social media text is growing explosively. Classifying such a large volume of documents is essential to make them more manageable and, ultimately, to obtain valuable insights related to politics, business and manufacturing. Because human agents cannot efficiently manage the incoming volume of text, document classification techniques are used to automatically assign each document to one or more categories based on its content and features. Document classification has a wide range of applications, including news topic classification [1], sentiment analysis [2], [3], subject labeling of academic papers [4], and text summarization [5].
In practice, document classification has structural properties that distinguish it from sentence classification. Documents are composed of multiple sentences and contain far more words than individual sentences. Owing to rhetorical devices, important information in a document may be distributed across different local positions, and the semantic relationships between sentences are more complex and ambiguous, making document classification a challenging task.
Deep learning models have made impressive progress in various NLP tasks. Based on distributed word representations, several neural networks have been proposed for document classification, including convolutional neural networks (CNNs) [6], [7], recurrent neural networks (RNNs) [8], [9], gated recurrent unit (GRU) networks [10] and long short-term memory (LSTM) networks [11]. CNNs have been successful in computer vision and have also been applied to document classification. Under the assumption that each token in a document does not contribute equally to the classification, self-attention [12] and dynamic routing [13], [14], processes that automatically align text and emphasize important tokens, have been proposed, further improving the performance of CNNs and of GRU and LSTM networks. Moreover, a hierarchical attention network [15] has been proposed for document classification; it performs semantic modeling with GRUs at the sentence level and the document level, respectively. However, this approach can lose the syntactic dependencies of tokens within the context of a sentence.
Recently, pretrained language models (PLMs), including BERT [16], ALBERT [17] and RoBERTa [18], have succeeded at various NLP tasks. These PLMs eliminate the need to build a model from scratch for downstream tasks and adopt transformers [19] and self-attention mechanisms [12] to learn high-quality contextual representations of text via transfer learning. Typically, PLMs are first fed a large amount of unannotated data and trained with masked language modeling or next sentence prediction to learn how words are used and how the language is written in general. The models are then transferred to another task, where they are fed a smaller task-specific dataset. For document classification, the input sequence can be very long, but the maximum input length of the BERT model is limited in most cases [20]. Once a document exceeds the preset maximum input length, the excess part is simply truncated. Fig. 1 shows an example in which the DocBERT model truncates the input documents to 512 tokens to fine-tune BERT and predict document categories [20]. As shown in Fig. 1, the strongly subjective phrase "amongst the best and highly recommended" is truncated, which may lead to an incorrect classification.
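The effect of fixed-length truncation can be sketched with a few lines of Python. The tokenization here is a simplified stand-in (a plain word list rather than BERT's subword tokenizer), not the authors' implementation; it only illustrates how a decisive phrase at the end of a long document is silently discarded.

```python
# Minimal sketch of fixed-length truncation, as applied by models such as
# DocBERT. The word-list "tokenizer" is a simplification for illustration.
MAX_LEN = 512

def truncate(tokens, max_len=MAX_LEN):
    """Keep only the first max_len tokens; the tail is discarded."""
    return tokens[:max_len]

# A long review whose decisive subjective phrase appears only at the end.
doc = ["word"] * 510 + ["amongst", "the", "best", "and", "highly", "recommended"]
kept = truncate(doc)
print(len(kept))              # 512
print("recommended" in kept)  # False: the subjective cue is lost
```

With real BERT subword tokenization the cutoff point would differ, but the failure mode is the same: any classification signal beyond position 512 never reaches the model.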
Another obvious issue in applying BERT to document classification is computational cost. As the length of the input sequence increases, the demand for computational resources grows dramatically, since the time complexity of fine-tuning each transformer layer is quadratic in the input length. Recent studies have indicated that fine-tuning all layers (usually 12 in BERT) does not necessarily lead to the best performance but does increase training costs [21]. Accordingly, one intuitive method is to select which layers should be fine-tuned in the training or inference phase. However, existing works use only a manually fixed strategy for every sample [22] and fail to dynamically change the layer selection strategy for different input samples.
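The idea of per-sample layer selection can be sketched as follows. The function and threshold here are hypothetical illustrations, not the paper's policy network: a learned policy would produce the per-layer scores, and only layers scoring above the threshold would be fine-tuned while the rest stay frozen.

```python
# Hypothetical sketch of per-sample adaptive layer selection. A policy
# network (not shown) would score each of BERT's 12 transformer layers for
# a given input; layers above the threshold are fine-tuned, others frozen.
def select_finetuned_layers(policy_scores, threshold=0.5):
    """Return indices of layers whose policy score exceeds the threshold."""
    return [i for i, score in enumerate(policy_scores) if score > threshold]

# Different inputs may receive different fine-tuning strategies.
easy_sample = [0.9, 0.8, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1]
hard_sample = [0.9, 0.8, 0.7, 0.6, 0.6, 0.7, 0.1, 0.1, 0.6, 0.7, 0.8, 0.9]
print(select_finetuned_layers(easy_sample))  # [0, 1]
print(select_finetuned_layers(hard_sample))  # [0, 1, 2, 3, 4, 5, 8, 9, 10, 11]
```

The contrast with a manually fixed strategy is that the selected index set changes with the input rather than being chosen once for the whole dataset.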
In this paper, a hierarchical BERT model with an adaptive fine-tuning strategy (HAdaBERT) is proposed to address the aforementioned problems. As shown in Fig. 2, HAdaBERT models the document representation hierarchically with two main parts: a local encoder and a global encoder. Since a document has a natural hierarchical structure, i.e., a document contains multiple sentences and each sentence contains multiple words, the local encoder learns sentence-level features while the global encoder composes these features into the final representation. In contrast to existing PLMs, which directly truncate a document to the maximum input length, HAdaBERT treats part of the document as a region. Instead of placing a single sentence in each container for the local encoder, a container-division strategy is introduced to obtain effective local information while preserving syntactic dependencies. The local encoder then extracts the key information within each container, and the global encoder sequentially composes the syntactic relationships between containers with an attention-based gated memory network. By combining both encoders in a hierarchical architecture, the model effectively captures local information and long-term dependencies in long documents. Furthermore, an adaptive fine-tuning strategy is proposed that applies a policy network to adaptively select the optimal fine-tuned layers of BERT for each input sample, improving model performance during both training and inference. This strategy is automatic and general, and can be extended to other pretrained language models.
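A container-division strategy of this kind can be sketched as greedy packing of whole sentences, so that no sentence is split mid-way and local syntactic dependencies survive. This is an illustration under assumed details (whitespace token counting, a hypothetical `max_tokens` budget), not the authors' exact algorithm.

```python
# Hedged sketch of container division: whole sentences are packed greedily
# into containers of at most max_tokens tokens each, so every sentence
# stays intact inside one container. Token counting is simplified to
# whitespace splitting for illustration.
def divide_into_containers(sentences, max_tokens=128):
    containers, current, count = [], [], 0
    for sent in sentences:
        n = len(sent.split())
        if current and count + n > max_tokens:
            containers.append(current)  # close the full container
            current, count = [], 0
        current.append(sent)
        count += n
    if current:
        containers.append(current)
    return containers

doc = ["short sentence one .",
       "a slightly longer second sentence here .",
       "third sentence ."]
print(divide_into_containers(doc, max_tokens=12))
```

Each container would then be encoded by the BERT-based local encoder, and the resulting sequence of container features passed to the global encoder.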
Empirical experiments were conducted on several corpora, including the AAPD, Reuters, IMDB, and Yelp-2013 datasets. The comparative results show that the proposed HAdaBERT model outperforms several existing neural networks for document classification and effectively addresses the limitations of previous methods. It is also competitive with existing models at learning representations of short sentences. A further observation is that the adaptive fine-tuning strategy achieves the best performance improvement by dynamically selecting the layers to fine-tune.
The remainder of this paper is organized as follows. Section 2 presents and reviews the related works on document classification. Section 3 describes the proposed hierarchical BERT model and the strategy of adaptive fine-tuning. Comparative experiments are conducted in Section 4. Finally, conclusions are drawn in Section 5.
Related works
Document-level classification is a fundamental and challenging task in natural language processing. This section presents a brief review of existing methods for document-level classification, including conventional, hierarchical and pretrained neural networks.
Hierarchical BERT with adaptive fine-tuning
In this section, the hierarchical BERT with an adaptive fine-tuning strategy (HAdaBERT) model is described in detail. Fig. 2 shows the overall architecture of the proposed model, which consists of two main parts to model the document representation hierarchically: a local encoder and a global encoder. The input document is first divided into several containers. In each container, a BERT-based model is fine-tuned to extract high-quality local features. Taking the sequential local features as input, the global encoder then composes them into the final document representation.
Experiments
In this section, comparative experiments are conducted to evaluate the performance of the hierarchical BERT with an adaptive fine-tuning strategy against several existing methods for document classification.
Conclusion
In this paper, a hierarchical BERT with an adaptive fine-tuning strategy was proposed for document classification. It consists of two parts, a local encoder and a global encoder, which together effectively capture both the local and global information of a document. To address the limitations of existing fine-tuning strategies, an adaptive fine-tuning strategy was proposed to customize a specific fine-tuning strategy for each input sample and dynamically select the layers of BERT to be fine-tuned.
CRediT authorship contribution statement
Jun Kong: Investigation, Methodology, Software, Formal analysis, Validation, Writing – original draft. Jin Wang: Conceptualization, Software, Formal analysis, Resources, Writing – review & editing, Funding acquisition. Xuejie Zhang: Project administration, Resources, Supervision, Funding acquisition.
Declaration of Competing Interest
The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Acknowledgments
This work was supported by the National Natural Science Foundation of China (NSFC) under Grants Nos. 61702443, 61966038 and 61762091. The authors would like to thank the anonymous reviewers for their constructive comments.
References (53)
- Combining attention-based bidirectional gated recurrent neural network and two-dimensional convolutional neural network for document-level sentiment classification, Neurocomputing (2020)
- Finding structure in time, Cogn. Sci. (1990)
- Self-interaction attention mechanism-based text representation for document classification, Appl. Sci. (2018)
- LSTM with sentence representations for document-level sentiment classification, Neurocomputing (2018)
- Automated learning of decision rules for text categorization, ACM Trans. Inf. Syst. (TOIS) (1994)
- Dimensional sentiment analysis using a regional CNN-LSTM model
- SGM: Sequence generation model for multi-label classification
- K.M. Hermann, T. Kočiský, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, P. Blunsom, Teaching machines...
- Convolutional neural networks for sentence classification
- Character-level convolutional networks for text classification
- Opinion mining with deep recurrent neural networks
- Generative and discriminative text classification with recurrent neural networks
- Empirical evaluation of gated recurrent neural networks on sequence modeling
- Improved semantic representations from tree-structured long short-term memory networks
- Neural machine translation by jointly learning to align and translate
- ALBERT: A lite BERT for self-supervised learning of language representations
- RoBERTa: A robustly optimized BERT pretraining approach
- Attention is all you need
- DocBERT: BERT for document classification
- What does BERT learn about the structure of language?
Jun Kong is currently pursuing his Master’s Degree in the School of Information Science and Engineering, Yunnan University, China. He received a Bachelor’s Degree in Information Engineering from Southwest Forestry University, China. His research interests include natural language processing, text mining, and machine learning.
Jin Wang is an associate professor in the School of Information Science and Engineering, Yunnan University, China. He holds a Ph.D. in Computer Science and Engineering from Yuan Ze University, Taoyuan, Taiwan, and another Ph.D. in Communication and Information Systems from Yunnan University, Kunming, China. His research interests include natural language processing, text mining, and machine learning.
Xuejie Zhang is a professor in the School of Information Science and Engineering, and Director of High-Performance Computing Center, Yunnan University, China. He received his Ph.D. in Computer Science and Engineering from the Chinese University of Hong Kong in 1998. His research interests include high performance computing, cloud computing, and big data analytics.
☆ The code for this paper is available at: https://github.com/JunKong5/HAdaBERT.