1 Introduction

MEDLINE®/PubMed® is the National Library of Medicine’s (NLM) premier bibliographic database, and MEDLINE is the indexed subset of the database. MEDLINE currently covers more than 5,200 international journals and contains over 25 million indexed biomedical citations. The database is freely available online and can be searched via the PubMed web interface. From an information retrieval perspective, the unique value of MEDLINE is that citations are manually indexed with a hierarchical controlled vocabulary called Medical Subject Headings (MeSH®). The assigned MeSH descriptors can be used in PubMed to define advanced search queries.

Indexing of MEDLINE articles is a time-consuming and highly specialized activity. NLM indexers review the full text of an article and then assign MeSH descriptors that represent the central concepts as well as every other topic that is discussed to a significant extent. Indexers are required to have a working knowledge of the large MeSH vocabulary and also scientific expertise in the subject indexed.

Since 1990, there has been a steady and sizable increase in the number of articles indexed each year for MEDLINE. Between 1990 and 2018, the number of articles indexed per year increased from about 400,000 to over 900,000, and the NLM expects to index over one million articles annually within a few years. To help indexers cope with their increasing workload, the NLM has developed an automated indexing system called the Medical Text Indexer (MTI) [11]. MTI is a machine learning and rule-based system that takes the article title and abstract as its input and returns predicted MeSH terms as its output. The system improves productivity by providing a pick list of recommended MeSH terms that can be quickly selected by indexers.

Automatic MeSH indexing is a difficult machine learning problem. It is usually treated as a multi-label text classification problem, and the main challenges are the large number of MeSH descriptors and their highly imbalanced frequency distribution. There are over 29,000 MeSH descriptors in the 2019 MeSH vocabulary. At the end of 2018, the most frequent descriptor ‘Humans’ had been indexed more than 17 million times, whereas the 20,000th most frequent descriptor ‘Ananas’ had only been indexed 454 times.

Despite these challenges, effective systems for automatic MeSH indexing have been developed at the NLM and elsewhere. Since 2013, much of the progress in the field has been driven by the large-scale online biomedical semantic indexing task of the BioASQ challenge [14]. For this task, test sets of soon-to-be indexed articles are provided to participants and MeSH descriptor predictions must be submitted within 24 h (i.e. before the articles have been indexed).

Deep learning is a type of machine learning algorithm, based on artificial neural networks, that uses multiple processing layers to learn representations of data with multiple levels of abstraction [7]. In the last few years, deep learning approaches have demonstrated state-of-the-art (SOTA) performance in a wide variety of natural language processing (NLP) tasks, including text classification. Deep learning technologies have also been used to improve automatic MeSH indexing performance. Initially, these technologies were used to enhance existing systems [12], but more recently, end-to-end deep learning models (e.g. [15]) have demonstrated SOTA performance.

This paper presents an end-to-end deep learning model for automatic MeSH indexing that uses a Convolutional Neural Network (CNN) architecture. The model is evaluated by participating in task 7a of the BioASQ 2019 challenge and is shown to have competitive performance, outperforming the current NLM indexing system (MTI) in terms of micro F1 score. The presented CNN architecture has a number of customizations for the MeSH indexing task, and these are shown to improve performance in an ablation study. We perform a preliminary analysis comparing the BioASQ challenge results of the CNN and MTI systems and also highlight the advantages of end-to-end deep learning approaches to automatic MeSH indexing.

2 Related Work

Automatic MeSH indexing is a well-studied multi-label classification problem, and since 2013, the BioASQ challenge has provided a useful benchmark for automatic MeSH indexing research. In recent challenges, two high-performing approaches have emerged: learning to rank based approaches (e.g. MTI [11], DeepMeSH [12]) and end-to-end deep learning approaches (e.g. MeSHProbeNet [15], AttentionMeSH [4]).

Learning to rank [10] is a supervised machine learning technique that is used to solve ranking problems. For automatic MeSH indexing the algorithm is used to rank candidate MeSH descriptors by integrating multiple sources of evidence. MTI uses learning to rank to boost its prediction performance [17], and candidate MeSH descriptors are obtained using MetaMap [1], PubMed Related Citations (PRC) [8], and machine learning algorithms. MetaMap maps biomedical text to UMLS® Metathesaurus concepts. These concepts are then mapped to MeSH descriptors using the Restrict to MeSH [2] algorithm. PRC is a nearest neighbor algorithm that identifies similar articles based on their titles and abstracts. MeSH descriptors from similar articles are considered as candidates. There are some special MeSH descriptors called Check Tags and these cover concepts that are mentioned in almost every article (e.g. Human, Animal, Male, Female, Child, etc.). Twelve of the 40 Check Tags are identified by individually trained binary classifiers. The final list of MeSH descriptors is obtained after applying indexing rules to the ranked candidate descriptors.

The DeepMeSH system has demonstrated consistently high performance in recent BioASQ challenges. It combines learning to rank with a separate model to predict the number of MeSH descriptors. Like MTI, DeepMeSH uses nearest neighbors and binary classifier algorithms to identify candidate descriptors. A novel aspect of DeepMeSH is that it represents the title and abstract as the concatenation of term frequency inverse document frequency (TFIDF) and document to vector [6] (Doc2Vec) features.

Deep neural networks have been shown to be very effective for many NLP problems. Universal language models (e.g. [3]) are currently the SOTA for many tasks, but convolutional neural networks and recurrent neural networks (RNN) still provide excellent performance, often with lower computational cost (e.g. [5, 16]). MeSHProbeNet was the best performing system in the BioASQ 2018 challenge, and it uses an end-to-end deep learning model with an RNN architecture. Specifically, it uses two bidirectional gated recurrent unit (GRU) layers followed by an attention layer to obtain a fixed length embedding of the concatenated title and abstract. The attention layer is novel because it uses multiple independent query vectors (MeSH probes) to generate different embedded representations (views) of the input text. The network output layer has one node for each MeSH descriptor and uses a sigmoid activation function to generate confidence scores between zero and one.

A CNN is a type of neural network that uses convolution operations to extract salient features. They are most commonly applied to image processing problems, but they have also been shown to be effective for many NLP tasks, including text classification. An effective CNN architecture for text classification is presented by Kim et al. [5] in their paper on sentence classification. This architecture uses a convolution layer followed by a max pooling layer to generate a fixed length text representation. A major advantage of CNNs is that they are quick to train because the CNN architecture is highly parallelizable.

Rios et al. [13] were the first to use a CNN model for MeSH indexing, and they trained independent binary classifiers for 12 Check Tags and another 17 hard-to-classify MeSH descriptors. The paper shows an improvement in performance compared to previous work, but it would not be practical to scale their binary relevance approach to the full MeSH vocabulary. Liu et al. [9] propose a CNN for extreme multi-label text classification, and they use a compression layer to allow the simultaneous prediction of up to 670,000 labels. The presented model demonstrated very competitive performance when evaluated on 6 extreme multi-label text classification benchmarks.

3 Methods

The CNN architecture (Fig. 1) is based on the architecture proposed by Liu et al. [9] for extreme multi-label classification. We customize their architecture with separate text inputs for the title and abstract and by adding journal, publication year, and year indexed inputs to the hidden layer. The outputs of the model are confidence scores for each MeSH descriptor. The following sections describe the different aspects of the model architecture in detail.

Fig. 1. The model architecture

3.1 Title and Abstract Embeddings

The article title and abstract serve very different purposes, and we therefore choose to process them as separate inputs. The idea is to make it easy for the model to learn different rules for the title and abstract, if necessary. For example, one might expect the model to assign higher importance to features detected in the title compared to the abstract.

A CNN component is used to process the title and abstract and generate the fixed length embeddings required by the hidden layer. Let \(\varvec{x}_i \in \mathbb {R}^k\) be the k-dimensional word embedding corresponding to the i-th word in the input text. Input text of length m is represented as the concatenation of word embeddings \(\varvec{x}_{1:m} = [\varvec{x}_1, \varvec{x}_2, ...,\varvec{x}_m] \in \mathbb {R}^{mk}\). Let \(\varvec{x}_{i:i+j}\) refer to the concatenation of words \(\varvec{x}_i, \varvec{x}_{i+1}, ..., \varvec{x}_{i+j}\). A convolution operation applies a filter \(\varvec{w} \in \mathbb {R}^{h k}\) to a window of h words to produce a new feature. Feature \(c_i\) is generated from a window of words \(\varvec{x}_{i:i+h-1}\) by

$$\begin{aligned} c_i = f(\varvec{w}*\varvec{x}_{i:i+h-1} + b), \end{aligned}$$
(1)

where f is the ReLU activation function, * is the convolution operation, and \(b \in \mathbb {R}\) is the bias term. A feature vector is obtained by applying the filter to each possible window position:

$$\begin{aligned} \varvec{c} = [c_1,c_2,...,c_{m-h+1}] \in \mathbb {R}^{m-h+1}. \end{aligned}$$
(2)

Next, a dynamic max pooling operation [9] is used to select r features for this particular filter (assuming m is divisible by r):

$$\begin{aligned} \varvec{\hat{c}} = [ \max \{\varvec{c}_{1:\frac{m}{r}}\},..., \max \{\varvec{c}_{m-\frac{m}{r}+1:m}\}] \in \mathbb {R}^r. \end{aligned}$$
(3)

The advantage of dynamic max pooling over standard max pooling is that some position information is retained.

This section has described the process by which r features are extracted by one filter. The model uses multiple filters with different window sizes. The title and abstract text are processed using the same word embeddings and convolution layer weights to produce embeddings \(\varvec{e}_{title}\) and \(\varvec{e}_{abstract}\) respectively. Standard max pooling is used for the title due to its short length.
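The feature extraction of Eqs. 1 to 3 can be illustrated with a minimal pure-Python sketch for a single filter. The function names are hypothetical, the filter is stored as h rows of k weights rather than as one flat vector, and the pooling step splits the actual feature vector into r equal chunks (Eq. 3 indexes up to m, so this assumes the feature vector length is divisible by r, for example through padding). It is not the implementation used in the experiments.

```python
from typing import List


def conv_features(x: List[List[float]], w: List[List[float]], b: float) -> List[float]:
    """Eqs. 1-2: slide one filter spanning h words over m word embeddings."""
    m, h = len(x), len(w)
    features = []
    for i in range(m - h + 1):
        # Dot product between the filter and the window x[i : i + h], plus bias.
        s = b
        for j in range(h):
            s += sum(wv * xv for wv, xv in zip(w[j], x[i + j]))
        features.append(max(0.0, s))  # ReLU activation
    return features


def dynamic_max_pool(c: List[float], r: int) -> List[float]:
    """Eq. 3: split c into r contiguous chunks and keep the maximum of each."""
    if len(c) % r != 0:
        raise ValueError("feature vector length must be divisible by r")
    size = len(c) // r
    return [max(c[p * size:(p + 1) * size]) for p in range(r)]
```

Calling `dynamic_max_pool(c, 1)` reduces to standard max pooling, which is the setting described for the title input.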

3.2 Journal Embedding

Most MEDLINE journals have a narrow topic and are indexed with a small subset of the MeSH vocabulary. Providing the article journal as a model input was found to improve performance, and we expect this is because the model can learn the MeSH descriptor distribution for each journal.

In the presented architecture, the article journal is treated as a categorical input. As such, it could be represented using a sparse one-hot vector, but instead we represent each journal with a fixed size embedding \(\varvec{e}_{journal} \in \mathbb {R}^d\), where d is the embedding size. We hope that an embedded representation will allow for better generalization between journals. The journal embeddings are randomly initialized and learned during training.

3.3 Year Encoding

One of the challenges of automatic MeSH indexing is that the MEDLINE dataset has significant time-variance. There are many factors that cause time-variance, and these include changes to the MeSH vocabulary, changes to indexing policy, changes to the list of indexed journals, and concept drift due to scientific progress and trends. In order to allow the neural network to effectively model the time-variance of the dataset, both the publication and indexing year are provided as inputs. The indexing year is required to model time-variance resulting from changes to the MeSH vocabulary and indexing policy, while the publication year is required to model time-variance due to concept drift.

The year inputs are represented using a special encoding that is intended to capture the sequential nature of time and to facilitate generalization between years. The encoding for a year \(\varvec{e}_{year} \in \{0,1\}^{s}\) from a consecutive range of s years is defined as

$$\begin{aligned} e_{year}^{i} = \left\{ {\begin{array}{*{20}c} {0} &{} {i > \varDelta } \\ {1} &{} {i \le \varDelta } \\ \end{array} } \right. , \end{aligned}$$
(4)

where \(\varDelta \) is the difference between the year and the minimum year that needs to be encoded. Figure 2 is an illustration of this encoding for years between 2014 and 2018.
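As a minimal sketch, Eq. 4 can be written as a short Python function. The function name and arguments are hypothetical, and the index i is assumed to run from 0 to s-1, so that the minimum year is encoded with a single leading one; with one-based indexing the minimum year would instead map to the all-zeros vector.

```python
from typing import List


def encode_year(year: int, min_year: int, num_years: int) -> List[int]:
    """Cumulative ("thermometer") encoding of a year, following Eq. 4.

    Assumes a zero-based index i, which is an interpretation of Eq. 4.
    """
    delta = year - min_year
    if not 0 <= delta < num_years:
        raise ValueError("year is outside the encoded range")
    return [1 if i <= delta else 0 for i in range(num_years)]
```

Under this assumption, adjacent years differ in exactly one position, which is one way the encoding can support generalization between neighboring years.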

Fig. 2. Illustration of the special encoding used for year inputs. The example shows how years between 2014 and 2018 would be encoded.

3.4 Hidden and Classification Layers

The embedded inputs are concatenated to form the input \(\varvec{e}\) to the hidden layer:

$$\begin{aligned} \varvec{e} = [ \varvec{e}_{title}, \varvec{e}_{abstract}, \varvec{e}_{journal}, \varvec{e}_{pub\_year}, \varvec{e}_{year\_indexed}]. \end{aligned}$$
(5)

Next, the hidden layer activations \(\varvec{a}_h\) are computed as

$$\begin{aligned} \varvec{a}_h = f(\varvec{W}_h\varvec{e} + \varvec{b}_h), \end{aligned}$$
(6)

where f is the ReLU activation function, and \(\varvec{W}_h\), \(\varvec{b}_h\) are the hidden layer weights. The final confidence scores \(\varvec{\hat{p}} \in [0, 1]^L\) for each of the L MeSH descriptors are computed as

$$\begin{aligned} \varvec{\hat{p}} = \sigma (\varvec{W}_c\varvec{a}_h + \varvec{b}_c), \end{aligned}$$
(7)

where \(\sigma \) is the sigmoid activation function, and \(\varvec{W}_c\), \(\varvec{b}_c\) are the classification layer weights.

Dropout regularization is implemented to reduce overfitting to the training data, and it is applied to the title and abstract embeddings (\(\varvec{e}_{title}, \varvec{e}_{abstract}\)) and the hidden layer activations (\(\varvec{a}_h\)).
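The forward pass of Eqs. 5 to 7 can be sketched in plain Python as follows. This is an illustrative sketch only: the helper names are hypothetical, matrices are stored as lists of rows, and dropout and batch normalization are omitted for brevity.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(W: Matrix, x: Vector, b: Vector) -> Vector:
    """Compute W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bias for row, bias in zip(W, b)]


def sigmoid(a: float) -> float:
    """Numerically stable logistic function."""
    if a >= 0:
        return 1.0 / (1.0 + math.exp(-a))
    z = math.exp(a)
    return z / (1.0 + z)


def forward(e_title: Vector, e_abstract: Vector, e_journal: Vector,
            e_pub_year: Vector, e_year_indexed: Vector,
            W_h: Matrix, b_h: Vector, W_c: Matrix, b_c: Vector) -> Vector:
    # Eq. 5: concatenate the embedded inputs.
    e = e_title + e_abstract + e_journal + e_pub_year + e_year_indexed
    # Eq. 6: hidden layer with ReLU activation.
    a_h = [max(0.0, a) for a in affine(W_h, e, b_h)]
    # Eq. 7: one sigmoid output (confidence score) per MeSH descriptor.
    return [sigmoid(a) for a in affine(W_c, a_h, b_c)]
```

Because each output uses an independent sigmoid rather than a softmax, the confidence scores do not need to sum to one, which matches the multi-label setting.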

3.5 Optimization

A binary cross-entropy objective function is formulated as

$$\begin{aligned} \min _{\varTheta } - \frac{1}{N}\sum _{n=1}^{N}\sum _{l=1}^{L}[y_{nl}\log (\hat{p}_{nl}) + (1-y_{nl})\log (1-\hat{p}_{nl})], \end{aligned}$$
(8)

where \(\varTheta \) represents the model parameters, N is the number of training examples, and y are the indexer annotations. This objective was minimized using mini-batch gradient descent and the Adam optimizer. Batch normalization was found to improve the model performance and was implemented for the hidden and convolution layers.
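For illustration, the objective in Eq. 8 can be evaluated for a batch of predictions with the sketch below. The function name is hypothetical, and the clipping constant `eps` is an assumption added here to avoid taking the logarithm of zero; it is not part of Eq. 8.

```python
import math
from typing import List


def binary_cross_entropy(y: List[List[int]], p: List[List[float]],
                         eps: float = 1e-7) -> float:
    """Eq. 8: summed over the L labels, averaged over the N examples."""
    total = 0.0
    for y_n, p_n in zip(y, p):
        for y_nl, p_nl in zip(y_n, p_n):
            # Clip the score so that log() is always defined (assumption).
            p_nl = min(max(p_nl, eps), 1.0 - eps)
            total += y_nl * math.log(p_nl) + (1 - y_nl) * math.log(1.0 - p_nl)
    return -total / len(y)
```

Note that the loss is normalized by the number of examples N only, so its scale grows with the number of labels L.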

4 Experiments

4.1 Dataset

The dataset comprises citation data for MEDLINE articles published from 2004 onward. Only articles with both a title and abstract were included, and fully or semi-automatically indexed articles were excluded. Semi-automatic indexing is when MTI has been used as the “first line indexer,” and the results have later been reviewed (and potentially modified) by human indexers. Semi-automatically indexed articles were excluded (in addition to fully automatically indexed articles) because we believe that the indexing may be biased towards MTI’s predictions. The indexing method of an article is provided as an attribute in the latest PubMed XML format. The final dataset contains about 8.5 million articles: 20,000 articles from 2018 were randomly selected for the validation set, and 40,000 articles from 2018 were randomly selected for the ablation study test set. Citation data was downloaded from the MEDLINE/PubMed 2019 annual baseline because the BioASQ training data does not include the indexing year or indexing method.

The model performance was evaluated by participating in the large-scale online biomedical semantic indexing task of the BioASQ 2019 challenge. The challenge required participants to make predictions for 15 test sets of approximately 10,000 articles. The test sets were released weekly, and the overall challenge was divided into 3 batches of 5 consecutive test sets. Citation data for the test set articles was downloaded from the MEDLINE/PubMed daily update files. It would have been possible to use the BioASQ test set citation data, but it was simpler for us to process the daily update files as our system is designed to import data in PubMed XML format.

4.2 Evaluation Metric

Binary MeSH descriptor predictions \(\varvec{\hat{y}} \in \{0,1\}^{L}\) are obtained by applying a single decision threshold to all model outputs. The evaluation metric is the micro F1 score (MiF) and this is defined as the harmonic mean of the micro precision (MiP) and the micro recall (MiR):

$$\begin{aligned} MiF = \frac{2 \cdot MiP \cdot MiR}{MiP + MiR}, \end{aligned}$$
(9)

where

$$\begin{aligned} MiP = \frac{\sum _{n=1}^{N}\sum _{l=1}^{L}y_{nl} \cdot \hat{y}_{nl}}{\sum _{n=1}^{N}\sum _{l=1}^{L}\hat{y}_{nl}}, \end{aligned}$$
(10)
$$\begin{aligned} MiR = \frac{\sum _{n=1}^{N}\sum _{l=1}^{L}y_{nl} \cdot \hat{y}_{nl}}{\sum _{n=1}^{N}\sum _{l=1}^{L}y_{nl}}. \end{aligned}$$
(11)

There is an optimum decision threshold that results in the highest F1 score, and this threshold was determined by a linear search on the validation set.
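The metric in Eqs. 9 to 11 and the threshold search can be sketched as follows. The function names are hypothetical, and the evenly spaced grid of 99 candidate thresholds is an assumption, since the step size of the linear search is not stated.

```python
from typing import List, Tuple


def micro_f1(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Eqs. 9-11: micro-averaged F1 over all articles and descriptors."""
    tp = sum(t * p for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp))
    if tp == 0:
        return 0.0
    mip = tp / sum(sum(row) for row in y_pred)  # micro precision, Eq. 10
    mir = tp / sum(sum(row) for row in y_true)  # micro recall, Eq. 11
    return 2 * mip * mir / (mip + mir)


def binarize(scores: List[List[float]], threshold: float) -> List[List[int]]:
    """Apply a single decision threshold to all confidence scores."""
    return [[1 if s >= threshold else 0 for s in row] for row in scores]


def best_threshold(y_true: List[List[int]], scores: List[List[float]],
                   steps: int = 99) -> Tuple[float, float]:
    """Linear search over evenly spaced thresholds in (0, 1)."""
    best_t, best_f = 0.0, -1.0
    for k in range(1, steps + 1):
        t = k / (steps + 1)
        f = micro_f1(y_true, binarize(scores, t))
        if f > best_f:
            best_t, best_f = t, f
    return best_t, best_f
```

In practice the search would be run once on the validation set, and the selected threshold would then be reused unchanged for the test sets.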

4.3 Configuration

The model was implemented in Keras (v2.1.6) with a TensorFlow (v1.12.0) backend, and its hyperparameters are listed in Table 1. The word embeddings were randomly initialized and trained with the model. The learning rate was reduced by a factor of 3 if the validation set micro F1 score did not improve by more than 0.01 between epochs, and training was stopped early if the F1 score did not improve by more than 0.01 over two epochs. Training the model takes about 1 day on a single NVIDIA Tesla V100 (16 GB) GPU. Making predictions for a test set of 40,000 articles takes about 30 s.
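The learning rate schedule and early stopping rule can be made concrete with a small framework-independent simulation. This is one possible reading of the rule, not the actual training code: improvement is assumed to be measured against the best validation score seen so far, and "over two epochs" is interpreted as two consecutive epochs without sufficient improvement. The function name and the initial learning rate used with it are hypothetical.

```python
from typing import List, Optional, Tuple


def simulate_schedule(val_f1: List[float], initial_lr: float,
                      min_delta: float = 0.01, factor: float = 3.0,
                      stop_patience: int = 2
                      ) -> Tuple[List[Tuple[int, float]], Optional[int]]:
    """Return the learning rate used in each epoch and the stopping epoch."""
    lr = initial_lr
    best = float("-inf")
    stale = 0  # consecutive epochs without sufficient improvement
    history = []
    for epoch, f1 in enumerate(val_f1, start=1):
        history.append((epoch, lr))
        if f1 - best > min_delta:
            best, stale = f1, 0
        else:
            stale += 1
            if stale >= stop_patience:
                return history, epoch  # stop training early
            lr /= factor  # reduce the learning rate for the next epoch
    return history, None
```

Given a sequence of validation scores, the function reports which learning rate each epoch would have used and where training would have stopped.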

Table 1. CNN hyperparameters

4.4 Evaluation Results

We participated in task 7a of the BioASQ 2019 challenge with two systems. The first system (‘CNN’) is the model described in this paper, and the second system (‘CNN Ensemble’) is the same model with ensembling. Ensembling was implemented by training 9 separate ‘CNN’ models and then taking the average of their predictions. Predictions from the ‘CNN’ model were submitted for all 15 test sets of the challenge, while predictions from the ‘CNN Ensemble’ model were only submitted for the last 4 test sets.
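The averaging step of the ensemble can be sketched as follows for a single article. The function name is hypothetical; each inner list is assumed to hold one model's confidence scores for the same ordered set of MeSH descriptors.

```python
from typing import List


def ensemble_average(model_scores: List[List[float]]) -> List[float]:
    """Average per-descriptor confidence scores across independently trained models."""
    n_models = len(model_scores)
    n_labels = len(model_scores[0])
    return [sum(scores[l] for scores in model_scores) / n_models
            for l in range(n_labels)]
```

The averaged scores can then be thresholded in the same way as the scores of a single model.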

Table 2 shows the evaluation results for the top performing systems in the challenge. The average and sample standard deviation of the test set micro F1 score is shown for each system and batch. For teams participating with multiple versions of the same system, this analysis considers only the best performing configuration in each test set. The ‘CNN’ system predictions for batch 3 week 1 were erroneous and are therefore ignored in the analysis. The full results of the 2019 challenge are available on the BioASQ website, and on the website our systems are named ‘ceb 1’ and ‘ceb 1 ensemble’. The challenge results show that, as expected, the CNN model with ensembling outperforms the same model without ensembling. The ‘CNN Ensemble’ system is also found to outperform the current MTI implementations (‘Default MTI’ and ‘MTI First Line Index’) by about 3%. Compared to the other systems in the challenge, the CNN model demonstrated competitive performance, and it was typically the 3rd best performing system across all evaluations.

Table 2. Average and sample standard deviation of test set micro F1 scores for the top performing systems in the BioASQ 2019 challenge.

We were interested in evaluating the performance of a pure deep learning approach to automatic MeSH indexing, and for this reason our challenge systems did not make use of MTI’s predictions. This may have put our systems at a disadvantage because approximately 20% of the articles in the challenge test sets were from semi-automatically indexed journals, and we suspect that the NLM’s semi-automatic indexing methods are biased towards MTI’s predictions. To get an idea of the performance improvement that can be achieved by making use of MTI’s predictions, the performance of a hybrid system was evaluated on the articles in the last four test sets of the challenge. The hybrid system uses ‘MTI First Line Index’ predictions for articles that were semi-automatically indexed (‘Curated’ indexing method in the PubMed XML files) and ‘CNN Ensemble’ predictions for all other articles. The hybrid system is found to achieve a micro F1 score of 0.687, an improvement of approximately 2% over the micro F1 score of the ‘CNN Ensemble’ model.
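The routing rule of the hybrid system amounts to a single conditional, sketched below. The function and argument names are hypothetical; only the ‘Curated’ value is taken from the text, and a missing indexing method is assumed here to be represented as `None`.

```python
from typing import List, Optional


def hybrid_predictions(indexing_method: Optional[str],
                       mti_terms: List[str],
                       cnn_terms: List[str]) -> List[str]:
    """Use MTI predictions for semi-automatically indexed articles, CNN otherwise."""
    if indexing_method == "Curated":
        return mti_terms
    return cnn_terms
```

Because the rule depends on the indexing method recorded in the final citation, it can be used for this kind of retrospective analysis but not at prediction time, when that attribute is not yet known.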

5 Ablation Study

This section presents the results of an ablation study (Table 3) that explores how different aspects of the model architecture contribute to its overall performance. Five different models with ablations were trained and their performance was evaluated on the ablation test set (see Sect. 4.1). The study considers three ablations concerning task specific architectural features (removing separate title and abstract inputs, removing journal input, removing year inputs) and two ablations concerning more general architectural features (removing batch normalization, removing dynamic max pooling). The study finds that the task specific architectural features increase the model performance by 1.2–1.5%, and batch normalization also increases model performance by 1.1%. Dynamic max pooling offers the smallest performance improvement of 0.4%.

Table 3. Model performance with ablations

Even if the model architecture is unchanged, there will be some variation in performance due to the random initialization of parameters and stochastic gradient descent. To understand the scale of this variation, the micro F1 score performance of 10 independently trained ‘CNN’ models was measured on the ablation test set. The sample standard deviation of the micro F1 score was found to be 0.0004, and this gives us confidence that the ablation study results are significant.

6 Discussion

We were able to perform a preliminary analysis comparing the results of the ‘CNN’ and ‘Default MTI’ systems using the BioASQ results and the final indexed MEDLINE citations. We reviewed the results from 78,574 fully human indexed citations containing 979,014 NLM indexed MeSH descriptors. From this set we can see that the ‘Default MTI’ system had slightly higher recall with 615,347 correct MeSH descriptors versus 603,973 for the ‘CNN’ system. However, the ‘CNN’ system was much more precise, with only 288,146 incorrect MeSH descriptors compared to 371,658 incorrect MeSH descriptors for ‘Default MTI’. Note that the CNN decision threshold was selected for best micro F1 score, but this can be adjusted to increase recall at the cost of precision. Looking closer at the MeSH descriptors that both systems used, we see that the ‘Default MTI’ system used 21,158 distinct MeSH descriptors, while the ‘CNN’ system only used 19,745 distinct MeSH descriptors. Of these, 18,710 MeSH descriptors were common to both systems, ‘Default MTI’ had 2,448 unique MeSH descriptors which were correct 43.45% of the time, and the ‘CNN’ system had 1,035 unique MeSH descriptors which were correct 34.65% of the time.
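The counts above can be converted into micro precision, recall, and F1 with a short calculation. The sketch below uses only the figures quoted in this paragraph; the function name is hypothetical, and the resulting values are derived here for illustration rather than reported results.

```python
from typing import Tuple


def precision_recall_f1(correct: int, incorrect: int,
                        gold_total: int) -> Tuple[float, float, float]:
    """Micro precision, recall and F1 from raw descriptor counts."""
    precision = correct / (correct + incorrect)
    recall = correct / gold_total
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


GOLD_TOTAL = 979_014  # NLM indexed descriptors in the 78,574 citations

mti = precision_recall_f1(615_347, 371_658, GOLD_TOTAL)
cnn = precision_recall_f1(603_973, 288_146, GOLD_TOTAL)
```

On this subset the calculation gives a precision of roughly 0.68 for ‘CNN’ against roughly 0.62 for ‘Default MTI’, with recall of roughly 0.62 and 0.63 respectively, which is consistent with the description of the two systems above.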

Looking at the final indexed MeSH descriptors that both systems missed, we see the usual suspects including age related Check Tags (e.g. Young Adult, Adolescent) and sex related Check Tags (e.g. Female, Male). We looked at a small sample of 10 cases where both systems missed the most MeSH descriptors and found that, in 9 out of the 10 cases, the missing information was only available in the full text of the article. This confirmed our suspicions that the information was just not available to the two systems, since the NLM indexers index the article from the full text, while the automated systems only have access to the title and abstract. Our overall impression is that the ‘CNN’ system tends to predict more general MeSH descriptors, whereas the ‘Default MTI’ system does well on certain important MeSH descriptors that we have specifically focused on detecting in the past.

This paper has shown that the CNN model outperforms the current MTI implementation, and from a software engineering perspective, the deep learning approach also has a number of other advantages. A key advantage is that it is an end-to-end system without any dependencies. This will make it easier to deploy and maintain than a complex multi-component system, like MTI. For example, when the MeSH vocabulary needs to be updated at the end of the year, it will take approximately one day to retrain the CNN model. In comparison, it usually takes about a week to update MTI with a new MeSH vocabulary. Another advantage of the CNN model is that it will be possible to support the NLM indexers with a single model instance running on a GPU enabled server. For comparison, MTI uses parallel processing with 70 clients on 6 servers to achieve the level of required throughput. The close to real-time predictions of the CNN model will also make it possible to develop more interactive indexer tools in the future.

A pure deep learning solution does have some disadvantages. A major disadvantage is that a learning based system is unable to make predictions for new MeSH descriptors or follow new indexing rules. MTI also suffers from this problem because it has machine learning components; however, its MetaMap component is able to detect new descriptors based on the information in the UMLS. MTI also uses a lookup list of new terms and their synonyms to ensure they are recommended, and current indexing policy is enforced using manually coded rules.

Another disadvantage of the CNN model is that it is a black box: it does not provide any explanation for why a particular set of MeSH descriptors was predicted. Having more interpretable predictions would be of great benefit to indexers and is something that we plan to look into in the future. For example, it should be possible to determine which n-grams most influence the CNN model’s predictions, and these n-grams could then be highlighted in the text.

7 Conclusion

This paper has presented a CNN model for automatic MeSH indexing. The model demonstrated competitive performance in the BioASQ 2019 task on large-scale online biomedical semantic indexing, outperforming the NLM’s current automatic indexing system, MTI, by about 3%. The paper has also presented an ablation study highlighting how task specific customizations to the model architecture result in improved performance.

In the future, we will explore the possibility of replacing MTI with a deep learning based system. We plan to complete our analysis comparing the strengths and weaknesses of the CNN and MTI systems, and our goal is to achieve higher performance by combining the best aspects of the two systems. Providing the reasons why a particular MeSH descriptor was recommended would be very useful for indexing staff, and we therefore plan to research deep learning architectures that offer more interpretable predictions.