1 Introduction

Deep learning models have been widely used in several NLP tasks since the emergence of large scale labelled datasets. In open domain Question Answering (QA) specifically, several neural network models have been introduced, such as Convolutional Neural Networks (CNN), Recurrent Neural Networks (RNN) using GRUs or LSTMs with attention mechanisms, self-attention networks (Transformers), and pretrained language models such as ELMo and BERT, which can be fine-tuned for the question answering task. Several kinds of QA related tasks are widely studied, such as Answer Sentence Selection, Reading Comprehension and Open QA.

Reading Comprehension (RC) is a QA task where a question and a relevant paragraph are given and the goal is to extract the answer string from the paragraph. The main assumption of this task is that the answer is present in the paragraph, as in SQUAD v1.0 [12]. Variants of this task include unanswerable questions, as in [11]. Answers are usually short phrases or entities. There is a leaderboard on the SQUAD dataset (see Footnote 1) which showcases many models built for this task [1, 2, 14].

Open QA is a QA task where only a question is given and the goal is to retrieve an answer from a set of documents or passages drawn from textual sources such as Wikipedia articles or news. Answers are also usually short phrases or entities. In neural network approaches for Open QA, answers are generally extracted by applying a reading comprehension model to the subset of retrieved documents or passages considered relevant [3, 4].

One of the main differences between Reading Comprehension (RC) and Open QA is that the answer must be present in the paragraphs (or documents) for Reading Comprehension, whereas for Open QA this condition might not hold because the documents retrieved as relevant to the question might not contain the answer. Another characteristic of the Open QA task is that several paragraphs or documents may contain the answer.

The BIOASQ Phase B task provides a dataset for biomedical question answering which is a small scale labelled dataset for factoid questions (779 questions in BIOASQ 7). Each question is associated with multiple relevant paragraphs, some irrelevant ones, and one or several answers. The work of [15] transforms the BIOASQ Phase B dataset into the format of a Reading Comprehension task where each question has an answer text along with its offset in a paragraph which contains the answer. If a paragraph does not contain an answer, it is discarded. This modification of the BIOASQ dataset enables the use of an RC model off-the-shelf.

Using such a model on the BIOASQ dataset, which is a small scale labelled dataset, does not yield performance comparable to that obtained on large scale open domain datasets, due to overfitting. One way of overcoming this problem, as reported by [6, 15], is to pre-train a deep learning model on a large scale dataset and fine-tune the same model on the target small scale dataset. The intuition is that the model learns better representations from a large scale dataset than a randomly initialized model trained only on the small scale dataset.

However, the BIOASQ task resembles an Open QA task more than an RC task because paragraphs without answers exist even though they are considered relevant. Thus we propose a new way to tackle the BIOASQ task by using an Open QA model that takes this particularity into account.

We present a comparison of different pre-training models (reading comprehension and open QA models) for the BIOASQ question answering task, and also report the performance of a single model without pre-training and without fine-tuning to show the importance of this process.

We report the performance of our model on different datasets and show that in some cases it outperforms the state-of-the-art systems of BIOASQ [5, 7, 15] on average.

2 State of the Art

Since BIOASQ 5, deep learning methods have been introduced by [15], who automatically adapt the BIOASQ QA task into a Reading Comprehension task and pre-train the model with the SQUAD v1.0 dataset. A similar approach of pre-training and fine-tuning is used by [5], who pre-train their models using DRQA [1], and by BioBert [7], which uses Bert [2].

The models discussed in this article for the BIOASQ task use automatic annotations as done by [16], who transform the BIOASQ dataset into a reading comprehension dataset: the gold standard answer strings are searched in the snippets for an exact match and are treated as answers only if they are found, i.e., the answer string must be a substring of the snippet.
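
As a rough illustration, the annotation step can be sketched as below in Python. The field names and the exact-match policy (case-sensitive substring search) are our own assumptions for the sketch, not necessarily the choices made by [16].

```python
def annotate_snippets(question, snippets, gold_answers):
    """Convert one BIOASQ question into RC-style examples, keeping only the
    snippets in which a gold answer occurs verbatim (exact substring match)."""
    examples = []
    for snippet in snippets:
        for answer in gold_answers:
            offset = snippet.find(answer)  # exact match; matching policy is an assumption
            if offset != -1:
                examples.append({
                    "question": question,
                    "paragraph": snippet,
                    "answer": answer,
                    "answer_start": offset,
                })
    return examples  # snippets without any match are simply discarded
```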

As the NN models participating in BIOASQ are based on domain adaptation, we first present general approaches used for that purpose before presenting how it is done by the BIOASQ NN models.

2.1 Domain Adaptation

Pre-training is a training process started from randomly initialized model weights. Fine-tuning is also a training process, but it starts from the weights of the pre-trained model rather than from random initialization. Pre-training and fine-tuning together can be termed Domain Adaptation when the domains of the data used for pre-training and fine-tuning differ, for example, open domain and biomedical domain.

Pre-training and fine-tuning or domain adaptation can be done in several ways. The general approaches are listed below.

Type 1 - The task remains the same for pre-training and fine-tuning. Pre-training is done on a large scale dataset from a random initialization of parameters. Fine-tuning is done on a small scale dataset by loading the parameters of the pre-trained model rather than initializing them randomly. This approach is used when the target dataset is small and training a deep neural network on it alone would result in overfitting. This type of pre-training is common in the computer vision field, where models are pre-trained on Imagenet [13] and fine-tuned on target image classification datasets.

Type 2 - The tasks are different for pre-training and fine-tuning. Pre-training is done on a large scale dataset from a random initialization of parameters. Fine-tuning is done on a different model which reuses certain parameters from the pre-trained model, kept frozen (non-trainable), and learns other, randomly initialized parameters on a different task. In NLP, such approaches were initially proposed for sequence labelling tasks by [9] and later evolved into ELMo (Embeddings from Language Models) by [10], which significantly improved the state of the art across a broad range of challenging NLP tasks such as question answering, textual entailment and sentiment analysis. This type of method uses contextual text embeddings obtained from the pre-trained models, which are added as features into downstream models built for another task.

Type 3 - The tasks are different for pre-training and fine-tuning. Pre-training is done on a large scale dataset from a random initialization of parameters. Fine-tuning is done on the pre-trained model by modifying certain layers to fit the new task. Newly added layers can be randomly initialised and trained on the new task together with the pre-trained layers. This approach is similar to Type 2, with the difference that the reference model is slightly modified for the target task rather than a different model being built. This type of approach, proposed by [2], is widely used in NLP tasks such as question answering, textual entailment, sentiment analysis, named entity recognition and relation extraction, which are handled by modifying the final output layer of the original model and fine-tuning. Fine-tuning can be done either by learning all the model parameters or by learning only a part of the model and freezing the rest.
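
A minimal PyTorch-style sketch of this Type 3 pattern is given below; the `encoder` and `output_head` attribute names are hypothetical and do not correspond to the actual BERT implementation of [2].

```python
import torch.nn as nn

def adapt_for_new_task(pretrained_model, hidden_dim, num_outputs, freeze_encoder=False):
    """Type 3 adaptation: keep the pre-trained encoder, swap in a new output layer."""
    if freeze_encoder:
        for param in pretrained_model.encoder.parameters():
            param.requires_grad = False  # fine-tune only the newly added head
    # Randomly initialised output layer for the new task (e.g. span start/end scores).
    pretrained_model.output_head = nn.Linear(hidden_dim, num_outputs)
    return pretrained_model
```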

Our work uses Type 1 domain adaptation and compares the results with Type 3 domain adaptation results reported by [7].

2.2 Deep Learning and Domain Adaptation in BIOASQ

The work by [16] comes under Type 1 domain adaptation, using SQUAD v1.0 pre-training. Since the introduction of pre-trained language models by [9], the works of [2, 10] have been used in several NLP tasks and have been proven to outperform many prior state of the art models. BioBert by [7] has been shown to be useful in biomedical domain tasks such as named entity recognition, relation extraction and BIOASQ question answering. This work belongs to the Type 3 domain adaptation methods: the authors take the Bert model of [2] and re-train it on the same task but on biomedical domain texts. This model is then modified for different biomedical tasks and tested. This method, as reported in [7], achieves state of the art scores on the BIOASQ QA task, listed in Table 3 under the BioBert column.

3 Question Answering Tasks and Models

In this section, we describe the two kinds of question answering tasks and the related models we used for domain adaptation towards biomedical domain.

3.1 Tasks

Question Answering (QA) is a field of research which lies at the intersection of the Natural Language Processing and Information Retrieval disciplines. Several types of tasks exist which are commonly referred to as Question Answering.

In this article, we focus on the Reading Comprehension (RC) task, a.k.a. Machine Reading, and on Open Domain Question Answering, a.k.a. Open QA.

The Reading Comprehension task contains questions, a relevant paragraph per question and answers drawn from the paragraph. RC is also called Answer Extraction because the answer is known to be present in the paragraph.

The Open QA task contains questions and their short answers without any paragraphs. Open QA can be formulated as a parent task which involves two child tasks: (1) retrieving the relevant paragraphs for a question and (2) extracting a short answer from the paragraphs. In Open QA, the first task is generally referred to as paragraph selection or answer sentence selection, and the second task is often modelled as Reading Comprehension although there exist several relevant and irrelevant paragraphs. Unlike RC models, Open QA models should determine whether a paragraph is relevant before extracting the answer.

The BIOASQ phase B task is a question answering task with questions in the biomedical domain. For each question, relevant documents, paragraphs and answers are given. Below is an example from the dataset.

(Example from the BIOASQ dataset: a question with two relevant paragraphs P1 and P2 and its gold standard answer.)

The goal of the BIOASQ question answering task is to extract the correct answer from the supporting data. As shown in the example, one paragraph (P1) contains the gold standard answer and the other (P2) does not (i.e. it does not contain an exact match of the answer string). Therefore this resembles an Open QA task more than a Reading Comprehension task.

The evolution of deep learning methods led to the emergence of large scale datasets for these two types of tasks in open domain Question Answering. We use two models, one for RC and one for Open QA, which are shown in Fig. 1 and described below.

Fig. 1. Left: DRQA - Paragraph Reader (RC task). Right: PSPR - Paragraph Selector and Paragraph Reader model (Open QA task)

3.2 Reading Comprehension - DRQA Model

DRQA’s document reader, developed by [1], is a simple LSTM model for the Reading Comprehension task which takes as input a question and a paragraph and aims at extracting an answer from the paragraph. As per the assumption of the RC task, the answer is always present inside the paragraph as a substring. An overview of the model can be seen in the left part of Fig. 1. Both the question and the paragraph are tokenized and their word embeddings are used as input to the model, i.e. the question words \(Q = \{q_{1}, \ldots ,q_{m}\}\) and paragraph words \(S = \{s_{1}, \ldots ,s_{n}\}\) are sequences which are encoded using an embedding layer of dimension D.

$$\begin{aligned} E(Q) = \{E(q_{1}),..,E(q_{m})\} \end{aligned}$$
(1)
$$\begin{aligned} E(S) = \{E(s_{1}),..,E(s_{n})\} \end{aligned}$$
(2)

A pre-attention mechanism captures the similarity between paragraph words and question words. For this purpose, the feature \(\mathcal {F}_{align}\) shown in Eq. 3 is added as input to the LSTM layer.

$$\begin{aligned} \mathcal {F}_{align}(s_{i}) = \sum _{j} a_{i,j} E(q_j) \end{aligned}$$
(3)

Where \(a_{i,j}\) is,

$$\begin{aligned} a_{i,j} = \frac{\exp \left( \alpha (E(s_{i})) \cdot \alpha (E(q_{j}))\right) }{ \sum _{j'} \exp \left( \alpha (E(s_{i})) \cdot \alpha (E(q_{j'}))\right) } \end{aligned}$$
(4)

which computes the dot product between nonlinear mappings of word embeddings of question and paragraph.
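
A minimal PyTorch sketch of this aligned-attention feature is shown below; we assume, following [1], that \(\alpha(\cdot)\) is a single dense layer with a ReLU nonlinearity, and batching is omitted for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AlignedAttention(nn.Module):
    """f_align feature of Eqs. 3-4: for every paragraph word, a soft
    attention over the question word embeddings."""
    def __init__(self, embed_dim):
        super().__init__()
        # alpha(.): assumed to be one shared dense layer followed by ReLU
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, E_s, E_q):
        # E_s: (n, D) paragraph word embeddings, E_q: (m, D) question word embeddings
        s_proj = F.relu(self.proj(E_s))   # alpha(E(s_i)), shape (n, D)
        q_proj = F.relu(self.proj(E_q))   # alpha(E(q_j)), shape (m, D)
        scores = s_proj @ q_proj.t()      # dot products, shape (n, m)
        a = F.softmax(scores, dim=1)      # a_{i,j}, normalised over j (Eq. 4)
        return a @ E_q                    # f_align(s_i) = sum_j a_{i,j} E(q_j) (Eq. 3)
```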

These representations are then fed into 3-layer bidirectional LSTMs for both the question and the paragraph encodings:

$$\begin{aligned} \{\mathbf {p}_{1},\ldots ,\mathbf {p}_{n}\} = \mathrm {BiLSTM}\left( \{[E(s_{1}); \mathcal {F}_{align}(s_{1})],\ldots ,[E(s_{n}); \mathcal {F}_{align}(s_{n})]\}\right) \end{aligned}$$
(5)
$$\begin{aligned} \mathbf {q} = \mathrm {BiLSTM}\left( \{E(q_{1}),\ldots ,E(q_{m})\}\right) \end{aligned}$$
(6)

These LSTM states are connected to two independent classifiers that use a bilinear term to capture the similarity between paragraph words and question words and compute the probabilities of each token being start or end of the answer span.

$$\begin{aligned} P_{\text {start}}(i) \propto \exp \left( \mathbf {p}_{i} \mathbf {W}_{s} \mathbf {q}\right) \end{aligned}$$
(7)
$$\begin{aligned} P_{\text {end}}(i) \propto \exp \left( \mathbf {p}_{i} \mathbf {W}_{e} \mathbf {q}\right) \end{aligned}$$
(8)

During prediction, we choose the best span from token \(i\) to token \(i'\) such that \(i \le i' \le i+15\) and \(P_{start}(i) \times P_{end}(i')\) is maximized.

To make the scores compatible across paragraphs in one or several retrieved documents, the unnormalized exponential is used and an argmax is taken over all considered paragraph spans for the final prediction, which gives the start and end offsets of the answer span in the paragraph.

The probability of an answer span relative to the other candidate spans can be computed as shown in Eq. 9 of Sect. 3.3. This probability score can be used to return the top 5 answers for the BIOASQ task, as explained in Sect. 4.2.
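
The decoding step described above can be sketched in plain Python as follows; this is a simplified, unbatched version rather than the exact implementation of [1].

```python
def best_span(start_scores, end_scores, max_len=15):
    """Pick the span (i, i') maximising P_start(i) * P_end(i') under the
    constraint i <= i' <= i + max_len. Scores are assumed to be the
    unnormalised exponentials, so products stay comparable across paragraphs."""
    best, best_score = (0, 0), float("-inf")
    for i, s in enumerate(start_scores):
        last = min(i + max_len, len(end_scores) - 1)
        for j in range(i, last + 1):
            score = s * end_scores[j]
            if score > best_score:
                best_score, best = score, (i, j)
    return best, best_score
```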

3.3 Open QA - PSPR Model

The PSPR model is an Open QA model by [8] whose overview is presented in the right part of Fig. 1. This model has two parts, namely a Paragraph Selector (PS) and a Paragraph Reader (PR), in a cascade. Although the PSPR model contains a Reading Comprehension submodule for answer extraction, the main difference is that the answer extraction module is learnt using the paragraph probabilities computed by the paragraph selector. Paragraphs for the questions are retrieved using an information retrieval technique. Then the Paragraph Selector predicts a probability distribution \(Pr\left( p_{i} | q, P\right) \) over all the retrieved paragraphs, where P is the set of paragraphs for the question.

The Paragraph Reader extracts answer spans as in the DRQA model and predicts a probability \(Pr\left( a | q, p_{i}\right) \) for each answer span, where \(p_{i}\) is the \(i^{th}\) paragraph of the paragraph set P.

The reader model gives two probabilities (one for the start and one for the end token, given by the two classifiers) as described in Eqs. 7 and 8. The answer probability \(Pr\left( a | q, p_{i}\right) \) is computed as shown below:

$$\begin{aligned} Pr\left( a | q, p_{i}\right) =\sum _{j} Pr\left( a_{s}^{j}\right) Pr\left( a_{e}^{j}\right) \end{aligned}$$
(9)

The answer with highest probability is returned as the final prediction.

The Paragraph Selector uses tokenized question words \(Q = \{q_{1},\ldots ,q_{m}\}\) and tokenized paragraph words \(P = \{p_{1},\ldots ,p_{n}\}\) which are encoded using an embedding layer of dimension D.

$$\begin{aligned} E(Q) = \{E(q_{1}),..,E(q_{m})\} \end{aligned}$$
(10)
$$\begin{aligned} E(P) = \{E(p_{1}),..,E(p_{n})\} \end{aligned}$$
(11)

An RNN layer encodes the contextual information of the sequences:

$$\begin{aligned} \{\hat{\mathbf {p}}_{i}^{1},\ldots ,\hat{\mathbf {p}}_{i}^{n}\} = \mathrm {RNN}\left( \{E(p_{1}),\ldots ,E(p_{n})\}\right) \end{aligned}$$
(12)
$$\begin{aligned} \{\hat{\mathbf {q}}^{1},\ldots ,\hat{\mathbf {q}}^{m}\} = \mathrm {RNN}\left( \{E(q_{1}),\ldots ,E(q_{m})\}\right) \end{aligned}$$
(13)

Using these hidden representations, a self-attention operation is applied to obtain the question representation \(\hat{\mathbf {q}}\):

$$\begin{aligned} \hat{\mathbf {q}}=\sum _{j} \alpha ^{j} \hat{\mathbf {q}}^{j} \end{aligned}$$
(14)

where \(\alpha ^{j}\) encodes the importance of each question word relative to the other question words and is calculated as:

$$\begin{aligned} \alpha ^{j}=\frac{\exp \left( \mathbf {w}_{b} \hat{\mathbf {q}}^{j}\right) }{\sum _{j'} \exp \left( \mathbf {w}_{b} \hat{\mathbf {q}}^{j'}\right) } \end{aligned}$$
(15)

where \(\mathbf {w}_{b}\) is a learnt weight vector. Finally, the probability of each paragraph is calculated via max-pooling and a softmax layer as shown below:

$$\begin{aligned} Pr\left( p_{i} | q, P\right) =softmax\left( \max _{j}\left( \hat{\mathbf {p}}_{i}^{j} \mathbf {W} \hat{\mathbf {q}}\right) \right) \end{aligned}$$
(16)

where \(\mathbf{W} \) is a learnt weight matrix.
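
A compact PyTorch sketch of Eqs. 14-16 is given below; the unbatched tensor shapes and the way the RNN hidden states are passed in are our own simplifications of the model of [8].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ParagraphSelector(nn.Module):
    """Scores each retrieved paragraph for a question (Eqs. 14-16)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.w_b = nn.Linear(hidden_dim, 1, bias=False)         # weight vector of Eq. 15
        self.W = nn.Linear(hidden_dim, hidden_dim, bias=False)  # bilinear weights of Eq. 16

    def forward(self, q_hidden, p_hidden_list):
        # q_hidden: (m, H) RNN states of the question words (Eq. 13)
        # p_hidden_list: list of (n_i, H) RNN states, one tensor per paragraph (Eq. 12)
        alpha = F.softmax(self.w_b(q_hidden).squeeze(-1), dim=0)  # (m,)  Eq. 15
        q_hat = (alpha.unsqueeze(-1) * q_hidden).sum(dim=0)       # (H,)  Eq. 14
        scores = [(p_hidden @ self.W(q_hat)).max() for p_hidden in p_hidden_list]
        return F.softmax(torch.stack(scores), dim=0)              # Pr(p_i | q, P)  Eq. 16
```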

Since not all the paragraphs contain an answer in the Open QA setting, the probability scores from Eq. 16 should indicate if there exists an answer or not.

During training, the paragraphs containing the answer are labelled as 1 and the rest as 0. During testing, the paragraph with the highest probability is chosen for answer extraction. Combining the two probabilities, the overall answer is the most probable answer according to \(Pr(a | q, P)\) for a question q, which is calculated as:

$$\begin{aligned} Pr(a | q, P)=\sum _{p_{i} \in P} Pr\left( a | q, p_{i}\right) Pr\left( p_{i} | q, P\right) \end{aligned}$$
(17)

For the Paragraph Reader, the DRQA model can be used almost directly, as shown in Fig. 1, with small differences. During training, the reader model extracts the answer only from a paragraph that contains one, i.e. a paragraph whose label (used as the target for Eq. 16) is 1. During testing, the reader model extracts the answer from the paragraph with the highest probability score.

The only difference when adapting this model to BIOASQ is that the top 5 answers must be extracted, not just 1. Therefore, instead of choosing answers only from the most probable paragraph, we select the top 5 answers from the combined probability scores of Eq. 17, which may draw answers from one or several paragraphs.
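
A possible way to implement this selection is sketched below; merging identical answer strings extracted from different paragraphs is our own simplification.

```python
from collections import defaultdict

def top5_answers(answer_probs, paragraph_probs, k=5):
    """answer_probs: one dict {answer_string: Pr(a | q, p_i)} per paragraph;
    paragraph_probs: Pr(p_i | q, P) from the paragraph selector.
    Returns the k answers with the highest combined probability (Eq. 17)."""
    combined = defaultdict(float)
    for spans, p_prob in zip(answer_probs, paragraph_probs):
        for answer, a_prob in spans.items():
            combined[answer] += a_prob * p_prob  # sum over paragraphs (Eq. 17)
    ranked = sorted(combined.items(), key=lambda kv: kv[1], reverse=True)
    return [answer for answer, _ in ranked[:k]]
```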

3.4 Domain Adaptation for BIOASQ Task

Pre-training and fine-tuning or domain adaptation can be done in several ways; the following is a general abstraction of the pre-training and fine-tuning processes for each type:

  • Type 1

    • Build a reference model (Model R) for task A.

    • Train the Model R with a sufficiently large scale dataset on task A.

    • Save the Model R weights and model.

    • DO NOT randomly initialize Model R weights, instead use the saved weights of model R from the previous step.

    • Train the Model R again on a small scale target dataset on task A.

    • Predict and Evaluate.

  • Type 2

    • Build a reference model (Model R) for task A.

    • Train the Model R with a sufficiently large scale dataset on task A.

    • Save the Model R weights and model.

    • Build a model from scratch (Model S) for task B.

    • Model S is built to use some features from model R.

    • Initialize the model S weights randomly (apart from the Model R features which are used right away).

    • Train the Model S on a large or small scale target dataset on task B.

    • Predict and Evaluate.

  • Type 3

    • Build a reference model (Model R) for task A.

    • Train the Model R with a sufficiently large scale dataset on task A.

    • Save the Model R weights and model.

    • Build a Model S upon Model R (by adding or modifying the input or output layers of Model R to work on task B).

    • Initialize the newly added Model S weights randomly (the Model R parameters can all be reused directly, or some can be reused and the rest re-initialized for training).

    • Train the Model S on a large or small scale target dataset on task B.

    • Predict and Evaluate.

In this work, we apply Type 1 domain adaptation.
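
In code, the Type 1 procedure can be summarised as below; `build_qa_model`, the data loaders, the optimizer and the epoch counts are placeholders for illustration rather than the exact DRQA/PSPR training settings.

```python
import torch

def train(model, loader, epochs, lr=1e-3):
    """Generic training loop; the model is assumed to return its training loss."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in loader:
            loss = model(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

def type1_domain_adaptation(model, open_domain_loader, bioasq_loader):
    """Type 1: same task, same architecture, two datasets."""
    # Pre-training on the large open-domain dataset (SQUAD v1.0 or QUASAR-T).
    train(model, open_domain_loader, epochs=40)
    torch.save(model.state_dict(), "pretrained.pt")

    # Fine-tuning on BIOASQ: weights loaded from the pre-trained model,
    # NOT re-initialised randomly.
    model.load_state_dict(torch.load("pretrained.pt"))
    train(model, bioasq_loader, epochs=10)
    torch.save(model.state_dict(), "finetuned.pt")
    return model
```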

The Data for Pre-training. We use one dataset for each of the two tasks: the SQUAD v1.0 dataset for the RC task and the QUASAR-T dataset for the Open QA task. Their main differences are shown below.

  • QUASAR-T, which is based on trivia questions, is generated synthetically, whereas SQUAD is annotated manually by humans on a crowd-sourcing platform.

  • Each question in QUASAR-T is associated with 100 sentence-level passages retrieved from the ClueWeb09 dataset using Lucene, whereas SQUAD v1.0 has 1 relevant paragraph.

  • Some paragraphs in QUASAR-T do not have an answer.Footnote 2

Comparing these differences with the BIOASQ dataset, QUASAR-T resembles BIOASQ more closely than SQUAD v1.0 does, for the following reasons.

  • BIOASQ data has more than one relevant paragraph per question.

  • Some paragraphs do not have an answer.

4 Experiments and Results

For fine-tuning the models we use the BIOASQ datasets. The statistics of the different sets are reported in Table 1.

Table 1. Datasets used in the experiments along with their splits. The numbers represent number of questions.
Table 2. Results reporting the importance of Pre-Training and Fine-Tuning a model. DRQA model by [1] is used for these experiments. S.Acc is Strict Accuracy, L.Acc is Lenient Accuracy (the correct answer is in the top 5) and MRR is Mean Reciprocal Rank.

4.1 Importance of Pre-training and Fine-Tuning

To show the importance of pre-training and fine-tuning for domain adaptation to the biomedical domain, we experimented with three approaches using a single model, DRQA, without altering any hyperparameters (default parameters as used by [1], see Footnote 3).

(1) No-Pre model is the DRQA model trained on BIOASQ dataset only.

(2) No-Fine model is the DRQA model trained on SQUAD v1.0 dataset only.

(3) Pre+Fine is the DRQA model trained on SQUAD v1.0 dataset and fine-tuned on BIOASQ dataset.

Results are shown in Table 2 for different BIOASQ test sets. When a model is only trained on a small dataset like BIOASQ, the results are very low, as shown in the No-Pre column. When a model is trained on a large dataset like SQUAD v1.0, it can be used straight away to predict on the biomedical dataset; No-Fine shows an improvement over No-Pre by doing so. Lastly, Pre+Fine, pre-training on a large scale dataset and fine-tuning on the biomedical dataset, shows an improvement over the other approaches. This set of experiments shows that the best approach is to pre-train on a large scale dataset and then fine-tune on the smaller domain specific dataset.

4.2 Experiments with the Two QA Modellings: DRQA and PSPR

Table 3. DRQA is the Reading Comprehension model by [1], PSPR is the Open QA model by [8], and DRQA+PS selects answers by multiplying the answer probabilities of DRQA with the Paragraph Selector probabilities of PSPR. SOTA scores are reported by [7], who average the best scores from each batch (possibly from multiple different models). Results are from the BIOASQ 4b, 5b and 6b test sets; the 7b test set cannot be evaluated yet due to the lack of gold standard answers. S.Acc is Strict Accuracy, L.Acc is Lenient Accuracy and MRR is Mean Reciprocal Rank. Experiments are done with the original BIOASQ data.

In this section, we experiment mainly with the two models for a Reading Comprehension task and an Open QA task.

For studying the modelling of the BIOASQ QA task as a Reading Comprehension task, we use SQUAD v1.0 dataset for pre-training and experiment with the DRQA model. For studying its modelling as an Open QA task, we use QUASAR-T dataset for pre-training and experiment with PSPR model.

We also experiment with using the paragraph probabilities to rerank the DRQA answers and choose the top 5. As the PSPR model is a cascade of paragraph selector and paragraph reader, we take the paragraph probabilities predicted by the paragraph selector and multiply them with the answer probabilities obtained with the DRQA model to select the top 5 answers with the highest combined probabilities. Note that this approach is not the same as the cascaded PSPR model, because the PSPR reader uses the paragraph probabilities to learn answer extraction, which might have a different impact.

For obtaining the top 5 answers with the DRQA model, if there are more than 5 paragraphs for the question, we take 1 answer from each paragraph and choose the top 5 based on the answer probabilities, so that each paragraph's best answer can contribute to the top 5. If there are 5 paragraphs or fewer, we simply take the top 5 answers based on the answer probabilities.
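
This heuristic can be sketched as follows, assuming `candidates` holds, per paragraph, the (answer, probability) pairs produced by DRQA, each list sorted by decreasing probability.

```python
def drqa_top5(candidates, k=5):
    """candidates: one list of (answer, probability) pairs per paragraph,
    each sorted by decreasing probability."""
    if len(candidates) > k:
        # One answer per paragraph, then keep the k most probable of those.
        pool = [spans[0] for spans in candidates if spans]
    else:
        # k paragraphs or fewer: pool every candidate answer.
        pool = [span for spans in candidates for span in spans]
    pool.sort(key=lambda pair: pair[1], reverse=True)
    return [answer for answer, _ in pool[:k]]
```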

Results are shown in Table 3 for different BIOASQ test sets. We compare the different model results with the BioBert scores reported in [7]. The scores of the PSPR model show better performance than BioBert on Strict and Lenient accuracy on the 5b and 6b test sets. By taking the paragraph probability into account, PSPR ranks the correct answer at the top better than BioBert, which extracts answers only from the longer pre-processed paragraphs that contain correct answers.

Although PSPR has a reader model similar to DRQA, taking the paragraph probability into account seems to improve the answer extraction in the PSPR model.

4.3 Experiments with Longer Contexts (Modified BIOASQ Data)

Table 4. Experiments with data containing longer contexts (document level) by [7]. DRQA is the Reading Comprehension model by [1]. BioBert-Unaltered uses the original BIOASQ dataset with questions and paragraphs which contain answers. BioBert by [7] uses the modified BIOASQ dataset where the paragraphs are replaced by longer contexts (documents from the respective articles). All models are pre-trained on the SQUAD v1.0 dataset and fine-tuned on the BIOASQ dataset. SOTA scores are reported by [7], who average the best scores from each batch (possibly from multiple different models). Results are from the BIOASQ 4b, 5b and 6b test sets; the 7b test set cannot be evaluated yet due to the lack of gold standard answers. S.Acc is Strict Accuracy, L.Acc is Lenient Accuracy and MRR is Mean Reciprocal Rank.

For the BIOASQ task, we noted that the method used by [7] with BioBert modifies the original paragraphs. To build the BioBert model, the authors re-trained the original Bert model of [2] on Pubmed and PMC articles. To apply it to the BIOASQ task, the authors use longer documents from Pubmed (instead of the actual snippets), accessed for each question through the “documents” field given in the BIOASQ data. Therefore this modification of the dataset leads to different results for BioBert compared to its performance on the regular BIOASQ dataset. The exact pre-processing of the BIOASQ dataset is not very clear from the paper, but the authors release the modified dataset in their repository (see Footnote 4).

In order to evaluate the importance of this data modification, we ran three experiments: (1) DRQA with longer contexts, (2) BioBert with unaltered data from BIOASQ, and (3) BioBert with modified paragraphs, as reported by [7] in their paper.

The results are shown in Table 4. The results of BioBert are as presented in [7], where the authors fine-tuned the models first using the SQUAD v1.0 dataset and then adapted them to the BIOASQ data. We used the modified dataset with the DRQA model to determine whether it would improve the performance of the Pre+Fine DRQA model reported in Table 2. We obtained lower performance than that of the DRQA model trained on the original BIOASQ data.

For comparison, we ran the BioBert model on the original BIOASQ data, i.e. the paragraphs given in the BIOASQ data without pre-processing. The results in Table 4 under the column BioBert-Unaltered correspond to this setting. It is evident that the modification performed on the BIOASQ data yields better results with the BioBert model.

5 Conclusion

In this work we have shown the importance of the pre-training and fine-tuning process, a.k.a. domain adaptation, for biomedical domain question answering. We have also compared two QA models, based respectively on (1) the Reading Comprehension task and (2) the Open QA task, and found that the performance is better when using an Open QA model than a Reading Comprehension model.

Regarding the different pre-processing done by [7] on the biomedical dataset, which uses longer contexts from documents instead of the shorter original contexts, we found that the Reading Comprehension model performs worse on the pre-processed longer contexts than on the shorter contexts originally given in the BIOASQ data.

On the other hand, a large pre-trained language model such as BERT performs much better on the pre-processed longer contexts than on the shorter contexts. Future work will focus on training a BERT model on the Open QA task, which better suits the BIOASQ dataset.