1 Introduction

Natural Language Inference (NLI) is a core challenge for natural language understanding [18]. More specifically, the goal of NLI is to identify the logical relationship (entailment, neutral, or contradiction) between a premise and a corresponding hypothesis. Recently, neural network-based models for NLI have attracted increasing attention for their powerful ability to learn sentence representations [1, 25]. There are mainly two classes of models: sequential models [7, 9, 13, 20, 24, 25, 26] and tree-structured models [2, 3, 14, 19, 27].

For the first class of models, sentences are regarded as sequences, and word-level representations are usually used to model the interaction between the premise and hypothesis with an attention mechanism [7, 20, 25]. These models take no account of syntax, yet syntax has been shown to be important for natural language sentence understanding [4, 5]. Owing to the compositional nature of sentences, the same words may produce different semantics under different word orders or syntactic structures, as shown in Fig. 1. The sentences in Fig. 1(a) contain the same words in different orders, and express different meanings. The sentences in Fig. 1(b) have the same word order but different syntactic structures. On the left, “with a telescope” is combined with “man”, expressing that “I saw the man who had a telescope”. On the right, “with a telescope” provides additional information about the action “saw the man”, expressing that “I used the telescope to view the man”. Thus, for language expressions with such subtle semantic differences, sequential models cannot always work better than tree-structured models, and syntax is still worth further exploration.

Fig. 1. Examples that are difficult for sequential structures to understand.

For the second class of models, tree structures are used to learn semantic composition [2, 3, 19, 27], in which leaf nodes are word representations and non-leaf nodes are phrase representations. The representation of the root node is regarded as the sentence representation. Recent evidence [3, 8, 19, 23] reveals that tree-structured models with attention can achieve higher accuracy than sequential models on several tasks. However, the potential of tree-structured networks has not been well exploited for NLI, and the performance of tree-structured models still falls below that of complex sequential models with deeper networks.

Fig. 2. (a) Tree-structured semantic composition: non-leaf nodes are composed following the syntactic tree and represent phrases; syntax-aware attention then performs phrase-level matching between the two sentences. (b) Based on the syntactically composed phrase representations, cross attention and self-attention are applied to extract features, and a classifier predicts the semantic relation between the two sentences.

To further explore the potential of tree structures for improving semantic computation, we propose a syntax-aware attention model for NLI, as shown in Fig. 2. It mainly consists of three sub-components: (1) tree-structured composition; (2) cross attention; and (3) self-attention. The tree-structured composition uses the syntactic tree to generate phrase representations. We then design a cross attention mechanism to model phrase-level matching, which learns the interaction between the two sentences. A self-attention mechanism is also introduced to enhance the semantic representations by capturing context from the syntactic tree within each sentence.

In summary, our contributions are as follows:

  • We propose a syntax-aware attention model for NLI. It learns phrase representations by tree-structured composition based on syntactic structure.

  • We introduce phrase-level matching with cross attention and a self-attention mechanism. The cross attention models the interaction between the two sentences, and the self-attention enhances the semantic representations by capturing context from the syntactic tree within each sentence.

  • We evaluate the proposed model on the SNLI and SciTail datasets, and the results show that our approach models NLI more precisely than previous sequential and tree-structured models.

2 Related Work

Previous work [17, 22, 23] reveals that models using syntactic trees may be more powerful than sequential models on several tasks, such as sentiment classification [23] and neural machine translation [8]. For the NLI task, Bowman et al. [2] use constituency parse trees and explore a tree-structured Tree-LSTM to improve over a sequential LSTM. This method is simple and effective, but ignores the interaction between the two sentences. Munkhdalai and Yu [19] use full binary trees and introduce an attention mechanism to model the interaction between the two sentences through node-by-node matching. More recently, Chen et al. [3] design an enhanced Tree-LSTM, showing that incorporating tree-structured information can further improve performance and that constituency parse trees are more effective than full binary trees. Latent tree structures [27] have also been used to improve semantic computation. However, the existing tree-structured models still fall behind complex sequential models [3, 7, 9, 24, 25].

In this paper, we focus on how to use syntactic structure to improve semantic computation for complex language understanding. We propose a syntax-aware attention model for NLI, which explores tree-structured semantic composition and implements attention-based phrase-level matching between the premise and the hypothesis. Experimental results demonstrate the effectiveness of the proposed model.

3 Approach

The model takes two sentences P and Q with their syntactic trees as input. Let P = [\(p_1\), \(\cdots \), \(p_{i}\), \(\cdots \), \(p_{m}\)] with m words and Q = [\(q_1\), \(\cdots \), \(q_{j}\), \(\cdots \), \(q_{n}\)] with n words. The goal is to predict a label y that indicates the logical relation between P and Q. In this paper, we focus on learning semantic composition over constituency trees. An example of a binarized constituency parse tree is given in Fig. 2(a).

3.1 Tree-Structured Composition

We apply tree-structured composition to P and Q. In our model, each non-leaf node has two child nodes: a left child l and a right child r. We initialize leaf nodes with a BiLSTM [12]. For non-leaf nodes, we adopt the S-LSTM [28] as the composition function. Each S-LSTM unit has two vectors: a hidden state h and a memory cell c.

Let (\(h_t^l\), \(c_t^l\)) and (\(h_t^r\), \(c_t^r\)) represent the left child node l and the right child node r, respectively. We compute the parent node's hidden state \(h_{t+1}\) and memory cell \(c_{t+1}\) with the following equations.

$$\begin{aligned}&i_{t+1}\!=\!\sigma (W_{hi}^l h_t^l \!+\! W_{hi}^r h_t^r \!+\! W_{ci}^l c_t^l \!+\! W_{ci}^r c_t^r \!+\! b_i) \end{aligned}$$
(1)
$$\begin{aligned}&f_{t+1}^l\!=\!\sigma (W_{{hf}_l}^l h_t^l \!+\! W_{{hf}_l}^r h_t^r \!+\! W_{{cf}_l}^l c_t^l \!+\! W_{{cf}_l}^r c_t^r \!+\! b_{f_l})\end{aligned}$$
(2)
$$\begin{aligned}&f_{t+1}^r\!=\!\sigma (W_{{hf}_r}^l h_t^l \!+\! W_{{hf}_r}^r h_t^r \!+\! W_{{cf}_r}^l c_t^l \!+\! W_{{cf}_r}^r c_t^r \!+\! b_{f_r})\end{aligned}$$
(3)
$$\begin{aligned}&u_{t+1}\!=\!\mathrm{tanh}(W_{hu}^l h_t^l + W_{hu}^r h_t^r + b_u)\end{aligned}$$
(4)
$$\begin{aligned}&c_{t+1}\!=\!f_{t+1}^l \odot c_t^l + f_{t+1}^r \odot c_t^r + i_{t+1} \odot u_{t+1}\end{aligned}$$
(5)
$$\begin{aligned}&h_{t+1}\!=\!o_{t+1} \odot \mathrm{tanh}({c}_{t+1}) \end{aligned}$$
(6)

where \(\sigma \) denotes the logistic sigmoid function and \(\odot \) denotes element-wise multiplication of two vectors; \(f_l\) and \(f_r\) are the left and right forget gates; i and o are the input and output gates; W and b are learnable parameters.
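As a concrete illustration, the following PyTorch sketch implements one composition step corresponding to Eqs. (1)–(6). It folds the separate per-child weight matrices into single linear layers over concatenated inputs, and it assumes an output gate computed from the same inputs as the input gate, since the output-gate equation is not shown above.

```python
import torch
import torch.nn as nn

class SLSTMCell(nn.Module):
    """Sketch of the S-LSTM composition step (Eqs. (1)-(6))."""
    def __init__(self, dim):
        super().__init__()
        # One linear layer per gate over [h_l; h_r; c_l; c_r] (or [h_l; h_r] for u);
        # this is equivalent to the separate W^l / W^r matrices in Eqs. (1)-(4).
        self.W_i  = nn.Linear(4 * dim, dim)
        self.W_fl = nn.Linear(4 * dim, dim)
        self.W_fr = nn.Linear(4 * dim, dim)
        self.W_o  = nn.Linear(4 * dim, dim)  # assumed: output-gate equation not shown above
        self.W_u  = nn.Linear(2 * dim, dim)

    def forward(self, h_l, c_l, h_r, c_r):
        hc = torch.cat([h_l, h_r, c_l, c_r], dim=-1)
        i   = torch.sigmoid(self.W_i(hc))    # Eq. (1): input gate
        f_l = torch.sigmoid(self.W_fl(hc))   # Eq. (2): left forget gate
        f_r = torch.sigmoid(self.W_fr(hc))   # Eq. (3): right forget gate
        o   = torch.sigmoid(self.W_o(hc))    # output gate (form assumed)
        u   = torch.tanh(self.W_u(torch.cat([h_l, h_r], dim=-1)))  # Eq. (4)
        c   = f_l * c_l + f_r * c_r + i * u  # Eq. (5): parent memory cell
        h   = o * torch.tanh(c)              # Eq. (6): parent hidden state
        return h, c
```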

We use the hidden state of each node as its phrase representation. The two sentences are then represented by \({h}_p\) = [\({h}_{p_1}\), \(\cdots \), \({h}_{p_i}\), \(\cdots \), \({h}_{p_{2m-1}}\)] and \({h}_q\) = [\({h}_{q_1}\), \(\cdots \), \({h}_{q_j}\), \(\cdots \), \({h}_{q_{2n-1}}\)]. Note that, for P (resp. Q), the tree yields m-1 (resp. n-1) non-leaf nodes as phrase representations and m (resp. n) leaf nodes as word representations, i.e., 2m-1 (resp. 2n-1) nodes in total.
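To make the node counting concrete, a minimal sketch of the bottom-up composition over a binarized tree is given below. It builds on the SLSTMCell sketched above; the nested-pair tree encoding is an illustrative assumption, not the datasets' actual format.

```python
def compose_tree(node, leaf_states, cell, states):
    """Bottom-up composition over a binarized constituency tree.

    `node` is either a leaf index (int) or a (left, right) pair of sub-trees;
    `leaf_states` holds the BiLSTM (h, c) pairs for the m words;
    `states` collects the hidden states of all 2m-1 nodes.
    """
    if isinstance(node, int):              # leaf node: word representation
        h, c = leaf_states[node]
    else:                                  # non-leaf node: phrase representation
        h_l, c_l = compose_tree(node[0], leaf_states, cell, states)
        h_r, c_r = compose_tree(node[1], leaf_states, cell, states)
        h, c = cell(h_l, c_l, h_r, c_r)    # S-LSTM composition (Eqs. (1)-(6))
    states.append(h)
    return h, c
```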

3.2 Cross Attention

Cross attention is utilized to capture the phrase-level relevance between the two sentences. Given the two syntactically composed representations \({h}_p\) and \({h}_q\) for P and Q, we first compute unnormalized attention weights A for every pair of nodes between P and Q with the biaffine attention function [6] as follows:

$$\begin{aligned} {A}_{ij}={{h}_{p_i}}^T {W} {h}_{q_j} + \langle {U}_l,{h}_{p_i} \rangle + \langle {U}_r,{h}_{q_j} \rangle \end{aligned}$$
(7)

where \({W}\in \mathbb {R}^{h \times h}\), \({U}_l \in \mathbb {R}^{h}\), \({U}_r \in \mathbb {R}^{h}\) are learnable parameters, and \(\langle \cdot ,\cdot \rangle \) denotes the inner product. \({p}_i\) and \({q}_j\) are the i-th and j-th node in P and Q, respectively. Next, the relevant semantic information in the other sentence for nodes \(p_i\) and \(q_j\) is extracted as follows:

$$\begin{aligned}&\widetilde{h}_{p_i}=\sum _{j=1}^{2n-1} \frac{exp(A_{ij})}{\sum _{k=1}^{2n-1} exp(A_{ik})} h_{q_j} \end{aligned}$$
(8)
$$\begin{aligned}&\widetilde{h}_{q_j}=\sum _{i=1}^{2m-1} \frac{exp(A_{ij})}{\sum _{k=1}^{2m-1} exp(A_{kj})}h_{p_i} \end{aligned}$$
(9)

Intuitively, the interaction representation \(\widetilde{h}_{p_i}\) is a weighted sum of \(\{{h}_{q_j}\}_{j=1}^{2n-1}\) softly aligned to \({h}_{p_i}\): the more related \({h}_{q_j}\) is to \({h}_{p_i}\), the more its semantics contributes.
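In matrix form, with the node states of each sentence stacked row-wise, Eqs. (7)–(9) can be sketched as follows (the parameter initialization is our own choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiaffineCrossAttention(nn.Module):
    """Sketch of the biaffine scores (Eq. (7)) and soft alignments (Eqs. (8)-(9))."""
    def __init__(self, dim):
        super().__init__()
        self.W   = nn.Parameter(torch.randn(dim, dim) * 0.01)  # bilinear term
        self.U_l = nn.Parameter(torch.zeros(dim))
        self.U_r = nn.Parameter(torch.zeros(dim))

    def forward(self, h_p, h_q):
        # h_p: (2m-1, d) node states of P; h_q: (2n-1, d) node states of Q
        A = h_p @ self.W @ h_q.t()                              # h_p^T W h_q
        A = A + (h_p @ self.U_l).unsqueeze(1) + (h_q @ self.U_r).unsqueeze(0)
        h_p_tilde = F.softmax(A, dim=1) @ h_q                   # Eq. (8)
        h_q_tilde = F.softmax(A, dim=0).t() @ h_p               # Eq. (9)
        return h_p_tilde, h_q_tilde
```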

To further enrich the interaction, we apply a local comparison function with a ReLU layer [10].

$$\begin{aligned}&\overline{h}_{p_i}=[h_{p_i}; \widetilde{h}_{p_i}; |h_{p_i}-\widetilde{h}_{p_i}|; h_{p_i}\odot \widetilde{h}_{p_i}] \end{aligned}$$
(10)
$$\begin{aligned}&{h}_{p_i}^c = \mathrm{ReLU} (W_p \overline{h}_{p_i}+b_p) \end{aligned}$$
(11)
$$\begin{aligned}&\overline{h}_{q_j}=[h_{q_j}; \widetilde{h}_{q_j}; |h_{q_j}-\widetilde{h}_{q_j}|; h_{q_j}\odot \widetilde{h}_{q_j}] \end{aligned}$$
(12)
$$\begin{aligned}&{h}_{q_j}^c = \mathrm{ReLU} (W_q \overline{h}_{q_j}+b_q) \end{aligned}$$
(13)

where W and b are learnable parameters. This operation helps the model further fuse the matching information, and also reduces the dimension of the vector representations to lower model complexity.
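A minimal sketch of this local comparison (Eqs. (10)–(13)), written as one helper shared by both sentences, could be:

```python
import torch

def compare(h, h_tilde, proj):
    """Eqs. (10)-(13): concatenate comparison features, then project with ReLU.

    `proj` is an nn.Linear mapping the 4*d comparison features back to d dimensions.
    """
    feats = torch.cat([h, h_tilde, (h - h_tilde).abs(), h * h_tilde], dim=-1)
    return torch.relu(proj(feats))
```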

After that, nodes \({p_i}\) and \(q_j\) in P and Q are newly represented by \({h}_{p_i}^c\) and \({h}_{q_j}^c\), respectively.

3.3 Self-attention

We introduce a self-attention layer after the cross attention. It captures context from the syntactic tree within each sentence and enhances the node semantic representations.

For sentence P, we first compute self-attention weights S, analogously to Eq. (7) but with a simple inner product.

$$\begin{aligned} {S}_{ij}=\langle {h}_{p_i}^c, {h}_{p_j}^c \rangle \end{aligned}$$
(14)

where \({S}_{ij}\) indicates the relevance between the i-th and j-th nodes in P. Then, we compute the self-attention vector for each node in P as follows:

$$\begin{aligned} \widetilde{h}_{p_i}^c=\sum _{j=1}^{2m-1} \frac{exp({S}_{ij})}{\sum _{k=1}^{2m-1} exp({S}_{ik})} {h}_{p_j}^c \end{aligned}$$
(15)

Intuitively, \(\widetilde{{h}}_{p_i}^c\) augments each node representation with global syntactic context from P and, since \({h}_{p_i}^c\) already encodes the cross attention, indirectly from Q as well.
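In matrix form, Eqs. (14)–(15) amount to a dot-product self-attention over the 2m-1 node vectors of a sentence, sketched below.

```python
import torch
import torch.nn.functional as F

def self_attend(h_c):
    """Eqs. (14)-(15): dot-product self-attention over the node representations.

    `h_c` has shape (2m-1, d); the result has the same shape.
    """
    S = h_c @ h_c.t()                  # Eq. (14): pairwise inner products
    return F.softmax(S, dim=-1) @ h_c  # Eq. (15): attention-weighted context
```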

Similarly, we compute the self-attention vector \(\widetilde{h}_{q_j}^c\) for each node \(q_j\) in Q. Then the comparison function of Eqs. (10)–(13) is applied to (\({h}_{p_i}^c\), \(\widetilde{h}_{p_i}^c\)) and (\({h}_{q_j}^c\), \(\widetilde{h}_{q_j}^c\)) to obtain the enhanced representations \({h}_{p_i}^s\) and \({h}_{q_j}^s\).

Finally, we fuse the cross-attention and self-attention information as follows:

$$\begin{aligned}&\widehat{h}_{p_i} = {h}_{p_i}^c+{h}_{p_i}^s \end{aligned}$$
(16)
$$\begin{aligned}&\widehat{h}_{q_j} = {h}_{q_j}^c+{h}_{q_j}^s \end{aligned}$$
(17)

The representations \(\widehat{h}_{p_i}\) and \(\widehat{h}_{q_j}\) are learned from cross attention between the two syntactically composed trees and then augmented by self-attention. We then pass them to the prediction layer.
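Putting the sketches above together, the node-level flow of Sects. 3.2–3.3 can be summarized as follows. Reusing one comparison projection per sentence across the cross-attention and self-attention stages is a simplification for brevity, not necessarily the model's actual parameterization.

```python
def encode_pair(h_p, h_q, cross_attn, proj_p, proj_q):
    """Illustrative end-to-end node encoding for one sentence pair."""
    h_p_t, h_q_t = cross_attn(h_p, h_q)                   # Eqs. (7)-(9)
    h_p_c = compare(h_p, h_p_t, proj_p)                   # Eqs. (10)-(11)
    h_q_c = compare(h_q, h_q_t, proj_q)                   # Eqs. (12)-(13)
    h_p_s = compare(h_p_c, self_attend(h_p_c), proj_p)    # Eqs. (14)-(15) + comparison
    h_q_s = compare(h_q_c, self_attend(h_q_c), proj_q)    # same for Q
    return h_p_c + h_p_s, h_q_c + h_q_s                   # Eqs. (16)-(17)
```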

3.4 Prediction Layer

Following Chen et al. [3], we perform mean and max pooling over the nodes of each sentence, and use a two-layer 1024-dimensional MLP with ReLU activation as the classifier.
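A rough sketch of this prediction layer is given below; the exact concatenation of pooled features is our assumption, since the text only states that the pooling follows Chen et al. [3].

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Mean/max pooling over node vectors plus a 2-layer 1024-d ReLU MLP."""
    def __init__(self, dim, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, h_p_hat, h_q_hat):
        # h_p_hat: (2m-1, d) nodes of P, h_q_hat: (2n-1, d) nodes of Q
        v_p = torch.cat([h_p_hat.mean(0), h_p_hat.max(0).values], dim=-1)
        v_q = torch.cat([h_q_hat.mean(0), h_q_hat.max(0).values], dim=-1)
        return self.mlp(torch.cat([v_p, v_q], dim=-1))  # class logits
```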

For model training, the objective is to minimize the objective function \(\mathcal {J}(\varTheta )\):

$$\begin{aligned} \mathcal {J}(\varTheta ) \!=\! -\frac{1}{N}\sum _{i=1}^N \log P(y^{(i)}|p^{(i)},q^{(i)};\varTheta ) + \frac{1}{2}\lambda \Vert \varTheta \Vert _2^2 \end{aligned}$$
(18)

where \(\varTheta \) denotes all the learnable parameters, N is the number of instances in the training set, \((p^{(i)},q^{(i)})\) are the sentence pairs, and \(y^{(i)}\) denotes the annotated label for the i-th instance.
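A direct translation of Eq. (18) into code, computed per mini-batch rather than over the whole training set, might look like this:

```python
import torch
import torch.nn.functional as F

def training_loss(logits, labels, parameters, l2=6e-5):
    """Eq. (18): averaged negative log-likelihood plus L2 regularization."""
    nll = F.cross_entropy(logits, labels)                     # -(1/N) sum log P(y|p,q)
    reg = 0.5 * l2 * sum(p.pow(2).sum() for p in parameters)  # (lambda/2) * ||Theta||^2
    return nll + reg
```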

4 Experiments

4.1 Dataset

We evaluate our model on two datasets: the Stanford Natural Language Inference (SNLI) dataset [1] and the SciTail dataset [14]. The syntactic trees used in this paper are produced by the Stanford PCFG Parser 3.5.3 [16] and are provided with these datasets.

The detailed statistical information of the two datasets is shown in Table 1.

Table 1. Statistics of the SNLI and SciTail datasets. Avg.L refers to the average length of a pair of sentences.

4.2 Implementation Details

Following Tay et al. [24], we build word embeddings by concatenating a pre-trained word vector, a learnable word vector and a POS vector, and then apply a ReLU layer to the concatenated vector. We set the word embeddings, the hidden states of the S-LSTM and the ReLU layer to 300 dimensions. The pre-trained word vectors are 300-dimensional GloVe 840B vectors [21] and are fixed during training. The learnable word vectors and POS vectors have 30 dimensions. The batch size is set to 64 for SNLI and 32 for SciTail. We use Adam [15] for training, with an initial learning rate of 5e−4 and an \(l_2\) regularizer strength of 6e−5. For the ensemble model, we average the probability distributions of three single models, as in Duan et al. [7].
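For reference, these settings correspond roughly to the configuration below; the placeholder module merely stands in for the full network of Sect. 3.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 3)   # placeholder for the full model (hidden size 300, 3 classes)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # initial learning rate 5e-4
batch_size = 64             # 64 for SNLI, 32 for SciTail
l2_strength = 6e-5          # applied through the regularization term of Eq. (18)
```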

Table 2. Comparison results on SNLI dataset.

4.3 Comparison Results on SNLI

The comparison results on the SNLI dataset are shown in Table 2.

The first group are sequential models that adopt attention for word-level matching. SAN [13] is a distance-based self-attention network. BiMPM [25] designs a bilateral multi-perspective matching model that matches the sentences in both directions. ESIM [3] incorporates the chain LSTM and tree LSTM. CAFE [24] uses novel factorization layers to compress alignment vectors into scalar-valued features. DR-BiLSTM [9] processes the hypothesis conditioned on the premise. DIIN [11] hierarchically extracts semantic features using CNNs. AF-DMN [7] adopts an attention-fused deep matching network.

The second group are tree-structured models. SPINN [2] uses a Tree-LSTM over constituency parse trees, without attention. NTI [19] and syn Tree-LSTM [3] adopt attention for node matching; NTI uses full binary trees, while syn Tree-LSTM uses constituency parse trees. Compared to Chen et al. [3], we use the same parse trees but a different tree composition function and attention mechanism.

As shown in Table 2, our single and ensemble models achieve 88.8% and 89.5% test accuracy, respectively. The comparison shows that our model outperforms not only the existing tree-structured models but also the state of the art achieved by sequential models on the SNLI dataset.

Table 3. Comparison results on SciTail dataset.
Table 4. Ablation study on SNLI dev and test sets.

4.4 Comparison Results on SciTail

The comparison results on the SciTail dataset are shown in Table 3. SciTail is known to be a more difficult dataset for NLI. The first five models in Table 3 are all implemented in Khot et al. [14]. DGEM is a graph-based attention model using syntactic structures. CAFE [24] adopts an LSTM and attention for word-level matching. DEISTE [26] proposes deep explorations of inter-sentence interaction.

On this dataset, our single model significantly outperforms these previous models, and achieves 85.8% test accuracy.

4.5 Ablation Study

We conduct an ablation study to examine the effect of each key component of our model. As shown in Table 4, the first row is the model that uses the root node representation as the sentence representation, without attention. Adding cross attention and self-attention further improves performance, which confirms the effectiveness of our tree-structured composition and matching model.

Fig. 3. The syntactic trees and the cross-attention result for sentences P and Q.

4.6 Investigation on Attention

In this section, we investigate what information is captured by the attention and visualize the cross-attention results, as shown in Fig. 3. This is an instance from the SciTail test set: {P: all living cells have a plasma membrane that encloses their contents. Q: all types of cells are enclosed by a membrane. Label y: entailment.}. The results show that our syntax-based model semantically aligns word-level expressions (node 13 “encloses” and node 9 “enclosed”) and phrase-level expressions (node 5 “all living cells” and node 7 “all types of cells”) in P and Q, respectively. We also observe that the attention on phrase expressions is more pronounced than on the single words that compose them, such as node 17 in P and node 16 in Q. An intuitive explanation is that the syntax-based model can capture richer semantics through tree-structured composition. Finally, the syntax-based model attends over higher-level tree nodes with rich semantics when considering longer phrases or the full sentence; for example, the larger sub-trees 20, 22 and 23 in P are aligned to the root node 19, which represents the whole semantics of Q. This also indicates that the composition function can effectively compute phrase representations.

Table 5. Some complex examples and their classification. E indicates entailment and N indicates neutral between P and Q.

4.7 Case Study

We show some examples from the SciTail test set in Table 5, comparing the proposed syntax-based model with a sequential model; as the sequential model, we use the representative CAFE model [24]. We compute the 1-gram Bleu score of P with Q as the reference. The Bleu score assumes that the more words two sentences share, the closer their semantics are.
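The 1-gram Bleu score used here is essentially a clipped unigram precision of P against Q; a minimal sketch, omitting the brevity penalty, is:

```python
from collections import Counter

def bleu_1(premise_tokens, hypothesis_tokens):
    """Clipped unigram precision of the premise against the hypothesis (reference)."""
    p_counts, q_counts = Counter(premise_tokens), Counter(hypothesis_tokens)
    overlap = sum(min(count, q_counts[word]) for word, count in p_counts.items())
    return overlap / max(1, sum(p_counts.values()))
```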

Examples A and B are entailment cases, but each has a low Bleu score, which makes the entailment relation harder to recognize. Our syntax-based approach correctly favors entailment in these cases. This indicates that low lexical similarity makes it difficult for the sequential model to extract the related semantics, but that this can be alleviated by introducing syntactic structures.

The second set of examples, C and D, are neutral cases. Each has a high Bleu score, and the sequential model tends to misidentify the relation as entailment, whereas our syntax-based model correctly recognizes the neutral relation. This indicates that syntactic structure is better suited to semantic understanding involving structurally complex expressions.

Finally, examples E and F are cases that both the sequential and the syntactic model get wrong. Example E is an entailment case, but it has a low Bleu score; moreover, the word orders and structures of the two sentences (“compose” and “is composed of”) are quite different, which causes the models to fail to recognize the entailment relation. Example F is a neutral case in which the two sentences have high lexical overlap and similar word orders, which misleads the models into predicting entailment. In such difficult cases, the sentence semantics suffer from issues such as polysemy, ambiguity and vagueness, and the model may need more inference information, such as external knowledge, to better understand the lexical and phrasal semantics and make the correct decision.

5 Conclusions and Future Work

In this paper, we explore the potential of syntactic trees for semantic computation and present a syntax-aware attention model for NLI. It leverages tree-structured composition and phrase-level matching. Experimental results on the SNLI and SciTail datasets show that our model significantly improves performance, and that syntactic structure is important for modeling complex semantic relationships. In the future, we will explore combining syntax with pre-trained language models to further improve performance.