1 Introduction

Natural Language Inference (NLI) is a core challenge for natural language understanding [18]. More specifically, the goal of NLI is to identify the logical relationship (entailment, neutral, or contradiction) between a premise and a corresponding hypothesis. Recently, neural network-based models for NLI have attracted increasing attention for their powerful ability to learn sentence representations [1, 25]. There are mainly two classes of models: sequential models [7, 9, 13, 20, 24, 25, 26] and tree-structured models [2, 3, 14, 19, 27].

For the first class of models, sentences are regarded as sequences, and word-level representations are usually used to model the interaction between the premise and hypothesis with an attention mechanism [7, 20, 25]. These models take no account of syntax, yet syntax has been shown to be important for natural language sentence understanding [4, 5]. Owing to the compositional nature of sentences, the same words may produce different semantics under different word orders or syntactic structures, as shown in Fig. 1. The sentences in Fig. 1(a) contain the same words in different orders, and express different meanings. The sentences in Fig. 1(b) have the same word order but different syntactic structures. On the left, “with a telescope” is combined with “man”, expressing that “I saw the man who had a telescope”. On the right, “with a telescope” provides additional information about the action “saw the man”, expressing that “I used the telescope to view the man”. Thus, for language expressions with such subtle semantic differences, sequential models cannot always work better than tree-structured models, and syntax is still worth further exploration.

Fig. 1. Examples that are difficult for sequential structures to understand.

For the second class of models, tree structures are used to learn semantic composition [2, 3, 19, 27], in which leaf nodes are word representations and non-leaf nodes are phrase representations. The representation of the root node is regarded as the sentence representation. Recent evidence [3, 8, 19, 23] reveals that tree-structured models with attention can achieve higher accuracy than sequential models on several tasks. However, the potential of tree-structured networks has not been well exploited for NLI, and the performance of tree-structured models still falls below that of complex sequential models with deeper networks.

Fig. 2. (a) Tree-structured semantic composition: non-leaf nodes are composed following the syntactic tree and represent phrases; syntax-aware attention then performs phrase-level matching between the two sentences. (b) Based on the syntactically composed phrase representations, cross attention and self-attention are applied to extract features, and a classifier predicts the semantic relation between the two sentences.

To further explore the potential of tree structures for improving semantic computation, we propose a syntax-aware attention model for NLI, as shown in Fig. 2. It mainly consists of three sub-components: (1) tree-structured composition; (2) cross attention; and (3) self-attention. The tree-structured composition uses the syntactic tree to generate phrase representations. We then design a cross attention mechanism to model phrase-level matching, which learns the interaction between the two sentences. A self-attention mechanism is also introduced to enhance the semantic representations by capturing context from the syntactic tree within each sentence.

In summary, our contributions are as follows:

  • We propose a syntax-aware attention model for NLI. It learns phrase representations by tree-structured composition based on syntactic structure.

  • We introduce phrase-level matching with cross attention and a self-attention mechanism. The cross attention models the interaction between the two sentences, and the self-attention enhances the semantic representations by capturing context from the syntactic tree within each sentence.

  • We evaluate the proposed model on the SNLI and SciTail datasets, and the results show that our approach models NLI more precisely than previous sequential and tree-structured models.

2 Related Work

Previous work [17, 22, 23] reveals that models using syntactic trees may be more powerful than sequential models on several tasks, such as sentiment classification [23] and neural machine translation [8]. For the NLI task, Bowman et al. [2] use constituency parse trees and explore a tree-structured Tree-LSTM to improve over a sequential LSTM. This method is simple and effective, but ignores the interaction between the two sentences. Munkhdalai and Yu [19] use full binary trees and introduce an attention mechanism to model the interaction between the two sentences through node-by-node matching. More recently, Chen et al. [3] design an enhanced Tree-LSTM, showing that incorporating tree-structured information can further improve performance and that constituency parse trees are more effective than full binary trees. Latent tree structures [27] have also been used to improve semantic computation. However, the existing tree-structured models still fall behind complex sequential models [3, 7, 9, 24, 25].

In this paper, we focus on how to use syntactic structure to improve semantic computation for complex language understanding. We propose a syntax-aware attention model for NLI, which explores tree-structured semantic composition and implements attention-based phrase-level matching between the premise and the hypothesis. Experimental results demonstrate the effectiveness of the proposed model.

3 Approach

The model takes two sentences P and Q with their syntactic trees as input. Let P = [\(p_1\), \(\cdots \), \(p_{i}\), \(\cdots \), \(p_{m}\)] with m words and Q = [\(q_1\), \(\cdots \), \(q_{j}\), \(\cdots \), \(q_{n}\)] with n words. The goal is to predict a label y that indicates the logical relation between P and Q. In this paper, we focus on learning semantic composition over constituency trees. An example of a binarized constituency parse tree is given in Fig. 2(a).

3.1 Tree-Structured Composition

We apply tree-structured composition to P and Q. In our model, each non-leaf node has two child nodes: a left child l and a right child r. We initialize leaf nodes with a BiLSTM [12]. For non-leaf nodes, we adopt the S-LSTM [28] as the composition function. Each S-LSTM unit has two vectors: a hidden state h and a memory cell c.

Let (\(h_t^l\), \(c_t^l\)) and (\(h_t^r\), \(c_t^r\)) represent the left child node l and the right child node r, respectively. We compute the parent node's hidden state \(h_{t+1}\) and memory cell \(c_{t+1}\) with the following equations.

$$\begin{aligned}&i_{t+1}\!=\!\sigma (W_{hi}^l h_t^l \!+\! W_{hi}^r h_t^r \!+\! W_{ci}^l c_t^l \!+\! W_{ci}^r c_t^r \!+\! b_i) \end{aligned}$$
(1)
$$\begin{aligned}&f_{t+1}^l\!=\!\sigma (W_{{hf}_l}^l h_t^l \!+\! W_{{hf}_l}^r h_t^r \!+\! W_{{cf}_l}^l c_t^l \!+\! W_{{cf}_l}^r c_t^r \!+\! b_{f_l})\end{aligned}$$
(2)
$$\begin{aligned}&f_{t+1}^r\!=\!\sigma (W_{{hf}_r}^l h_t^l \!+\! W_{{hf}_r}^r h_t^r \!+\! W_{{cf}_r}^l c_t^l \!+\! W_{{cf}_r}^r c_t^r \!+\! b_{f_r})\end{aligned}$$
(3)
$$\begin{aligned}&u_{t+1}\!=\!\mathrm{tanh}(W_{hu}^l h_t^l + W_{hu}^r h_t^r + b_u)\end{aligned}$$
(4)
$$\begin{aligned}&c_{t+1}\!=\!f_{t+1}^l \odot c_t^l + f_{t+1}^r \odot c_t^r + i_{t+1} \odot u_{t+1}\end{aligned}$$
(5)
$$\begin{aligned}&h_{t+1}\!=\!o_{t+1} \odot \mathrm{tanh}({c}_{t+1}) \end{aligned}$$
(6)

where \(\sigma \) denotes the logistic sigmoid function and \(\odot \) denotes element-wise multiplication of two vectors; \(f_l\) and \(f_r\) are the left and right forget gates; i and o are the input and output gates; W and b are learnable parameters.
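As a concrete illustration, the following PyTorch sketch implements one composition step corresponding to Eqs. (1)–(6). It folds the separate per-child weight matrices into single linear layers over concatenated inputs, and it assumes an output gate computed from the same inputs as the input gate, since the output-gate equation is not shown above.

```python
import torch
import torch.nn as nn

class SLSTMCell(nn.Module):
    """Sketch of the S-LSTM composition step (Eqs. (1)-(6))."""
    def __init__(self, dim):
        super().__init__()
        # One linear layer per gate over [h_l; h_r; c_l; c_r] (or [h_l; h_r] for u);
        # this is equivalent to the separate W^l / W^r matrices in Eqs. (1)-(4).
        self.W_i  = nn.Linear(4 * dim, dim)
        self.W_fl = nn.Linear(4 * dim, dim)
        self.W_fr = nn.Linear(4 * dim, dim)
        self.W_o  = nn.Linear(4 * dim, dim)  # assumed: output-gate equation not shown above
        self.W_u  = nn.Linear(2 * dim, dim)

    def forward(self, h_l, c_l, h_r, c_r):
        hc = torch.cat([h_l, h_r, c_l, c_r], dim=-1)
        i   = torch.sigmoid(self.W_i(hc))    # Eq. (1): input gate
        f_l = torch.sigmoid(self.W_fl(hc))   # Eq. (2): left forget gate
        f_r = torch.sigmoid(self.W_fr(hc))   # Eq. (3): right forget gate
        o   = torch.sigmoid(self.W_o(hc))    # output gate (form assumed)
        u   = torch.tanh(self.W_u(torch.cat([h_l, h_r], dim=-1)))  # Eq. (4)
        c   = f_l * c_l + f_r * c_r + i * u  # Eq. (5): parent memory cell
        h   = o * torch.tanh(c)              # Eq. (6): parent hidden state
        return h, c
```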

We use the hidden state of each node as its phrase representation. The two sentences are then represented by \({h}_p\) = [\({h}_{p_1}\), \(\cdots \), \({h}_{p_i}\), \(\cdots \), \({h}_{p_{2m-1}}\)] and \({h}_q\) = [\({h}_{q_1}\), \(\cdots \), \({h}_{q_j}\), \(\cdots \), \({h}_{q_{2n-1}}\)]. Note that, for P (resp. Q), the tree yields m-1 (resp. n-1) non-leaf nodes as phrase representations and m (resp. n) leaf nodes as word representations, i.e., 2m-1 (resp. 2n-1) nodes in total.
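To make the node counting concrete, a minimal sketch of the bottom-up composition over a binarized tree is given below. It builds on the SLSTMCell sketched above; the nested-pair tree encoding is an illustrative assumption, not the datasets' actual format.

```python
def compose_tree(node, leaf_states, cell, states):
    """Bottom-up composition over a binarized constituency tree.

    `node` is either a leaf index (int) or a (left, right) pair of sub-trees;
    `leaf_states` holds the BiLSTM (h, c) pairs for the m words;
    `states` collects the hidden states of all 2m-1 nodes.
    """
    if isinstance(node, int):              # leaf node: word representation
        h, c = leaf_states[node]
    else:                                  # non-leaf node: phrase representation
        h_l, c_l = compose_tree(node[0], leaf_states, cell, states)
        h_r, c_r = compose_tree(node[1], leaf_states, cell, states)
        h, c = cell(h_l, c_l, h_r, c_r)    # S-LSTM composition (Eqs. (1)-(6))
    states.append(h)
    return h, c
```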

3.2 Cross Attention

Cross attention is utilized to capture the phrase-level relevance between the two sentences. Given the two syntactically composed representations \({h}_p\) and \({h}_q\) for P and Q, we first compute unnormalized attention weights A for every pair of nodes between P and Q with the biaffine attention function [6] as follows:

$$\begin{aligned} {A}_{ij}={{h}_{p_i}}^T {W} {h}_{q_j} + \langle {U}_l,{h}_{p_i} \rangle + \langle {U}_r,{h}_{q_j} \rangle \end{aligned}$$
(7)

where \({W}\in \mathbb {R}^{h \times h}\), \({U}_l \in \mathbb {R}^{h}\), \({U}_r \in \mathbb {R}^{h}\) are learnable parameters, and \(\langle \cdot ,\cdot \rangle \) denotes the inner product. \({p}_i\) and \({q}_j\) are the i-th and j-th node in P and Q, respectively. Next, the relevant semantic information in the other sentence for nodes \(p_i\) and \(q_j\) is extracted as follows:

$$\begin{aligned}&\widetilde{h}_{p_i}=\sum _{j=1}^{2n-1} \frac{exp(A_{ij})}{\sum _{k=1}^{2n-1} exp(A_{ik})} h_{q_j} \end{aligned}$$
(8)
$$\begin{aligned}&\widetilde{h}_{q_j}=\sum _{i=1}^{2m-1} \frac{exp(A_{ij})}{\sum _{k=1}^{2m-1} exp(A_{kj})}h_{p_i} \end{aligned}$$
(9)

Intuitively, the interaction representation \(\widetilde{h}_{p_i}\) is a weighted sum of \(\{{h}_{q_j}\}_{j=1}^{2n-1}\) softly aligned to \({h}_{p_i}\): the more related \({h}_{q_j}\) is to \({h}_{p_i}\), the more its semantics contributes.
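In matrix form, with the node states of each sentence stacked row-wise, Eqs. (7)–(9) can be sketched as follows (the parameter initialization is our own choice):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiaffineCrossAttention(nn.Module):
    """Sketch of the biaffine scores (Eq. (7)) and soft alignments (Eqs. (8)-(9))."""
    def __init__(self, dim):
        super().__init__()
        self.W   = nn.Parameter(torch.randn(dim, dim) * 0.01)  # bilinear term
        self.U_l = nn.Parameter(torch.zeros(dim))
        self.U_r = nn.Parameter(torch.zeros(dim))

    def forward(self, h_p, h_q):
        # h_p: (2m-1, d) node states of P; h_q: (2n-1, d) node states of Q
        A = h_p @ self.W @ h_q.t()                              # h_p^T W h_q
        A = A + (h_p @ self.U_l).unsqueeze(1) + (h_q @ self.U_r).unsqueeze(0)
        h_p_tilde = F.softmax(A, dim=1) @ h_q                   # Eq. (8)
        h_q_tilde = F.softmax(A, dim=0).t() @ h_p               # Eq. (9)
        return h_p_tilde, h_q_tilde
```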

To further enrich the interaction, we apply a local comparison function with a ReLU layer [10].

$$\begin{aligned}&\overline{h}_{p_i}=[h_{p_i}; \widetilde{h}_{p_i}; |h_{p_i}-\widetilde{h}_{p_i}|; h_{p_i}\odot \widetilde{h}_{p_i}] \end{aligned}$$
(10)
$$\begin{aligned}&{h}_{p_i}^c = \mathrm{ReLU} (W_p \overline{h}_{p_i}+b_p) \end{aligned}$$
(11)
$$\begin{aligned}&\overline{h}_{q_j}=[h_{q_j}; \widetilde{h}_{q_j}; |h_{q_j}-\widetilde{h}_{q_j}|; h_{q_j}\odot \widetilde{h}_{q_j}] \end{aligned}$$
(12)
$$\begin{aligned}&{h}_{q_j}^c = \mathrm{ReLU} (W_q \overline{h}_{q_j}+b_q) \end{aligned}$$
(13)

where W and b are learnable parameters. This operation helps the model further fuse the matching information, and also reduces the dimension of the vector representations to lower model complexity.
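A minimal sketch of this local comparison (Eqs. (10)–(13)), written as one helper shared by both sentences, could be:

```python
import torch

def compare(h, h_tilde, proj):
    """Eqs. (10)-(13): concatenate comparison features, then project with ReLU.

    `proj` is an nn.Linear mapping the 4*d comparison features back to d dimensions.
    """
    feats = torch.cat([h, h_tilde, (h - h_tilde).abs(), h * h_tilde], dim=-1)
    return torch.relu(proj(feats))
```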

After that, nodes \({p_i}\) and \(q_j\) in P and Q are newly represented by \({h}_{p_i}^c\) and \({h}_{q_j}^c\), respectively.

3.3 Self-attention

We introduce a self-attention layer after the cross attention. It captures context from the syntactic tree within each sentence and enhances the node semantic representations.

For sentence P, we first compute self-attention weights S, analogously to Eq. (7) but with a simple inner product.

$$\begin{aligned} {S}_{ij}=\langle {h}_{p_i}^c, {h}_{p_j}^c \rangle \end{aligned}$$
(14)

where \({S}_{ij}\) indicates the relevance between the i-th and j-th nodes in P. Then, we compute the self-attention vector for each node in P as follows:

$$\begin{aligned} \widetilde{h}_{p_i}^c=\sum _{j=1}^{2m-1} \frac{exp({S}_{ij})}{\sum _{k=1}^{2m-1} exp({S}_{ik})} {h}_{p_j}^c \end{aligned}$$
(15)

Intuitively, \(\widetilde{{h}}_{p_i}^c\) augments each node representation with global syntactic context from P and, since \({h}_{p_i}^c\) already encodes the cross attention, indirectly from Q as well.
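In matrix form, Eqs. (14)–(15) amount to a dot-product self-attention over the 2m-1 node vectors of a sentence, sketched below.

```python
import torch
import torch.nn.functional as F

def self_attend(h_c):
    """Eqs. (14)-(15): dot-product self-attention over the node representations.

    `h_c` has shape (2m-1, d); the result has the same shape.
    """
    S = h_c @ h_c.t()                  # Eq. (14): pairwise inner products
    return F.softmax(S, dim=-1) @ h_c  # Eq. (15): attention-weighted context
```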

Similarly, we compute the self-attention vector \(\widetilde{h}_{q_j}^c\) for each node \(q_j\) in Q. Then the comparison function of Eqs. (10)–(13) is applied to (\({h}_{p_i}^c\), \(\widetilde{h}_{p_i}^c\)) and (\({h}_{q_j}^c\), \(\widetilde{h}_{q_j}^c\)) to obtain the enhanced representations \({h}_{p_i}^s\) and \({h}_{q_j}^s\).

Finally, we fuse the cross-attention and self-attention information as follows:

$$\begin{aligned}&\widehat{h}_{p_i} = {h}_{p_i}^c+{h}_{p_i}^s \end{aligned}$$
(16)
$$\begin{aligned}&\widehat{h}_{q_j} = {h}_{q_j}^c+{h}_{q_j}^s \end{aligned}$$
(17)

The representations \(\widehat{h}_{p_i}\) and \(\widehat{h}_{q_j}\) are learned from cross attention between the two syntactically composed trees and then augmented by self-attention. We then pass them to the prediction layer.
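Putting the sketches above together, the node-level flow of Sects. 3.2–3.3 can be summarized as follows. Reusing one comparison projection per sentence across the cross-attention and self-attention stages is a simplification for brevity, not necessarily the model's actual parameterization.

```python
def encode_pair(h_p, h_q, cross_attn, proj_p, proj_q):
    """Illustrative end-to-end node encoding for one sentence pair."""
    h_p_t, h_q_t = cross_attn(h_p, h_q)                   # Eqs. (7)-(9)
    h_p_c = compare(h_p, h_p_t, proj_p)                   # Eqs. (10)-(11)
    h_q_c = compare(h_q, h_q_t, proj_q)                   # Eqs. (12)-(13)
    h_p_s = compare(h_p_c, self_attend(h_p_c), proj_p)    # Eqs. (14)-(15) + comparison
    h_q_s = compare(h_q_c, self_attend(h_q_c), proj_q)    # same for Q
    return h_p_c + h_p_s, h_q_c + h_q_s                   # Eqs. (16)-(17)
```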

3.4 Prediction Layer

Following Chen et al. [3], we perform mean and max pooling over the nodes of each sentence, and use a two-layer 1024-dimensional MLP with ReLU activation as the classifier.
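A rough sketch of this prediction layer is given below; the exact concatenation of pooled features is our assumption, since the text only states that the pooling follows Chen et al. [3].

```python
import torch
import torch.nn as nn

class PredictionLayer(nn.Module):
    """Mean/max pooling over node vectors plus a 2-layer 1024-d ReLU MLP."""
    def __init__(self, dim, num_classes=3):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * dim, 1024), nn.ReLU(),
            nn.Linear(1024, 1024), nn.ReLU(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, h_p_hat, h_q_hat):
        # h_p_hat: (2m-1, d) nodes of P, h_q_hat: (2n-1, d) nodes of Q
        v_p = torch.cat([h_p_hat.mean(0), h_p_hat.max(0).values], dim=-1)
        v_q = torch.cat([h_q_hat.mean(0), h_q_hat.max(0).values], dim=-1)
        return self.mlp(torch.cat([v_p, v_q], dim=-1))  # class logits
```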

For model training, the objective is to minimize the objective function \(\mathcal {J}(\varTheta )\):

$$\begin{aligned} \mathcal {J}(\varTheta ) \!=\! -\frac{1}{N}\sum _{i=1}^N \log P(y^{(i)}|p^{(i)},q^{(i)};\varTheta ) + \frac{1}{2}\lambda \Vert \varTheta \Vert _2^2 \end{aligned}$$
(18)

where \(\varTheta \) denotes all the learnable parameters, N is the number of instances in the training set, \((p^{(i)},q^{(i)})\) are the sentence pairs, and \(y^{(i)}\) denotes the annotated label for the i-th instance.
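A direct translation of Eq. (18) into code, computed per mini-batch rather than over the whole training set, might look like this:

```python
import torch
import torch.nn.functional as F

def training_loss(logits, labels, parameters, l2=6e-5):
    """Eq. (18): averaged negative log-likelihood plus L2 regularization."""
    nll = F.cross_entropy(logits, labels)                     # -(1/N) sum log P(y|p,q)
    reg = 0.5 * l2 * sum(p.pow(2).sum() for p in parameters)  # (lambda/2) * ||Theta||^2
    return nll + reg
```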

4 Experiments

4.1 Dataset

We evaluate our model on two datasets: the Stanford Natural Language Inference (SNLI) dataset [1] and the SciTail dataset [14]. The syntactic trees used in this paper are produced by the Stanford PCFG Parser 3.5.3 [16] and are provided with these datasets.

The detailed statistical information of the two datasets is shown in Table 1.

Table 1. Statistics of the SNLI and SciTail datasets. Avg.L refers to the average length of a pair of sentences.

4.2 Implementation Details

Following Tay et al. [24], we build word embeddings by concatenating a pre-trained word vector, a learnable word vector and a POS vector, and then apply a ReLU layer to the concatenated vector. We set the word embeddings, the hidden states of the S-LSTM and the ReLU layer to 300 dimensions. The pre-trained word vectors are 300-dimensional GloVe 840B vectors [21] and are fixed during training. The learnable word vectors and POS vectors have 30 dimensions. The batch size is set to 64 for SNLI and 32 for SciTail. We use Adam [15] for training, with an initial learning rate of 5e−4 and an \(l_2\) regularizer strength of 6e−5. For the ensemble model, we average the probability distributions of three single models, as in Duan et al. [7].
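For reference, these settings correspond roughly to the configuration below; the placeholder module merely stands in for the full network of Sect. 3.

```python
import torch
import torch.nn as nn

model = nn.Linear(300, 3)   # placeholder for the full model (hidden size 300, 3 classes)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)  # initial learning rate 5e-4
batch_size = 64             # 64 for SNLI, 32 for SciTail
l2_strength = 6e-5          # applied through the regularization term of Eq. (18)
```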

Table 2. Comparison results on SNLI dataset.

4.3 Comparison Results on SNLI

The comparison results on the SNLI dataset are shown in Table 2.

The first group are sequential models that adopt attention for word-level matching. SAN [13] is a distance-based self-attention network. BiMPM [25] designs a bilateral multi-perspective matching model that matches the sentences in both directions. ESIM [3] incorporates the chain LSTM and tree LSTM. CAFE [24] uses novel factorization layers to compress alignment vectors into scalar-valued features. DR-BiLSTM [9] processes the hypothesis conditioned on the premise. DIIN [11] hierarchically extracts semantic features using CNNs. AF-DMN [7] adopts an attention-fused deep matching network.

The second group are tree-structured models. SPINN [2] uses a Tree-LSTM over constituency parse trees, without attention. NTI [19] and syn Tree-LSTM [3] adopt attention for node matching; NTI uses full binary trees, while syn Tree-LSTM uses constituency parse trees. Compared to Chen et al. [3], we use the same parse trees but a different tree composition function and attention mechanism.

As shown in Table 2, our single and ensemble models achieve 88.8% and 89.5% test accuracy, respectively. The comparison shows that our model outperforms not only the existing tree-structured models but also the state of the art achieved by sequential models on the SNLI dataset.

Table 3. Comparison results on SciTail dataset.
Table 4. Ablation study on SNLI dev and test sets.

4.4 Comparison Results on SciTail

The comparison results on the SciTail dataset are shown in Table 3. SciTail is known to be a more difficult dataset for NLI. The first five models in Table 3 are all implemented in Khot et al. [14]. DGEM is a graph-based attention model using syntactic structures. CAFE [24] adopts an LSTM and attention for word-level matching. DEISTE [26] proposes deep explorations of inter-sentence interaction.

On this dataset, our single model significantly outperforms these previous models, and achieves 85.8% test accuracy.

4.5 Ablation Study

We conduct an ablation study to examine the effect of each key component of our model. As shown in Table 4, the first row is the model that uses the root node representation as the sentence representation, without attention. Adding cross attention and self-attention further improves performance, which confirms the effectiveness of our tree-structured composition and matching model.

Fig. 3. The syntactic trees and the cross-attention result for sentences P and Q.

4.6 Investigation on Attention

In this section, we investigate what information is captured by the attention and visualize the cross-attention results, as shown in Fig. 3. This is an instance from the SciTail test set: {P: all living cells have a plasma membrane that encloses their contents. Q: all types of cells are enclosed by a membrane. Label y: entailment.}. The results show that our syntax-based model semantically aligns word-level expressions (node 13 “encloses” and node 9 “enclosed”) and phrase-level expressions (node 5 “all living cells” and node 7 “all types of cells”) in P and Q, respectively. We also observe that the attention on phrase expressions is more pronounced than on the single words that compose them, such as node 17 in P and node 16 in Q. An intuitive explanation is that the syntax-based model can capture richer semantics through tree-structured composition. Finally, the syntax-based model attends over higher-level tree nodes with rich semantics when considering longer phrases or the full sentence; for example, the larger sub-trees 20, 22 and 23 in P are aligned to the root node 19, which represents the whole semantics of Q. This also indicates that the composition function can effectively compute phrase representations.

Table 5. Some complex examples and their classification. E indicates entailment and N indicates neutral between P and Q.

4.7 Case Study

We show some examples from the SciTail test set in Table 5, comparing the proposed syntax-based model with a sequential model; as the sequential model, we use the representative CAFE model [24]. We compute the 1-gram Bleu score of P with Q as the reference. The Bleu score assumes that the more words two sentences share, the closer their semantics are.
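The 1-gram Bleu score used here is essentially a clipped unigram precision of P against Q; a minimal sketch, omitting the brevity penalty, is:

```python
from collections import Counter

def bleu_1(premise_tokens, hypothesis_tokens):
    """Clipped unigram precision of the premise against the hypothesis (reference)."""
    p_counts, q_counts = Counter(premise_tokens), Counter(hypothesis_tokens)
    overlap = sum(min(count, q_counts[word]) for word, count in p_counts.items())
    return overlap / max(1, sum(p_counts.values()))
```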

Examples A and B are entailment cases, but each has a low Bleu score, which makes the entailment relation harder to recognize. Our syntax-based approach correctly favors entailment in these cases. This indicates that low lexical similarity makes it difficult for the sequential model to extract the related semantics, but that this can be alleviated by introducing syntactic structures.

The second set of examples, C and D, are neutral cases. Each has a high Bleu score, and the sequential model tends to misidentify the relation as entailment, whereas our syntax-based model correctly recognizes the neutral relation. This indicates that syntactic structure is better suited to semantic understanding involving structurally complex expressions.

Finally, examples E and F are cases that both the sequential and the syntactic model get wrong. Example E is an entailment case, but it has a low Bleu score; moreover, the word orders and structures of the two sentences (“compose” and “is composed of”) are quite different, which causes the models to fail to recognize the entailment relation. Example F is a neutral case in which the two sentences have high lexical overlap and similar word orders, which misleads the models into predicting entailment. In such difficult cases, the sentence semantics suffer from issues such as polysemy, ambiguity and vagueness, and the model may need more inference information, such as external knowledge, to better understand the lexical and phrasal semantics and make the correct decision.

5 Conclusions and Future Work

In this paper, we explore the potential of syntactic trees for semantic computation and present a syntax-aware attention model for NLI. It leverages tree-structured composition and phrase-level matching. Experimental results on the SNLI and SciTail datasets show that our model significantly improves performance, and that syntactic structure is important for modeling complex semantic relationships. In the future, we will explore combining syntax with pre-trained language models to further improve performance.