Sentence ordering aims at restoring the order of shuffled sentences in a paragraph. Previous methods usually predict orders in a single direction, i.e., from head to tail. However, unidirectional prediction inevitably causes error accumulation, which restricts performance. In this article, we propose a bidirectional ordering method, which predicts orders in both the head-to-tail and tail-to-head directions at the same time. In our bidirectional ordering method, the two directions interact with each other and help alleviate the error accumulation problem. Experiments demonstrate that our method can effectively improve the performance of previous models.
1 Introduction
Sentence ordering [3] is an important task in natural language processing (NLP): it automatically organizes shuffled sentences in a paragraph into the correct order. Besides supporting better reading and understanding of text, sentence ordering is useful for many other NLP tasks, such as concept-to-text generation [16, 17, 18], retrieval-based question answering [36, 41], answer summarization [9], and extractive multi-document summarization [2, 10, 24, 26, 28, 38, 43].
Recently, with the development of deep learning, neural models have been proposed for sentence ordering and have achieved significant improvements [7, 8, 13, 23, 27, 29, 40, 45]. Typically, these models employ a pointer network [37] as the decoder and predict orders in a single direction, i.e., from head to tail. However, unidirectional prediction inevitably causes error accumulation, which makes it difficult to correctly predict orders at farther timesteps. Figure 1 shows results predicted in a single direction by a typical model [7]. We can see that it performs well at earlier timesteps and poorly at farther ones. Specifically, the accuracy gradually decreases as the timestep increases. Moreover, when the number of shuffled sentences is larger, the accuracy is worse. These statistics demonstrate that the error accumulation problem does exist and restricts performance. Besides, in both directions the accuracy is significantly higher at earlier timesteps. This is because both the beginning and the ending of a paragraph have obvious features to identify. For example, in the paper abstract dataset [23], the first sentence usually contains “in this paper ...” to describe the contribution, and the last sentence usually contains “experiments demonstrate ...” to report empirical results. Therefore, simultaneously using head-to-tail and tail-to-head prediction may take advantage of these features and alleviate the error accumulation problem.
Fig. 1. The head-to-tail and tail-to-head curves show the accuracy at different positions when decoding in the head-to-tail and tail-to-head directions, respectively. The “size” curve shows the accuracy for different numbers of shuffled sentences (paragraph lengths). The number of samples causes some fluctuation.
In this article, we propose a bidirectional ordering method for sentence ordering. Compared with previous unidirectional methods, our method predicts orders in both the head-to-tail and tail-to-head directions at the same time. Specifically, we bridge the two directions through their decoding history. At every timestep, the decoding history of both the head-to-tail and tail-to-head directions is stored. When predicting, the two directions interact with each other through this decoding history. With this interaction, we can simultaneously utilize information from both directions and alleviate the error accumulation problem. Our method is compatible with and easy to apply to other sentence ordering models. We conduct experiments on four datasets. The results demonstrate that our method can effectively alleviate the error accumulation problem and improve the performance of previous models.
In brief, our main contributions are as follows:
–
We identify (1) the error accumulation problem in sentence ordering, (2) that the forward model does well in predicting the head, and (3) that the backward model does well in predicting the tail. Thus, we propose to make use of backward prediction to enhance forward prediction (and vice versa).
–
We propose a bidirectional ordering method for sentence ordering, which fuses the forward and backward predictions to alleviate the error accumulation problem and reduce the difficulty of finding the correct order.
–
We verify the effectiveness of the bidirectional ordering method for sentence ordering; the experimental results demonstrate that our proposed method improves previous models.
2 Preliminary
2.1 Task Definition
The sentence ordering task aims at ordering a set of shuffled sentences in a paragraph into a coherent text. An example is shown in Figure 2: there are three shuffled sentences \(s_1\), \(s_2\), and \(s_3\), and the ordering model needs to recover the correct order \(s_2\)\(\rightarrow\)\(s_3\)\(\rightarrow\)\(s_1\).
Fig. 2. An example of the sentence ordering task.
Formally, L shuffled sentences are denoted by \({\bf x}\) = [\(s_{o_1}\), \(s_{o_2}\), \(\ldots\), \(s_{o_L}\)], where \({\bf o}\) = [\(o_1\), \(o_2\), \(\ldots\), \(o_L\)] is the shuffled order list. The goal is to find the correct order \({\bf o}^*\) = [\(o_1^*\), \(o_2^*\), \(\ldots\), \(o_L^*\)] from the order space of size \(L!\). With the correct order \({\bf o}^*\), the whole paragraph has the highest coherence probability:
\[
{\bf o}^* = \mathop{\arg\max}_{{\bf o} \in \mathbf{\psi}} P({\bf o} \mid {\bf x}),
\]
where \({\bf o}\) indicates any order of the input sentences, and \(\mathbf {\psi }\) indicates the set of all possible orders of these sentences.
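To make this search space concrete, the following minimal Python sketch enumerates all \(L!\) candidate orders and returns the one with the highest coherence probability. The scoring function `coherence_prob` is a hypothetical stand-in for a trained model, and the exhaustive search is only feasible for small L.

```python
from itertools import permutations

def order_by_exhaustive_search(sentences, coherence_prob):
    """Pick the order with the highest coherence probability.

    `sentences` is the list of shuffled sentences; `coherence_prob` is a
    stand-in for a trained model that scores one candidate order.
    Only feasible for small L, since there are L! candidate orders.
    """
    best_order, best_score = None, float("-inf")
    for candidate in permutations(range(len(sentences))):
        score = coherence_prob([sentences[i] for i in candidate])
        if score > best_score:
            best_order, best_score = candidate, score
    return best_order
```

In practice, neural models avoid this enumeration by decoding the order one sentence at a time, as described next.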
2.2 Framework
For sentence ordering, the input is the shuffled sentences of a paragraph and the output is their correct order. Recently, the encoder-decoder framework shown in Figure 3 has obtained state-of-the-art results. In this article, we adopt this framework.
Fig. 3. The encoder-decoder framework with a pointer network for sentence ordering.
Sentence Encoder: First, a sentence encoder is used to obtain a sentence-level representation of each single sentence:
where \({\bf s}_{o_i}\) denotes the representation of the sentence \(s_{o_i}\) with T words. The embedding of every word is obtained from a shared word embedding matrix \({\bf W}_e\)\(\in\)\(\mathbb {R}^{n_e \times d_e}\), where \(n_e\) denotes the vocabulary size and \(d_e\) denotes the embedding size. The word embeddings are then fed into the sentence encoder \({\rm {SentEnc}}()\) to obtain the sentence-level representation. Typically, previous models adopt a bidirectional LSTM [14] as \({\rm {SentEnc}}()\) and regard the output of the last hidden state as the sentence representation.
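For reference, here is a minimal PyTorch sketch of such a sentence encoder: a shared embedding matrix followed by a bidirectional LSTM whose final forward and backward hidden states are concatenated into the sentence representation. The class name and hyperparameters are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

class SentenceEncoder(nn.Module):
    """BiLSTM sentence encoder: token ids -> one vector per sentence."""

    def __init__(self, vocab_size, emb_size, hidden_size):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_size)   # shared matrix W_e
        self.bilstm = nn.LSTM(emb_size, hidden_size,
                              batch_first=True, bidirectional=True)

    def forward(self, token_ids):
        # token_ids: (num_sentences, T) padded word indices
        emb = self.embedding(token_ids)            # (num_sentences, T, emb_size)
        _, (h_n, _) = self.bilstm(emb)             # h_n: (2, num_sentences, hidden_size)
        # Concatenate the final forward and backward hidden states.
        return torch.cat([h_n[0], h_n[1]], dim=-1) # (num_sentences, 2 * hidden_size)
```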
Paragraph Encoder: Then, a paragraph encoder is used to obtain a paragraph-level representation of these sentences:
where the paragraph encoder \({\rm {ParaEnc}}()\) encodes all sentences and obtains the paragraph-level representation \({\bf v}_{para}\), which captures global dependencies among sentences and helps the decoder predict orders. In recent studies, \({\rm {ParaEnc}}()\) is made up of multiple self-attention layers [32, 34, 35], and \({\bf v}_{para}\) is obtained by average pooling over all sentences at the last self-attention layer.
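A minimal sketch of this paragraph encoder, assuming a standard Transformer-style self-attention stack with average pooling over the last layer; the layer count, head count, and other details are assumptions for illustration.

```python
import torch.nn as nn

class ParagraphEncoder(nn.Module):
    """Self-attention paragraph encoder: sentence vectors -> v_para."""

    def __init__(self, d_model, num_layers=2, num_heads=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=num_heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, sentence_reprs):
        # sentence_reprs: (batch, L, d_model), one row per shuffled sentence
        contextual = self.encoder(sentence_reprs)   # (batch, L, d_model)
        # Average pooling over the sentences at the last layer gives v_para.
        return contextual.mean(dim=1)               # (batch, d_model)
```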
Decoder: Finally, a pointer network [37] is employed as the decoder to predict the order of input sentences:
where \(\overrightarrow{}\) denotes the head-to-tail direction and \(\overrightarrow{{\rm {Func}}}()\) is a function for recurrent prediction; both self-attention and LSTM can be employed as \(\overrightarrow{{\rm {Func}}}()\). \(\overrightarrow{o}_{i-1}\) is the order predicted at the last timestep, and its sentence is fed in to predict the next order. The initial state \(\overrightarrow{{\bf h}}_0\)\(\in\)\(\mathbb {R}^{d}\) is the paragraph-level representation \({\bf v}_{para}\), and the first input \({\bf s}_{\overrightarrow{o}_0}\) is a zero vector. \({\bf g}\)\(\in\)\(\mathbb {R}^d\), \({\bf W}_1\)\(\in\)\(\mathbb {R}^{d \times d}\), and \({\bf W}_2\)\(\in\)\(\mathbb {R}^{d \times d}\) are learnable parameters. \(\overrightarrow{{\bf h}}_i\) is used to output the conditional probability of the order, \(P(\overrightarrow{o}_i|\overrightarrow{o}_{i-1},\ldots ,\overrightarrow{o}_1,{\bf s})\), with these parameters.
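The pointer-network step can be sketched as follows, assuming an LSTM cell as \(\overrightarrow{{\rm {Func}}}()\) and the additive attention scoring with \({\bf g}\), \({\bf W}_1\), and \({\bf W}_2\) described above; a mask over already-selected sentences is included since each sentence may be chosen only once.

```python
import torch
import torch.nn as nn

class PointerDecoderStep(nn.Module):
    """One head-to-tail decoding step of a pointer network (a sketch)."""

    def __init__(self, d):
        super().__init__()
        self.cell = nn.LSTMCell(d, d)          # Func(): recurrent update
        self.W1 = nn.Linear(d, d, bias=False)
        self.W2 = nn.Linear(d, d, bias=False)
        self.g = nn.Linear(d, 1, bias=False)

    def forward(self, prev_sentence, state, sentence_reprs, selected_mask):
        # prev_sentence: (batch, d) sentence chosen at the previous timestep
        # state: (h, c) of the LSTM cell; h_0 is initialized with v_para
        # sentence_reprs: (batch, L, d); selected_mask: (batch, L) bool, True if chosen
        h, c = self.cell(prev_sentence, state)
        # Additive attention: g^T tanh(W1 s_j + W2 h_i) for every candidate j.
        scores = self.g(torch.tanh(self.W1(sentence_reprs)
                                   + self.W2(h).unsqueeze(1))).squeeze(-1)  # (batch, L)
        scores = scores.masked_fill(selected_mask, float("-inf"))
        probs = torch.softmax(scores, dim=-1)   # P(o_i | o_<i, s)
        return probs, (h, c)
```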
3 The Proposed Method
3.1 Bidirectional Ordering with Interactive Decoding
Previous unidirectional models inevitably cause error accumulation. Therefore, we propose a bidirectional ordering method, which predicts orders in both the head-to-tail and tail-to-head directions at the same time. When decoding orders, the two directions interact with each other through their decoding history, which reduces the difficulty of ordering. The architecture of the method is illustrated in Figure 4. The study in [42] also proposes a bidirectional decoder for machine translation; however, our method has some obvious differences. In sentence ordering, the outputs to predict are a fixed set of sentences, each of which must be predicted once and only once, so our model is not generative. Moreover, our model exploits the symmetrical characteristic of sentence ordering.
Fig. 4. Architecture of our bidirectional ordering method. The head-to-tail and tail-to-head directions interact with each other through the memory of decoding history.
In bidirectional ordering, we also adopt the encoder-decoder framework of Section 2.2. The difference is that there are two synchronous decoders, one per direction, that interact with each other. Specifically, Equation (5) in Section 2.2 is changed as follows:
where i denotes the timestep, \(\overrightarrow{{\bf h}}_i\) denotes the hidden state of head-to-tail decoding, and \(\overleftarrow{{\bf m}}_i\) denotes the memory of decoding history from the reverse tail-to-head direction. Through \(\overleftarrow{{\bf m}}_i\), the connection between the two directions is bridged. \(\overrightarrow{\lambda }_i\) is the weight that decides which direction is more reliable at the ith timestep. We expect the reverse direction to receive a higher weight at farther timesteps, which alleviates the error accumulation problem. \({\bf W}_{\lambda }\)\(\in\)\(\mathbb {R}^{2d}\) and \(b_{\lambda }\) are learnable parameters, and [;] denotes concatenation.
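The following sketch illustrates this gated fusion. Since the modified equation itself is not reproduced above, the exact way \(\overrightarrow{\lambda }_i\) combines the two states is an assumption: the sketch uses a convex combination of the forward hidden state and the reverse-direction memory, with a scalar gate computed from their concatenation as described.

```python
import torch
import torch.nn as nn

class DirectionFusion(nn.Module):
    """Fuse the forward hidden state with the reverse-direction memory."""

    def __init__(self, d):
        super().__init__()
        self.W_lambda = nn.Linear(2 * d, 1)   # W_lambda and b_lambda

    def forward(self, h_fwd, m_bwd):
        # h_fwd: (batch, d) head-to-tail hidden state at timestep i
        # m_bwd: (batch, d) memory of the tail-to-head decoding history
        lam = torch.sigmoid(self.W_lambda(torch.cat([h_fwd, m_bwd], dim=-1)))  # (batch, 1)
        # lam decides which direction is more reliable at this timestep.
        return lam * h_fwd + (1.0 - lam) * m_bwd
```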
Similarly, the tail-to-head order is synchronously predicted by the same framework and can also be boosted by the memory of decoding history from its reverse head-to-tail direction:
where the head-to-tail and tail-to-head directions share the same timestep i, and \(\overrightarrow{{\bf m}}_i\) denotes the memory of decoding history from the head-to-tail direction.
\(\overrightarrow{{\bf m}}_i\) and \(\overleftarrow{{\bf m}}_i\) are the memories of decoding history from the two directions, and they play an important role in bidirectional ordering. They contain the information of their respective directions and pass it to the other direction. With \(\overrightarrow{{\bf m}}_i\) and \(\overleftarrow{{\bf m}}_i\), the head-to-tail and tail-to-head directions interact with each other at the same time, and each reduces its difficulty of ordering by using the decoding history of the other. Formally, \(\overrightarrow{{\bf m}}_i\) is obtained from the decoding history as follows:
where \(\overrightarrow{{\bf m}}_i\) is an attention-weighted combination [1] of the hidden states in the head-to-tail direction, which contain the information of the decoding history. At the first timestep, \(\overrightarrow{{\bf m}}_i\) is a zero vector. \(\overrightarrow{\alpha }_{ij}\) denotes the attention weight; intuitively, decoding history at different timesteps has different importance. \({\bf W}_4\)\(\in\)\(\mathbb {R}^{d \times d}\) and \({\bf W}_5\)\(\in\)\(\mathbb {R}^{d \times d}\) are learnable parameters. \({\bf t}\)\(\in\)\(\mathbb {R}^{d}\) denotes the position embedding, whose subscript denotes the position index. L denotes the number of timesteps (the total number of shuffled sentences).
Similarly, the tail-to-head prediction is formulated as
where the two directions share the same position embedding \({\bf t}\). To obtain the final order, we use the predictions of both the head-to-tail and tail-to-head directions, as described in Section 3.3.
In particular, there is a symmetrical relation between the head-to-tail and tail-to-head directions. For example, with 10 sentences, the head-to-tail prediction at the 4th timestep corresponds to the tail-to-head prediction at the 7th (10 \(-\) 4 + 1) timestep. Therefore, in the reverse direction, we use the symmetrical timestep (\(L-i+1\)) to index \({\bf t}\). With this position embedding, we expect to capture information around the symmetrical positions of the reverse direction and reduce the difficulty of sentence ordering, especially at farther timesteps.
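The following sketch illustrates one way such a memory could be computed: attention over the other direction's stored hidden states, with the position embedding of the symmetric timestep \(L-i+1\) acting as the query. The query choice and the exact roles of \({\bf W}_4\), \({\bf W}_5\), and \(\overrightarrow{\alpha }_{ij}\) are assumptions made for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class HistoryMemory(nn.Module):
    """Attention over the other direction's decoding history (hedged sketch)."""

    def __init__(self, d, max_len):
        super().__init__()
        self.W4 = nn.Linear(d, d, bias=False)
        self.W5 = nn.Linear(d, d, bias=False)
        self.v = nn.Linear(d, 1, bias=False)
        self.pos_emb = nn.Embedding(max_len + 1, d)   # shared position embedding t

    def forward(self, history, i, L):
        # history: (batch, k, d) hidden states decoded so far by the other direction
        if history.size(1) == 0:
            # At the first timestep the memory is a zero vector.
            return history.new_zeros(history.size(0), self.W4.in_features)
        # Assumption: index t at the symmetric timestep L - i + 1 as the query.
        query = self.pos_emb(torch.tensor([L - i + 1], device=history.device))      # (1, d)
        # Additive attention weights alpha over the stored hidden states.
        scores = self.v(torch.tanh(self.W4(history) + self.W5(query))).squeeze(-1)  # (batch, k)
        alpha = torch.softmax(scores, dim=-1)
        return (alpha.unsqueeze(-1) * history).sum(dim=1)                            # (batch, d)
```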
3.2 Training
We train the model with both the head-to-tail direction and the tail-to-head direction together by minimizing the following loss function:
where N denotes the number of shuffled paragraphs (the batch size) and \(L_j\) denotes the number of sentences in the jth paragraph. \(P(\overrightarrow{o}_i^*|\overrightarrow{o}_{i-1}^*,\ldots ,\overrightarrow{o}_{1}^*,{\bf x})\) denotes the probability of the correct head-to-tail order at the ith timestep. Similarly, \(P(\overleftarrow{o}_i^*|\overleftarrow{o}_{i-1}^*,\ldots ,\overleftarrow{o}_{1}^*,{\bf x})\) denotes the probability of the correct tail-to-head order at the ith timestep.
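A minimal sketch of this bidirectional objective, assuming each decoder returns the per-timestep log-probabilities of the gold sentences; whether the loss is normalized by N alone or also by \(L_j\) is not shown above, so this sketch averages over paragraphs only.

```python
def bidirectional_ordering_loss(fwd_logprobs, bwd_logprobs):
    """Negative log-likelihood summed over both decoding directions.

    fwd_logprobs / bwd_logprobs: lists with one tensor per paragraph, each of
    shape (L_j,), holding log P of the correct sentence at every timestep in
    the head-to-tail and tail-to-head direction, respectively.
    """
    n = len(fwd_logprobs)
    total = sum(f.sum() + b.sum() for f, b in zip(fwd_logprobs, bwd_logprobs))
    return -total / n
```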
3.3 Inference
Following previous methods, the coherence probability of the final output sentence order \({\bf o}\) is formalized as
where \(P({\bf o}|{\bf x})\) denotes the probability of the final output order given the input shuffled paragraph \({\bf x}\). \(\overrightarrow{}\) denotes the head-to-tail direction and \(\overleftarrow{}\) denotes the tail-to-head direction.
For bidirectional inference, we design a special beam search (Algorithm 1) in which each candidate prediction in one direction is incorporated into every path of the beam search in the other direction. In particular, each candidate prediction in the backward direction is incorporated into every path of the forward beam search at each timestep (and vice versa). Compared with standard beam search, the function \({\rm {BeamSearch()}}\) takes an extra input from the other direction, which means that we select the top K paths from \(K^2\) candidates at each timestep. Finally, we select the order with the highest probability between the head-to-tail and tail-to-head predictions.
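The following simplified Python sketch shows a standard beam search for sentence ordering and the final selection between the two directions. It is not the paper's Algorithm 1: the cross-direction interaction (choosing the top K paths from the \(K^2\) joint candidates) is only indicated in a comment, and `step_logprob` is a hypothetical scoring callback.

```python
def beam_search_order(num_sentences, step_logprob, beam_size):
    """Simplified single-direction beam search for sentence ordering.

    step_logprob(prefix, candidate) returns the log-probability of appending
    `candidate` to the partial order `prefix`. In BOID's Algorithm 1 this
    score is additionally conditioned on the other direction's candidate
    paths, so the top K paths are chosen from K^2 joint candidates; that
    interaction is omitted here for brevity.
    """
    beams = [([], 0.0)]                        # (partial order, log-probability)
    for _ in range(num_sentences):
        expanded = []
        for prefix, score in beams:
            for cand in range(num_sentences):
                if cand in prefix:             # each sentence is used exactly once
                    continue
                expanded.append((prefix + [cand],
                                 score + step_logprob(prefix, cand)))
        expanded.sort(key=lambda path: path[1], reverse=True)
        beams = expanded[:beam_size]           # keep the top-K paths
    return beams[0]                            # best (order, log-probability)


def pick_final_order(fwd_order, fwd_score, bwd_order, bwd_score):
    """Select the direction with the higher probability; the tail-to-head
    prediction is reversed to yield a head-to-tail order."""
    return fwd_order if fwd_score >= bwd_score else list(reversed(bwd_order))
```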
4 Experiments
4.1 Data
Following previous studies, we adopt four datasets to conduct the experiments. The statistics are shown in Table 1. Specifically, the four datasets are as follows:
Table 1. Statistics of the Data Size, Average Number of Sentences in a Paragraph, and Vocabulary Size

          Train     Valid     Test      Len   Vocab
NIPS      2,248     409       402       6.0   16,721
AAN       8,569     962       2,626     4.9   34,485
arXiv     884,912   110,614   110,615   5.4   64,557
SIND      40,155    4,990     5,055     5.0   30,861
NIPS, AAN: They are made up of abstracts from NIPS papers and ACL papers, respectively [23].
arXiv: It consists of abstracts from papers on the arXiv website [6].
SIND: It contains photos and corresponding captions [15].
4.2 Evaluation Metrics
There are three evaluation metrics for sentence ordering.
Kendall’s tau (\(\mathbf {\tau }\)): It is one of the most popular metrics for the automatic evaluation of text coherence. It is formalized as \(\tau = 1 - 2 \times n_{inversion} / \binom{n}{2}\), where \(n_{inversion}\) denotes the number of sentence pairs whose relative order is incorrect in the predicted sequence and n denotes the length of the sequence. It ranges from \(-\)1 (the worst) to 1 (the best).
Accuracy (Acc): It measures the fraction of sentences whose absolute positions are correctly predicted. It ranges from 0 (the worst) to 1 (the best).
Perfect Match Ratio (PMR): It calculates the ratio of exactly matching orders, which is the most stringent measurement in this task. For a single paragraph, it can only be 0 (not an exact match) or 1 (an exact match).
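For clarity, here is a minimal Python sketch of the three metrics, assuming orders are given as lists of sentence indices.

```python
from itertools import combinations

def kendall_tau(pred_order, gold_order):
    """Kendall's tau = 1 - 2 * n_inversions / C(n, 2)."""
    n = len(gold_order)
    if n < 2:
        return 1.0
    position = {sent: i for i, sent in enumerate(pred_order)}
    inversions = sum(1 for a, b in combinations(gold_order, 2)
                     if position[a] > position[b])
    return 1.0 - 2.0 * inversions / (n * (n - 1) / 2)

def accuracy(pred_order, gold_order):
    """Fraction of sentences placed at the correct absolute position."""
    return sum(p == g for p, g in zip(pred_order, gold_order)) / len(gold_order)

def perfect_match(pred_order, gold_order):
    """1 if the predicted order exactly matches the gold order, else 0."""
    return int(list(pred_order) == list(gold_order))
```

For example, kendall_tau([2, 0, 1], [0, 1, 2]) returns \(-\)1/3, since two of the three sentence pairs are inverted.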
4.3 Comparisons
The methods to compare with are as follows:
(1)
Sentence-level Neural Models: Pairwise Model [6], Seq2Seq [22], and SIM [31]. These methods employ neural networks to model sentence representations and predict orders.
Bidirectional Ordering with Interactive Decoding: This is our proposed method, which we denote BOID. Paragraph-level models (such as ATTNet, TGCM, and BERSON) are recent sentence ordering models, and all of them use a pointer-network-style decoder. Thus, our method is compatible with and easy to apply to other sentence ordering models. Specifically, we take the pointer vectors, candidate vectors, and decoding history vectors from these pointer-network-based methods, and then add BOID into these models with the related vectors.
Models based on manual feature engineering perform significantly worse, so we do not show their results. Besides, our method is used to enhance previous unidirectional models, and we use their original settings in BOID. Moreover, all baseline models are evaluated with beam search.
4.4 Results
The experimental results are shown in Table 2. First, we can see that paragraph-level neural models perform better than sentence-level neural models (e.g., Pairwise Model and Seq2Seq). By employing a pointer network, paragraph-level models can model the coherence of a paragraph; therefore, they capture global dependencies among sentences and are better able to order them. Besides, the three recent models (ATTNet, TGCM, BERSON) perform better still, and BERSON performs best thanks to its pre-trained parameters.
Moreover, we combine the three best models (ATTNet, TGCM, BERSON) with our proposed method BOID, and all three show further improvements. This demonstrates that BOID can enhance previous unidirectional models and obtain better performance. Previous models predict orders in a single direction and ignore the information of the reverse direction, which inevitably causes error accumulation and makes it difficult to predict orders at farther timesteps. BOID predicts orders in both the head-to-tail and tail-to-head directions at the same time, and the predictions in the two directions interact with each other. Therefore, BOID can alleviate the error accumulation problem and improve performance.
Table 2. Sentence Ordering Results of Different Methods on Four Datasets
5 Discussion
In this discussion, we employ the typical model ATTNet [7] on the NIPS abstract dataset [23] to conduct experiments.
5.1 Effectiveness of Direct Hard Bidirectional Decoding
We employ hard bidirectional decoding to test whether the bidirectional ordering in BOID can be replaced by a simple bidirectional ordering algorithm. This simple method directly puts the predictions of the head-to-tail decoder and the tail-to-head decoder into the final order, where sentences already predicted by one decoder become hard masks for the other, preventing them from being predicted again. In this method, the head-to-tail decoder and the tail-to-head decoder do not share the same timestep, and the total number of timesteps of each decoder is cut in half. For example, from timestep 1 to 5, the prediction order is: [1 (head-to-tail), 5 (tail-to-head), 2 (head-to-tail), 4 (tail-to-head), 3 (head-to-tail)].
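The interleaved schedule of this hard baseline can be sketched as follows; the function is illustrative and simply reproduces the alternation described above.

```python
def hard_bidirectional_schedule(num_sentences):
    """Interleaved prediction schedule used by the hard baseline.

    The head-to-tail decoder fills positions from the front and the
    tail-to-head decoder fills positions from the back, alternating.
    """
    front, back = 1, num_sentences
    schedule = []
    take_front = True
    while front <= back:
        if take_front:
            schedule.append((front, "head-to-tail"))
            front += 1
        else:
            schedule.append((back, "tail-to-head"))
            back -= 1
        take_front = not take_front
    return schedule
```

For num_sentences = 5, this returns [(1, 'head-to-tail'), (5, 'tail-to-head'), (2, 'head-to-tail'), (4, 'tail-to-head'), (3, 'head-to-tail')], matching the example above.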
Table 3 shows the results: direct hard bidirectional ordering performs worse than our method. This shows that the soft constraint with vectors in BOID is better than the hard constraint with masks. Vectors can be smoothly and selectively merged into the model, whereas masks may themselves propagate errors, so the soft constraint of BOID is more reliable.
Table 3. Results of Direct Hard Bidirectional Ordering

                  \(\tau\)   PMR     Acc
base              56.09      29.87   0.72
hard bidirection  49.45      25.32   0.64
BOID              57.32      30.98   0.73
5.2 Performance at the Head and the Tail
We want to know whether BOID can improve the performance at the head and the tail of a paragraph. Moreover, the performance at the head and the tail reflects whether BOID, with bidirectional prediction, can reduce the difficulty of ordering at farther timesteps and alleviate the error accumulation problem.
Table 4 shows the results. BOID improves the performance at both the head and the tail. Previous unidirectional models predict the tail at the last timestep, where the error accumulation problem is most serious. BOID can employ the other direction to help alleviate this problem; therefore, the performance at both the head and the tail improves.
Table 4. Results at the Head and the Tail with/without BOID

       unidirectional   bidirectional
head   0.796            0.824
tail   0.687            0.725
5.3 Reliability of Reverse Direction
Here, we examine whether the reverse tail-to-head prediction is reliable. Table 5 shows the results for the head-to-tail direction and its reverse tail-to-head direction. We observe that the model also obtains comparable results when it trains and predicts in the reverse tail-to-head direction. This shows that the information of the tail-to-head direction is reliable and can be used to model the coherence of sentences. Meanwhile, the head-to-tail direction performs slightly better than the tail-to-head direction, which suggests that the head-to-tail direction is a little easier for modeling coherence; intuitively, the first sentence is easier to identify than the last one in most cases.
Table 5. Results of Different Directions

               \(\tau\)   PMR     Acc
head-to-tail   56.09      29.87   0.72
tail-to-head   55.73      29.43   0.71
bidirectional  57.32      30.98   0.73
5.4 Performance on Different Paragraph Lengths
We explore the performance on different paragraph lengths (numbers of sentences). The results are shown in Figure 5. The accuracy gradually decreases as paragraphs become longer, but BOID decreases more slowly and performs better. This is because BOID can synchronously utilize bidirectional information and alleviate the error accumulation problem, especially at farther timesteps.
Fig. 5. Accuracy curves for different paragraph lengths. The three curves belong to the single head-to-tail model, the single tail-to-head model, and BOID, respectively. BOID, with bidirectional prediction, performs better on longer paragraphs.
Although our method alleviates error propagation to a certain extent, the performance still drops significantly for longer paragraphs. A longer paragraph means more predictions to output and thus more possible errors, which is natural and inevitable. This problem limits our proposed model and remains difficult to handle.
5.5 Visualization for the Weight of Different Directions
In Section 3.1, we design the weights \(\overrightarrow{\lambda }_i\) and \(\overleftarrow{\lambda }_i\) to decide which direction is more reliable at the ith timestep. When predicting, we expect the reverse direction to receive a higher weight at farther timesteps, which alleviates the error accumulation problem. To show how the weight changes during ordering, we sample a real example from the head-to-tail decoder (\(\overrightarrow{\lambda }_i\)). The visualization is shown in Figure 6.
Fig. 6. Visualization of the weights of the two directions. The numbers and color depth denote the weight values. Our method adaptively adjusts the weights of the two directions as the timestep increases.
We can see that the weight of the reverse direction becomes higher as the timestep increases. It is intuitive that the reverse direction is more reliable at farther timesteps. The results demonstrate that the weight can adaptively change to alleviate the error accumulation problem. Besides, the weight fluctuates noticeably around the middle timestep. This is because we introduce position embeddings to capture the symmetrical positions of bidirectional ordering: after the timestep crosses the middle, a head-to-tail position corresponds to a symmetrical tail-to-head position, and thus the position embedding can effectively help adjust the weight.
6 Related Work
The methods for sentence ordering can be divided into three categories. The first is based on manual feature engineering, such as the Probabilistic Model [19], Content Model [4], Entity Grid [3], and Utility-Trained Model [33]; however, these methods rely heavily on expensive handcrafted features. The second is based on sentence-level neural models, which utilize neural networks to model sentence representations. For example, the Window Network [21] considers the coherence within a window of text, the Pairwise Ranking Model [6] orders sentences in a pairwise way, and Seq2Seq models [22] employ an end-to-end framework for sentence ordering; however, they cannot capture global dependencies among sentences in a paragraph. The third is based on paragraph-level neural models, which employ an LSTM [14] or self-attention [35] as the encoder to model the paragraph representation and then use a pointer network [37] as the decoder to predict orders, such as [7, 8, 13, 23, 27, 40]. However, these methods always order sentences in a single direction, which may aggravate the error accumulation problem, especially at farther timesteps. To this end, we design two interactive decoders to alleviate the problem.
Multiple decoders have proved effective in other tasks such as machine translation [39, 42] and dialogue generation [25]. However, our method has some obvious differences. First, the tasks in the above methods are generative, and their models cannot be directly used for sentence ordering. Second, there is an error accumulation problem in sentence ordering, and thus the reverse direction is expected to receive a higher weight at farther timesteps; to adaptively adjust this weight, our method models the reliability of the different decoders. In addition, candidates in sentence ordering must be predicted once and only once, so sentence ordering has a symmetrical characteristic; to make use of this characteristic, our method introduces a special position embedding.
The easy-first decoding method is also related. In NLP, easy-first decoding was first applied to transition-based parsing in [11]. Similar to curriculum learning [5], easy-first decoding outputs sequences in an easy-to-hard order, as a human would. Recently, non-autoregressive sequence generation models such as [20] have appeared, which are not limited to a fixed decoding direction. However, some studies [30] demonstrate that the difficulty of non-autoregressive generation correlates with the dependency among target tokens, and that knowledge distillation as well as alignment constraints reduce this dependency and encourage the model to rely more on the source context for target token prediction.
7 Conclusion
Through statistics of sentence ordering, we find that there is an error accumulation problem in previous unidirectional models. To alleviate the problem, we propose a bidirectional ordering method that predicts orders in both the head-to-tail and tail-to-head directions at the same time. Moreover, the two directions interact with and enhance each other. Our method is compatible with and easy to apply to other sentence ordering models. We conduct experiments on four datasets, and the results demonstrate that our method can alleviate the error accumulation problem and improve the performance of previous unidirectional models.
References
Dzmitry Bahdanau, Kyung Hyun Cho, and Yoshua Bengio. 2015. Neural machine translation by jointly learning to align and translate. In Proceedings of the 3rd International Conference on Learning Representations. arXiv:1409.0473. Retrieved from https://arxiv.org/abs/1409.0473.
Regina Barzilay, Noemie Elhadad, and Kathleen R. Mckeown. 2002. Inferring strategies for sentence ordering in multidocument news summarization. Journal of Artificial Intelligence Research 17, 1 (2002), 35–55.
Regina Barzilay and Mirella Lapata. 2005. Modeling local coherence: An entity-based approach. In Proceedings of the 43rd Annual Meeting on Association for Computational Linguistics. 141–148. Retrieved from http://aclweb.org/anthology/P05-1018.
Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of the Human Language Technology Conference of the North American Chapter of the Association for Computational Linguistics. 113–120.
Yoshua Bengio, Jérôme Louradour, Ronan Collobert, and Jason Weston. 2009. Curriculum learning. In Proceedings of the 26th Annual International Conference on Machine Learning. 41–48.
Baiyun Cui, Yingming Li, Ming Chen, and Zhongfei Zhang. 2018. Deep attentive sentence ordering network. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. 4340–4349. Retrieved from http://aclweb.org/anthology/D18-1465.
Baiyun Cui, Yingming Li, and Zhongfei Zhang. 2020. BERT-enhanced relational sentence ordering network. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 6310–6320.
Yang Deng, Wenxuan Zhang, Yaliang Li, Min Yang, Wai Lam, and Ying Shen. 2020. Bridging hierarchical and sequential context modeling for question-driven extractive answer summarization. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval. 1693–1696.
Dimitrios Galanis, Gerasimos Lampouras, and Ion Androutsopoulos. 2012. Extractive multi-document summarization with integer linear programming and support vector regression. In Proceedings of the COLING. 911–926. Retrieved from http://aclweb.org/anthology/C12-1056.
Yoav Goldberg and Michael Elhadad. 2010. An efficient algorithm for easy-first non-directional dependency parsing. In Proceedings of the Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics. Association for Computational Linguistics, 742–750. Retrieved from https://aclanthology.org/N10-1115.
Melika Golestani, Seyedeh Zahra Razavi, Zeinab Borhanifard, Farnaz Tahmasebian, and Hesham Faili. 2021. Using BERT encoding and sentence-level language model for sentence ordering. In Proceedings of the International Conference on Text, Speech, and Dialogue. Springer, 318–330.
Ting Hao Huang, Francis Ferraro, Nasrin Mostafazadeh, Ishan Misra, Aishwarya Agrawal, Jacob Devlin, Ross Girshick, Xiaodong He, Pushmeet Kohli, Dhruv Batra, C. Lawrence Zitnick, Devi Parikh, Lucy Vanderwende, Michel Galley, and Margaret Mitchell. 2016. Visual storytelling. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics. 1233–1239.
Ioannis Konstas and Mirella Lapata. 2012. Concept-to-text generation via discriminative reranking. In Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics. 369–378. Retrieved from http://aclweb.org/anthology/P12-1039.
Ioannis Konstas and Mirella Lapata. 2012. Unsupervised concept-to-text generation with hypergraphs. In Proceedings of the 2012 Conference of the North American Chapter of the Association for Computational Linguistics. 752–761. Retrieved from http://aclweb.org/anthology/N12-1093.
Ioannis Konstas and Mirella Lapata. 2013. Inducing document plans for concept-to-text generation. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing. 1503–1514. Retrieved from http://aclweb.org/anthology/D13-1157.
Mirella Lapata. 2003. Probabilistic text structuring: Experiments with sentence ordering. In Proceedings of the 41st Annual Meeting on Association for Computational Linguistics.
Jason Lee, Elman Mansimov, and Kyunghyun Cho. 2018. Deterministic non-autoregressive neural sequence modeling by iterative refinement. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 1173–1182.
Jiwei Li and Eduard Hovy. 2014. A model of coherence based on distributed sentence representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing. 2039–2048.
Jiwei Li and Dan Jurafsky. 2017. Neural net models of open-domain discourse coherence. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 198–209.
Lajanugen Logeswaran, Honglak Lee, and Dragomir Radev. 2018. Sentence ordering and coherence modeling using recurrent neural networks. In Proceedings of the AAAI Conference on Artificial Intelligence.
Tatsunori Mori, Masanori Nozawa, and Yoshiaki Asada. 2005. Multi-answer-focused multi-document summarization using a question-answering engine. ACM Transactions on Asian Language Information Processing 4, 3 (2005), 305–320.
Lili Mou, Yiping Song, Rui Yan, Ge Li, Lu Zhang, and Zhi Jin. 2016. Sequence to backward and forward sequences: A content-introducing approach to generative short-text conversation. In Proceedings of the 26th International Conference on Computational Linguistics. 3349–3358.
Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the AAAI Conference on Artificial Intelligence.
Byungkook Oh, Seungmin Seo, Cheolheon Shin, Eunju Jo, and Kyong-Ho Lee. 2019. Topic-guided coherence modeling for sentence ordering by preserving global and local information. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 2273–2283.
Naoaki Okazaki, Yutaka Matsuo, and Mitsuru Ishizuka. 2005. Improving chronological ordering of sentences extracted from multiple newspaper articles. ACM Transactions on Asian Language Information Processing 4, 3 (2005), 321–339.
Shrimai Prabhumoye, Ruslan Salakhutdinov, and Alan W. Black. 2020. Topological sort for sentence ordering. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2783–2792.
Yi Ren, Jinglin Liu, Xu Tan, Zhou Zhao, Sheng Zhao, and Tie-Yan Liu. 2020. A study of non-autoregressive model for sequence generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 149–159.
Aili Shen and Timothy Baldwin. 2021. A simple yet effective method for sentence ordering. In Proceedings of the 22nd Annual Meeting of the Special Interest Group on Discourse and Dialogue. 154–160.
Tao Shen, Tianyi Zhou, Guodong Long, Jing Jiang, Shirui Pan, and Chengqi Zhang. 2018. Disan: Directional self-attention network for rnn/cnn-free language understanding. In Proceedings of the AAAI Conference on Artificial Intelligence.
Radu Soricut and Daniel Marcu. 2006. Discourse generation using utility-trained coherence models. In Proceedings of the COLING/ACL on Main Conference Poster Sessions. 803–810. Retrieved from http://aclweb.org/anthology/P06-2103.
Zhixing Tan, Mingxuan Wang, Jun Xie, Yidong Chen, and Xiaodong Shi. 2018. Deep semantic role labeling with self-attention. In Proceedings of the AAAI Conference on Artificial Intelligence.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the International Conference on Neural Information Processing Systems. 5998–6008.
Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Proceedings of the International Conference on Neural Information Processing Systems. 2692–2700.
Hongling Wang and Guodong Zhou. 2012. Toward a unified framework for standard and update multi-document summarization. ACM Transactions on Asian Language Information Processing 11, 2 (2012), 1–18.
Yingce Xia, Fei Tian, Lijun Wu, Jianxin Lin, Tao Qin, Nenghai Yu, and Tie-Yan Liu. 2017. Deliberation networks: Sequence generation beyond one-pass decoding. In Proceedings of the International Conference on Neural Information Processing Systems. 1784–1794.
Yongjing Yin, Fandong Meng, Jinsong Su, Yubin Ge, Lingeng Song, Jie Zhou, and Jiebo Luo. 2020. Enhancing pointer network for sentence ordering with pairwise ordering predictions. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34. 9482–9489.
Adams Wei Yu, David Dohan, Minh-Thang Luong, Rui Zhao, Kai Chen, Mohammad Norouzi, and Quoc V. Le. 2018. Qanet: Combining local convolution with global self-attention for reading comprehension. In Proceedings of the 5th International Conference on Learning Representations.
Long Zhou, Jiajun Zhang, and Chengqing Zong. 2019. Synchronous bidirectional neural machine translation. Transactions of the Association for Computational Linguistics 7 (2019), 91–105.
Junnan Zhu, Lu Xiang, Yu Zhou, Jiajun Zhang, and Chengqing Zong. 2021. Graph-based multimodal ranking models for multimodal summarization. Transactions on Asian and Low-Resource Language Information Processing 20, 4 (2021), 1–21.
Yutao Zhu, Jian-Yun Nie, Kun Zhou, Shengchao Liu, Yabo Ling, and Pan Du. 2021. BERT4SO: Neural sentence ordering by fine-tuning BERT. arXiv:2103.13584. Retrieved from https://arxiv.org/abs/2103.13584.
Yutao Zhu, Kun Zhou, Jian-Yun Nie, Shengchao Liu, and Zhicheng Dou. 2021. Neural sentence ordering based on constraint graphs. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 35. 14656–14664.