Exploiting discourse information to identify paraphrases

https://doi.org/10.1016/j.eswa.2013.10.018

Highlights

  • We show the relation between discourse units and paraphrasing.

  • We propose a new method for computing text similarity based on elementary discourse units.

  • We apply the method to the task of paraphrase identification.

  • We achieved 93.4% accuracy in experiments conducted on the PAN corpus.

Abstract

Previous work on paraphrase identification using sentence similarities has not exploited discourse structures, which have been shown to carry important information for paraphrase computation. In this paper, we propose a new method, named EDU-based similarity, to compute the similarity between two sentences based on elementary discourse units. Unlike conventional methods, which compute similarities directly on sentences, our method divides sentences into discourse units and employs them to compute similarities. We also show the relation between paraphrases and discourse units, which play an important role in paraphrasing. We apply our method to the paraphrase identification task. Experimental results on the PAN corpus, a large corpus for detecting paraphrases, show the effectiveness of using discourse information for identifying paraphrases. We achieve 93.1% accuracy with a single SVM classifier and 93.4% accuracy with a maximal voting model.

Introduction

Paraphrase identification is the task of determining whether two sentences have essentially the same meaning. This task has been shown to play an important role in many natural language applications, including text summarization (Barzilay, McKeown, & Elhadad, 1999), question answering (Duboue & Chu-Carroll, 2006), machine translation (Callison-Burch, Koehn, & Osborne, 2006), natural language generation (Ganitkevitch, Callison-Burch, Napoles, & Van Durme, 2011), and plagiarism detection (Uzuner, Katz, & Nahnsen, 2005). For example, detecting paraphrase sentences would help a text summarization system to avoid adding redundant information.

Paraphrase identification is not an easy task. Consider the following two sentence pairs. The first pair is a paraphrase although the two sentences share only a few words, while the second is not a paraphrase even though the two sentences contain almost all the same words.

  • “That would indeed be a great blessing.” and “The Lord had indeed fulfilled his hopes, and answered his prayers.”

  • “Peter usually goes to the cinema with his girlfriend.” and “Peter never goes to the cinema with his girlfriend.”

Although the paraphrase identification task is defined in terms of semantics, it is usually modeled as a binary classification problem, which can be solved by training a statistical classifier. Many methods have been proposed for identifying paraphrases. These methods usually employ the similarity between two sentences as features, which are computed based on words (Fernando & Stevenson, 2008; Kozareva & Montoyo, 2006; Mihalcea et al., 2006), n-grams (Das & Smith, 2009; Kozareva & Montoyo, 2006), syntactic parse trees (Das & Smith, 2009; Rus et al., 2008; Socher et al., 2011), WordNet (Kozareva & Montoyo, 2006; Mihalcea et al., 2006), and MT metrics, the automated metrics for evaluating translation quality (Madnani, Tetreault, & Chodorow, 2012).

Recently, several studies have shown that discourse structures deliver important information for paraphrase computation. For example, to extract paraphrases, Deléger and Zweigenbaum (2009) match similar paragraphs in comparable texts. Regneri and Wang (2012) extend to the discourse level the distributional hypothesis that entities are similar if they share similar contexts. According to them, sentences that play the same role in a certain discourse and have a similar discourse context can be paraphrases, even if a semantic similarity model does not consider them very similar. Using this assumption, Regneri and Wang (2012) introduce a method for collecting paraphrases based on the sequential event order in discourse. However, both Deléger and Zweigenbaum (2009) and Regneri and Wang (2012) consider only some special kinds of data, where the discourse structures can be easily extracted.

Complete discourse structures, such as the discourse trees in the RST Discourse Treebank (RST-DT) (Carlson, Marcu, & Okurowski, 2002), are difficult to extract, though they can be very useful for paraphrase computation (Regneri & Wang, 2012). To produce such a complete discourse structure for a text, we first segment the text into several elementary discourse units (EDUs). Each EDU may be a simple sentence or a clause in a complex sentence. Consecutive EDUs are then put in relation with each other to create a discourse tree (Mann & Thompson, 1988). An example of a discourse tree with three EDUs is shown in Fig. 1. Existing fully automatic discourse parsing systems are neither robust nor very precise (Bach et al., 2012b; Joty et al., 2013; Regneri & Wang, 2012). In recent years, however, several high-performance discourse segmenters have been introduced (Bach et al., 2012a; Hernault et al., 2010; Joty et al., 2012). The discourse segmenter described in Bach et al. (2012a) achieves an F1 score of 91.0% on the RST-DT corpus when using Stanford parse trees (Klein & Manning, 2003).

In this paper, we present a new method to compute the similarity between two sentences based on elementary discourse units (EDU-based similarity). We first segment the two sentences into EDUs using a discourse segmenter trained on the RST-DT corpus. These EDUs are then employed for computing the similarity between the two sentences. The key idea is that for each EDU in one sentence, we try to find the most similar EDU in the other sentence and compute the similarity between them. We show how our method can be applied to the paraphrase identification task. Experimental results on the PAN corpus (Madnani et al., 2012) show that our method is effective for this task. Our work is the first to employ discourse units for computing similarity as well as for identifying paraphrases.
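The key idea of matching each EDU to its most similar counterpart can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's implementation: the EDUs are assumed to be already segmented, the token-overlap function `overlap` is a toy stand-in for a real similarity measure, and both the length weighting and the averaging of the two directions are choices made here for illustration only.

```python
def tokens(text):
    """Lowercase whitespace tokenisation (a simplification)."""
    return text.lower().split()


def overlap(a, b):
    """Toy similarity: fraction of the tokens of `a` that also occur in `b`."""
    ta, tb = tokens(a), set(tokens(b))
    if not ta:
        return 0.0
    return sum(1 for t in ta if t in tb) / len(ta)


def edu_similarity_ordered(edus1, edus2, sim=overlap):
    """For each EDU in edus1, take its best match in edus2.

    Each best-match score is weighted by the EDU's share of the
    sentence length (an assumption made for this sketch).
    """
    total_len = sum(len(tokens(e)) for e in edus1)
    if total_len == 0 or not edus2:
        return 0.0
    score = 0.0
    for e in edus1:
        best = max(sim(e, other) for other in edus2)
        score += len(tokens(e)) / total_len * best
    return score


def edu_similarity(edus1, edus2, sim=overlap):
    """Symmetrise by averaging both directions (also an assumption)."""
    forward = edu_similarity_ordered(edus1, edus2, sim)
    backward = edu_similarity_ordered(edus2, edus1, sim)
    return 0.5 * (forward + backward)


# Hypothetical, hand-segmented EDUs for two sentences with reordered clauses.
s1 = ["although it rained", "we went out"]
s2 = ["we went out", "although it rained"]
print(edu_similarity(s1, s2))
```

Because every EDU in `s1` has an exact counterpart in `s2`, the two reordered sentences score 1.0 here, which a strictly position-sensitive sentence-level comparison would not necessarily give.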

The rest of this paper is organized as follows. We first present related work in Section 2. Section 3 describes the relation between paraphrases and discourse units. Section 4 presents our method, EDU-based similarity. In Section 5, we introduce our discourse segmentation system, which will be used to segment sentences into elementary discourse units. Experiments on the paraphrase identification task are described in Section 6. Section 7 presents some types of errors that our method made during experiments. Finally, Section 8 concludes the paper and discusses future work.

Section snippets

Related work

There have been many studies on the paraphrase identification task. Finch, Hwang, and Sumita (2005) use several MT metrics, including BLEU (Papineni, Roukos, Ward, & Zhu, 2002), NIST (Doddington, 2002), WER (Niessen, Och, Leusch, & Ney, 2000), and PER (Leusch, Ueffing, & Ney, 2003), as features for an SVM classifier. Wan, Dras, Dale, and Paris (2006) combine BLEU features with some others extracted from dependency relations and tree edit-distance, and also take SVMs as the learning method to train a

Paraphrases and discourse units

In this section, we describe the relation between paraphrases and discourse units. We will show that discourse units are blocks which play an important role in paraphrasing.

Fig. 2 shows an example of a paraphrase sentence pair. In this example, the first sentence can be divided into three elementary discourse units (EDUs), 1A, 1B, and 1C, and the second sentence can also be segmented into three EDUs, 2A, 2B, and 2C. Comparing these six EDUs, we can see that they make three aligned pairs of

EDU-based similarity

Motivated by the analysis of the relation between paraphrases and discourse units, we propose a method to compute the similarity between two sentences. In this section, we assume that each sentence can be represented as a sequence of elementary discourse units (EDUs). The method of segmenting sentences into EDUs will be presented in Section 5.

First, we present the notion of ordered similarity functions. Given two arbitrary texts t1 and t2, an ordered similarity function Sim_ordered(t1, t2) will

A model for discourse segmentation

This section presents our model for segmenting sentences into elementary discourse units. Our model exploits subtree features to rerank the N-best outputs of a base segmenter, which uses syntactic and lexical features in a CRF framework. We first briefly introduce the discriminative reranking method in Section 5.1. We then introduce our base model in Section 5.2 and our method of extracting subtree features for the reranking model in Section 5.3.
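The reranking step described above can be illustrated with a generic linear reranker. The sketch below is not the authors' model: the candidate format, the feature name `subtree:SBAR_at_boundary`, and the weights are all hypothetical, and the scoring rule (base log-probability plus a weighted sum of features) is a common form of discriminative reranking assumed here for illustration.

```python
def rerank(candidates, weights, base_weight=1.0):
    """Return the candidate with the highest linear score.

    Each candidate is a dict with:
      "edus"     - the proposed segmentation (list of strings)
      "log_prob" - log-probability assigned by the base segmenter
      "features" - mapping from feature name to value
    The score is base_weight * log_prob + sum(weight * feature value).
    """
    def score(candidate):
        feature_score = sum(
            weights.get(name, 0.0) * value
            for name, value in candidate["features"].items()
        )
        return base_weight * candidate["log_prob"] + feature_score

    return max(candidates, key=score)


# Hypothetical 2-best list for one sentence.
n_best = [
    {"edus": ["He left because he was tired"],
     "log_prob": -1.0,
     "features": {"subtree:SBAR_at_boundary": 0}},
    {"edus": ["He left", "because he was tired"],
     "log_prob": -1.5,
     "features": {"subtree:SBAR_at_boundary": 1}},
]
weights = {"subtree:SBAR_at_boundary": 1.0}  # illustrative, not learned

best = rerank(n_best, weights)
print(best["edus"])
```

With these illustrative weights, the second candidate wins (score -0.5 against -1.0) even though the base segmenter ranked it lower, which is the effect a reranker is meant to achieve.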

Experiments

This section describes our experiments on the paraphrase identification task using EDU-based similarities as features for statistical classifiers. Similar to the work of Madnani et al. (2012), we employed MT metrics as the ordered similarity functions. However, we computed the MT metrics based on EDUs in addition to MT metrics based on sentences. In all experiments, parse trees were obtained by using the Stanford parser (Klein & Manning, 2003).
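As a rough illustration of how sentence-level and EDU-level scores might be combined into one feature vector for a classifier, consider the sketch below. It rests on several assumptions: clipped unigram precision is used as a simple stand-in for the MT metrics, the EDU-level score is an unweighted mean of best matches, and the function names and the four-feature layout are hypothetical rather than taken from the paper.

```python
from collections import Counter


def unigram_precision(candidate, reference):
    """Clipped unigram precision of `candidate` against `reference`.

    The score is asymmetric: swapping the arguments can change it.
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    total = sum(cand.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[word]) for word, count in cand.items())
    return matched / total


def best_match_mean(edus_a, edus_b, metric):
    """Mean, over EDUs in edus_a, of the best metric score against edus_b."""
    if not edus_a or not edus_b:
        return 0.0
    return sum(max(metric(a, b) for b in edus_b) for a in edus_a) / len(edus_a)


def pair_features(sent1, sent2, edus1, edus2, metric=unigram_precision):
    """Feature vector: sentence-level scores, then EDU-level scores."""
    return [
        metric(sent1, sent2),                   # sentence level, direction 1
        metric(sent2, sent1),                   # sentence level, direction 2
        best_match_mean(edus1, edus2, metric),  # EDU level, direction 1
        best_match_mean(edus2, edus1, metric),  # EDU level, direction 2
    ]


features = pair_features(
    "the cat sat", "the cat sat down",
    ["the cat sat"], ["the cat sat down"],
)
print(features)
```

A vector like this would be computed for every sentence pair and passed to a statistical classifier such as an SVM; in practice one such group of features would be produced per metric.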

Error analysis

This section identifies the causes of the errors that our method made on the test data of the PAN corpus, which includes 3,000 sentence pairs. First, we gathered statistics on the experimental results on the test data. We considered the following questions:

  • 1. For each sentence pair in the test set, how many of the seven base models produced a correct output?

  • 2. How many sentence pairs were predicted correctly by at least one base model? And therefore, how many sentence

Conclusion and future work

We have presented a study on exploiting discourse information to identify paraphrases. By introducing a new method for computing text similarity based on discourse units, we showed that discourse structure provides important information for paraphrase identification. Unlike previous work, our method was not limited to any kind of text. The main contributions of our work can be summarized in the following points:

  • 1. We presented the first work on relations between discourse units and paraphrasing,

References (53)

  • Rushdi Saleh, M., et al. (2011). Experiments with SVM to classify opinions in different domains. Expert Systems with Applications.
  • Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual...
  • Attardi, G., & Ciaramita, M. (2007). Tree revision learning for dependency parsing. In Proceedings of the annual...
  • Bach, N. X., Minh, N. L., & Shimazu, A. (2012a). A reranking model for discourse segmentation using subtree features....
  • Bach, N. X., Minh, N. L., & Shimazu, A. (2012b). UDRST: A novel system for unlabeled discourse parsing in the RST...
  • Barzilay, R., McKeown, K. R., & Elhadad, M. (1999). Information fusion in the context of multi-document summarization....
  • Bentivogli, L., Dagan, I., Dang, H.T., Giampiccolo, D., & Magnini, B. (2009). The fifth Pascal recognizing textual...
  • Boella, G., & Di Caro, L. (2013). Extracting definitions and hypernym relations relying on syntactic dependencies and...
  • Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In...
  • Carlson, L., Marcu, D., & Okurowski, M. E. (2002). RST discourse...
  • Chan, Y. S., & Ng, H. T. (2008). MAXSIM: A maximum similarity metric for machine translation evaluation. In Proceedings...
  • Chang, C. C., et al. (2011). LIBSVM: A library for support vector machines. ACM Transactions on Intelligent Systems and Technology.
  • Collins, M., et al. (2005). Discriminative reranking for natural language parsing. Computational Linguistics.
  • Das, D., & Smith, N. A. (2009). Paraphrase identification as probabilistic quasi-synchronous recognition. In...
  • Deléger, L., & Zweigenbaum, P. (2009). Extracting lay paraphrases of specialized expressions from monolingual...
  • Denkowski, M., & Lavie, A. (2010). Extending the METEOR machine translation metric to the phrase level. In Proceedings...
  • Doddington, G. (2002). Automatic evaluation of machine translation quality using N-gram co-occurrence statistics. In...
  • Duboue, P. A., & Chu-Carroll, J. (2006). Answering the question you wish they had asked: The impact of paraphrasing for...
  • Fernando, S., & Stevenson, M. (2008). A semantic similarity approach to paraphrase detection. In Proceedings of the...
  • Finch, A., Hwang, Y. S., & Sumita, E. (2005). Using machine translation evaluation techniques to determine...
  • Ganitkevitch, J., Callison-Burch, C., Napoles, C., & Van Durme, B. (2011). Learning sentential paraphrases from...
  • Habash, N., & Kholy, A. E. (2008). SEPIA: Surface span extension to syntactic dependency precision-based mt evaluation....
  • Hernault, H., Bollegala, D., & Ishizuka, M. (2010). A sequential model for discourse segmentation. In Proceedings of...
  • Joty, S., Carenini, G., & Ng, R. T. (2012). A novel discriminative framework for sentence-level discourse analysis. In...
  • Joty, S., Carenini, G., Ng, R., & Mehdad, Y. (2013). Combining intra- and multi-sentential rhetorical parsing for...
  • Klein, D., & Manning, C. (2003). Accurate unlexicalized parsing. In Proceedings of the 41st annual meeting of the...

Cited by (17)

    • Elementary discourse units with sparse attention for multi-label emotion classification

      2022, Knowledge-Based Systems
      Citation Excerpt :

      However, existing methods for multi-label emotion classification fail to effectively capture clause or EDUs information. EDUs are clause-like units that serve as building blocks for discourse parsing in Rhetorical Structure Theory [16], which has been demonstrated to be useful for many NLP tasks [17–19]. Therefore, modeling the associations between labels and EDUs can be beneficial to multi-label emotion classification.

    • Learning short-text semantic similarity with word embeddings and external knowledge sources

      2019, Knowledge-Based Systems
      Citation Excerpt :

      The authors reported the classifier achieves 83% and 93% accuracy on MRPC and P4PIN, respectively. Having recognized that discourse structure may significantly constitute the meaning of a sentence, the proposed method in [12] divides a sentence into elementary discourse units (EDUs) and measures STSS based on the similarity between their elementary discourse units. The authors showed that the similarity between EDUs plays an important role in paraphrasing.

    • Boosting paraphrase detection through textual similarity metrics with abductive networks

      2015, Applied Soft Computing Journal
      Citation Excerpt :

      This corpus is a collection of sentence pairs that has been created and used in [35] from the human-created plagiarism instances of the PAN 2010 competition corpus [49]. It has also been adopted recently in other research work, e.g., [42,43]. The corpus has a training dataset of 10,000 sentence pairs and a testing dataset of 3000 sentence pairs.

    • A Paraphrase Identification Approach in Paragraph length texts

      2022, Proceedings of the 15th International Conference on Educational Data Mining, EDM 2022
    • A Paraphrase Identification Approach in Paragraph Length Texts

      2022, IEEE International Conference on Data Mining Workshops, ICDMW
    • Role of discourse information in Urdu sentiment classification: A Rule-based Method and Machine-learning Technique

      2019, ACM Transactions on Asian and Low-Resource Language Information Processing

    This paper is an improved and extended version of the paper: EDU-Based Similarity for Paraphrase Identification, presented at International Conference on Applications of Natural Language to Information Systems (NLDB), 2013, UK.
