Abstract
Paraphrase corpora annotated with the types of paraphrases they contain constitute an essential resource for understanding the phenomenon of paraphrasing and for improving paraphrase-related systems in natural language processing. In this article, a new annotation scheme for paraphrase-type annotation is set out, together with newly created measures for the computation of inter-annotator agreement. Three corpora, different in nature and in two languages, have been annotated using this infrastructure. The annotation results and the inter-annotator agreement scores for these corpora demonstrate the adequacy and robustness of our proposal.



Notes
See Madnani and Dorr (2010), Section 5 for a discussion on this topic.
http://research.microsoft.com/en-us/downloads/607d14d9-20cd-47e3-85bc-a2f65cd28042/. The readme of the corpus contains a discussion on when a pair of sentences should be considered a paraphrase and when it should not, according to their approach.
See Vila et al. (2013) for a more general state of the art on paraphrase corpora. See Vila et al. (2014) for a state of the art on paraphrase typologies: “paraphrase typology” does not equal “paraphrase-type annotation scheme”, but typologies are the linguistic knowledge on which annotation schemes may be based. In this section, and in this article in general, we focus on the latter.
http://www.cs.york.ac.uk/semeval-2012/task6/. Although Semeval organisers distinguish between semantic textual similarity and paraphrasing, the former being a sort of graded paraphrasing, this distinction is not relevant here.
Annotation guidelines are available at http://clic.ub.edu/corpus/en/paraphrases-en.
All the examples in this article are extracted from the three annotated corpora, namely P4P, MSRP-A, and WRPA-A. Typos in the original corpora have not been corrected.
It should be taken into account that the corpora we annotate consist of positive cases of paraphrasing; therefore, non-paraphrases or non-paraphrase fragments are a minority.
We refer to the tags with small capital letters and sometimes using short names, e.g., synthetic/analytic for synthetic/analytic substitutions.
We use the subscript \(w\) (words) instead of \(t\) (tokens) in order to avoid confusion with the superscript \(t\) (type) that will appear in what follows.
http://clt.mq.edu.au/research/projects/hoo/hoo2011/index.html. See also Dale and Kilgarriff (2011) and Dale and Narroway (2012).
The \(\pi \) and \(\kappa \) factors can be omitted from the calculation (i.e., they can be set to 1) if they are not relevant, as in Barrón-Cedeño et al. (2013).
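To illustrate the chance-correction idea behind such factors, the following is a minimal sketch of Cohen's (1960) \(\kappa \) for two annotators over nominal labels. It is not the article's own agreement measure (which handles multiple labels per fragment); the label strings and data are hypothetical, chosen only to echo the paraphrase-type tags discussed in the text.

```python
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Cohen's (1960) kappa for two annotators assigning nominal labels."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators label identically.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected chance agreement, from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    # Kappa: observed agreement corrected for chance agreement.
    return (p_o - p_e) / (1 - p_e)

# Two annotators labelling six paraphrase fragments (hypothetical data):
a = ["same-polarity", "synthetic/analytic", "same-polarity",
     "addition", "addition", "same-polarity"]
b = ["same-polarity", "synthetic/analytic", "addition",
     "addition", "addition", "same-polarity"]
print(round(cohen_kappa(a, b), 3))  # → 0.739
```

Setting the chance-correction factor to 1, as the note describes, amounts to reporting the raw observed agreement \(p_o\) instead of the corrected value.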
Annotated corpora are available at http://clic.ub.edu/corpus/en/paraphrases-en as a downloadable package and as a search interface.
The translation is ours.
Strong punctuation marks are full stops, semi-colons, question marks, exclamation marks, and other punctuation marks that can divide autonomous text fragments (in general, sentences or clauses), such as parentheses, hyphens, or colons.
For reasons of space, we do not include the per-type scores of inter-annotator agreement. Instead, we point out the most relevant issues in this respect.
Dolan and Brockett’s (2005) agreement value and ours are not directly comparable, as they represent different measures for diverging tasks with different degrees of complexity. Nevertheless, we consider that obtaining a value in line with that of Dolan and Brockett’s (2005) simpler task shows that ours can be considered a satisfactory result.
References
Agirre, E., Cer, D., Diab, M., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the 1st joint conference on lexical and computational semantics (*SEM 2012) (pp. 385–393). Montréal.
Amigó, E., Giménez, J., Gonzalo, J., & Màrquez, L. (2006). MT evaluation: Human-like vs. human acceptable. In Proceedings of the 21st international conference on computational linguistics and the 44th annual meeting of the association for computational linguistics (COLING/ACL 2006) (pp. 17–24). Sydney.
Baeza-Yates, R., & Ribeiro-Neto, B. (1999). Modern information retrieval. Boston: Addison-Wesley Longman Publishing Co.
Barrón-Cedeño, A., Vila, M., Martí, M. A., & Rosso, P. (2013). Plagiarism meets paraphrasing: Insights for the next generation in automatic plagiarism detection. Computational Linguistics, 39(4), 917–947.
Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th annual meeting of the association for computational linguistics (ACL 2001) (pp. 50–57). Toulouse.
Bès, G. G., & Fuchs, C. (1988). Introduction. In Lexique et paraphrase (pp. 7–11). Presses Universitaires de Lille.
Bhagat, R. (2009). Learning paraphrases from Text, Ph.D. thesis. University of Southern California, Los Angeles.
Chen, D. L., & Dolan, W. B. (2011). Collecting highly parallel data for paraphrase evaluation. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies (ACL/HLT 2011) (Vol 1, pp. 190–200). Portland.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.
Dale, R., & Kilgarriff, A. (2011). Helping our own: The HOO 2011 pilot shared task. In Proceedings of the 13th European workshop on natural language generation (ENLG 2011) (pp. 242–249). Nancy.
Dale, R., & Narroway, G. (2011). The HOO pilot data set: Notes on release 2.0. Resource document. http://clt.mq.edu.au/research/projects/hoo/hoo2011/files/HOOReleaseNotes20110621.pdf. Accessed 8 February 2013.
Dale, R., & Narroway, G. (2012). A framework for evaluating text correction. In Proceedings of the 8th international conference on language resources and evaluation (LREC 2012) (pp. 3015–3018). Istanbul.
Dolan, W. B., & Brockett, C. (2005). Automatically constructing a corpus of sentential paraphrases. In Proceedings of the 3rd international workshop on paraphrasing (IWP 2005) (pp. 9–16). Jeju Island.
Dutrey, C., Bernhard, D., Bouamor, H., & Max, A. (2011). Local modifications and paraphrases in Wikipedia’s revision history. Procesamiento del Lenguaje Natural, 46, 51–58.
España-Bonet, C., Vila, M., Rodríguez, H., & Martí, M. A. (2009). CoCo, a web interface for corpora compilation. Procesamiento del Lenguaje Natural, 43, 367–368.
Fleiss, J. L. (1981). Statistical methods for rates and proportions. New York: Wiley.
Fuchs, C. (1988). Paraphrases prédicatives et contraintes énonciatives. In: Bès G., & Fuchs C. (Eds.), Lexique et Paraphrase, no. 6 in Lexique, Presses Universitaires de Lille, Villeneuve d’Ascq (pp. 157–171).
Hovy, E., Lin, C. Y., Zhou, L., & Fukumoto, J. (2006). Automated summarization evaluation with basic elements. In Proceedings of the 5th international conference on language resources and evaluation (LREC 2006) (pp. 899–902). Genoa.
Kupper, L. L., & Hafner, K. B. (1989). On assessing interrater agreement for multiple attribute responses. Biometrics, 45(3), 957–967.
Lin, C. Y., & Hovy, E. (2003). Automatic evaluation of summaries using n-gram co-occurrence statistics. In Proceedings of the 4th annual meeting of the North American chapter of the association for computational linguistics: Human language technologies (NAACL/HLT 2003) (Vol. 1, pp. 71–78). Edmonton.
Lin, C. Y., & Och, F. J. (2004). ORANGE: A method for evaluating automatic evaluation metrics for machine translation. In Proceedings of the 20th international conference on computational linguistics (COLING 2004). Geneva.
Liu, C., Dahlmeier, D., & Ng, H. T. (2010). PEM: A paraphrase evaluation metric exploiting parallel texts. In Proceedings of the 2010 conference on empirical methods in natural language processing (EMNLP 2010) (pp. 923–932). Cambridge.
Madnani, N., & Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3), 341–387.
Max, A., & Wisniewski, G. (2010). Mining naturally-occurring corrections and paraphrases from Wikipedia’s revision history. In Proceedings of the 7th international conference on language resources and evaluation (LREC 2010) (pp. 3143–3148). Valletta.
Milićević, J. (2007). La paraphrase. Modélisation de la paraphrase langagière. Bern: Peter Lang.
Nenkova, A., & Passonneau, R. (2004). Evaluating content selection in summarization: The pyramid method. In Proceedings of the 5th annual meeting of the North American chapter of the association for computational linguistics: Human language technologies (NAACL/HLT 2004) (pp. 145–152). Boston.
Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics (COLING 2010) (pp. 997–1005). Beijing.
Recasens, M., & Vila, M. (2010). On paraphrase and coreference. Computational Linguistics, 36(4), 639–647.
Romano, L., Kouylekov, M., Szpektor, I., Dagan, I., & Lavelli, A. (2006). Investigating a generic paraphrase-based approach for relations extraction. In Proceedings of the 11th conference of the European chapter of the association for computational linguistics (EACL 2006) (pp. 409–416). Trento.
Vila, M., & Dras, M. (2012). Tree edit distance as a baseline approach for paraphrase representation. Procesamiento del Lenguaje Natural, 48, 89–95.
Vila, M., Rodríguez, H., & Martí, M. A. (2013). Relational paraphrase acquisition from Wikipedia: The WRPA method and corpus. Natural Language Engineering. doi:10.1017/S1351324913000235.
Vila, M., Martí, M. A., & Rodríguez, H. (2014). Is this a paraphrase? What kind? Paraphrase boundaries and typology. Open Journal of Modern Linguistics, 4, 205–218.
Zaenen, A. (2006). Mark-up barking up the wrong tree. Computational Linguistics, 32(4), 577–580.
Acknowledgments
We are grateful to the people who participated in the annotation of the corpora: Rita Zaragoza, Montse Nofre, Patricia Fernández, and Oriol Borrega. We would also like to thank Alberto Barrón-Cedeño for his help in shaping the inter-annotator agreement formulae. This work is supported by the Spanish government through the projects DIANA (TIN2012-38603-C02-02) and SKATER (TIN2012-38584-C06-01) from Ministerio de Ciencia e Innovación, as well as an FPU Grant (AP2008-02185) from Ministerio de Educación, Cultura y Deporte.
Vila, M., Bertran, M., Martí, M.A. et al. Corpus annotation with paraphrase types: new annotation scheme and inter-annotator agreement measures. Lang Resources & Evaluation 49, 77–105 (2015). https://doi.org/10.1007/s10579-014-9272-5