Abstract
Modern paraphrase research would benefit from large corpora with detailed annotations. However, such corpora are currently scarce. In this paper, we describe the development of such a corpus for Dutch: a parallel monolingual treebank of over 2 million tokens covering various text genres, including both parallel and comparable text. This publicly available corpus is richly annotated with alignments between syntactic nodes, each classified as one of five semantic similarity relations. A quarter of the corpus is manually annotated, and this part informs the development of an automatic tree aligner used to annotate the remainder. We argue that this corpus is the first of its size and kind and that it offers great potential for paraphrasing research.




Notes
Note that the sentences in Example (1) were constructed for illustrative purposes, even though, for example, caffeine intake has been shown to reduce the risk of Parkinson disease (Ross et al. 2000).
Note that this still allows one-to-many alignments if \( v_{t} \) and \( v_{t}^{\prime} \) are equally similar to \( v_{s} \). In practice, we found such cases to be rare in our data collection, so we opted for excluding one-to-many node alignments. Formally our tree alignment is therefore a tree matching, which is a restricted form of tree alignment in which each node is aligned to at most one other node. Note also that there is no requirement for the alignment to be exhaustive, hence it may also be called a partial tree alignment.
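The matching constraint in this note can be illustrated with a small check. This is only a sketch under assumptions: an alignment is represented as a list of (source node, target node) pairs, and the function name `is_tree_matching` and the node labels are hypothetical, not taken from the corpus tools.

```python
from collections import Counter


def is_tree_matching(alignment):
    """Return True if every node occurs in at most one aligned pair.

    `alignment` is an iterable of (source_node, target_node) pairs.
    Unaligned nodes simply do not appear, so partial alignments pass.
    """
    pairs = list(alignment)
    source_counts = Counter(source for source, _ in pairs)
    target_counts = Counter(target for _, target in pairs)
    return (all(count == 1 for count in source_counts.values())
            and all(count == 1 for count in target_counts.values()))


# Hypothetical node identifiers, for illustration only.
one_to_one = [("s1", "t1"), ("s2", "t3")]    # partial, but a valid matching
one_to_many = [("s1", "t1"), ("s1", "t2")]   # s1 is aligned twice

print(is_tree_matching(one_to_one))
print(is_tree_matching(one_to_many))
```

The first alignment leaves some nodes unaligned and still counts as a matching; the second is rejected because one source node is aligned to two target nodes.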
For instance, the two translations of “On the Origin of Species” are based on different editions and show significant differences, largely due to Darwin’s own revisions. These range from long sentences in one translation being split into multiple sentences in the other to substantial pieces of added or removed text (the 6th edition even has a whole new chapter).
Hitaext is implemented in wxPython, runs on Mac OS X, Linux and Windows, and is freely available as open source software from http://daeso.uvt.nl/hitaext.
Algraeph is a rewritten and extended version of our earlier tool called Gadget. It is implemented in wxPython, runs on Mac OS X, Linux and Windows, and is available as open source software from http://daeso.uvt.nl/algraeph.
This may in part be due, however, to the larger number of non-uniquely aligned trees in the comparable text segments, because each count of aligned nodes only concerns one particular pair of source and target trees.
Since \( \mathrm{precision}(A_{n}, A_{m}) \) equals \( \mathrm{recall}(A_{m}, A_{n}) \), the mean over all pairwise precision scores equals the mean over all pairwise recall scores.
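The symmetry in the last note can be checked on a toy case. The sketch below assumes that each annotator's alignment is a set of (source node, target node) pairs, that the first argument is the alignment being evaluated and the second the reference, and that the means run over all ordered pairs of annotators. The annotator names and node labels are invented for illustration.

```python
from fractions import Fraction
from itertools import permutations
from statistics import mean


def precision(evaluated, reference):
    """Share of the evaluated alignment's pairs that also occur in the reference."""
    return Fraction(len(evaluated & reference), len(evaluated))


def recall(evaluated, reference):
    """Share of the reference alignment's pairs that also occur in the evaluated one."""
    return Fraction(len(evaluated & reference), len(reference))


# Invented alignments by three hypothetical annotators.
annotations = {
    "A1": {("s1", "t1"), ("s2", "t2"), ("s3", "t3")},
    "A2": {("s1", "t1"), ("s2", "t2")},
    "A3": {("s1", "t1"), ("s2", "t2"), ("s3", "t4"), ("s4", "t3")},
}

# All ordered pairs (A_n, A_m) with n != m.
ordered_pairs = list(permutations(annotations.values(), 2))

mean_precision = mean(precision(a_n, a_m) for a_n, a_m in ordered_pairs)
mean_recall = mean(recall(a_n, a_m) for a_n, a_m in ordered_pairs)

print(mean_precision == mean_recall)
```

Swapping the two arguments swaps the denominator, so every precision score for one ordered pair reappears as the recall score for the reversed pair. Exact fractions are used so that the comparison is not affected by floating-point rounding.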
References
Agirre, E., Diab, M., Cer, D., & Gonzalez-Agirre, A. (2012). SemEval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the first joint conference on lexical and computational semantics, Vol. 1: Proceedings of the main conference and the shared task, and Vol. 2: Proceedings of the sixth international workshop on semantic evaluation (pp. 385–393). Association for Computational Linguistics.
Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38, 135–187.
Bannard, C., & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL) (pp. 597–604), Ann Arbor.
Barzilay, R., & Lee, L. (2003). Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT) (pp. 16–23), Morristown, NJ, USA.
Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th meeting of the Association for Computational Linguistics (ACL) (pp. 50–57), Toulouse, France.
Barzilay, R., & McKeown, K. (2005). Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3), 297–328.
Bos, J., & Markert, K. (2005). Recognising textual entailment with logical inference. In Proceedings of the conference on human language technology and empirical methods in natural language processing (HLT-EMNLP) (pp. 628–635).
Bouma, G., van Noord, G., & Malouf, R. (2001). Alpino: Wide-coverage computational analysis of Dutch. In W. Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (Eds.), Computational linguistics in the Netherlands 2000: Selected papers (pp. 45–59). Amsterdam, New York: Rodopi.
Bouma, G., Mur, J., van Noord, G., van der Plas, L., & Tiedemann, J. (2005). Question answering for Dutch using dependency relations. In Proceedings of the CLEF 2005 workshop.
Burnard, L., & Sperberg-McQueen, C. M. (2006). TEI lite: Encoding for interchange: An introduction to the TEI Revised for TEI P5 release. Technical report, Text Encoding Initiative.
Burrows, S., Potthast, M., & Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. ACM TIST.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL) (pp. 17–24), New York City, USA.
Cohn, T., & Lapata, M. (2009). Sentence compression as tree transduction. Journal of Artificial Intelligence Research, 34(1), 637–674.
Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.
Cui, H., Sun, R., Li, K., Kan, M., & Chua, T. (2005). Question answering passage retrieval using dependency relations. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (pp. 400–407).
Daelemans, W., & van den Bosch, A. (2005). Memory-based language processing. Cambridge: Cambridge University Press.
Daelemans, W., Höthker, A., & Tjong Kim Sang, E. (2004). Automatic sentence simplification for subtitling in Dutch and English. In Proceedings of the 4th international conference on language resources and evaluation (LREC) (pp. 1045–1048).
Dagan, I., Glickman, O., & Magnini, B. (2005). The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL challenges workshop on recognising textual entailment, Southampton, UK.
Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognising textual entailment challenge. In J. Quiñonero Candela, I. Dagan, B. Magnini & F. d’Alché Buc (Eds.), Machine learning challenges (pp. 177–190). Berlin, Heidelberg: Springer.
Darwin, C. R. (2001). Het ontstaan van de soorten: door natuurlijke selectie ofwel het bewaard blijven van de rassen die in voordeel zijn in de strijd om het bestaan: de definitieve editie (6th ed.). Amsterdam: Atlas.
Darwin, C. R. (2002). Over het ontstaan van soorten: Door middel van natuurlijke selectie, of het behoud van bevoordeelde rassen in de strijd om het leven. Amsterdam: Nieuwezijds.
Daume, H., & Marcu, D. (2005). Induction of word and phrase alignments for automatic document summarization. Computational Linguistics, 31(4), 505–530.
de Montaigne, M. (2001). Essays. Amsterdam: Boom.
de Montaigne, M. (2004). De essays. Amsterdam: Atheneum, Polak and Van Gennip.
de Saint-Exupéry, A. (1960). De kleine prins. Rotterdam: Donker.
de Saint-Exupéry, A. (2000). De kleine prins. Rotterdam: Donker.
Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on computational linguistics (COLING) (pp. 350–356), Morristown, NJ, USA.
Filippova, K., & Strube, M. (2008). Sentence fusion via dependency graph compression. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 177–185), Morristown, NJ, USA.
Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.
Knight, K., & Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1), 91–107.
Krahmer, E., Marsi, E., & van Pelt, P. (2008). Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of the 46th annual meeting of the Association for Computational Linguistics: Human language technologies (ACL) (pp. 193–196), Columbus, OH, USA.
Lin, D., & Pantel, P. (2001). Discovery of inference rules for question answering. Natural Language Engineering, 7(4), 343–360.
MacCartney, B., & Manning, C. (2008). Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd international conference on computational linguistics (Vol. 1, pp. 521–528).
MacCartney, B., & Manning, C. (2009). An extended model of natural logic. In The eighth international conference on computational semantics (IWCS), Tilburg, The Netherlands.
Madnani, N., & Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3), 341–387.
Marsi, E., & Krahmer, E. (2005a). Classification of semantic relations by humans and machines. In Proceedings of the ACL 2005 workshop on empirical modeling of semantic equivalence and entailment (pp. 1–6), Ann Arbor, MI.
Marsi, E., & Krahmer, E. (2005b). Explorations in sentence fusion. In Proceedings of the 10th European workshop on natural language generation (ENLG), Aberdeen, UK.
Marsi, E., & Krahmer, E. (2008). Detecting semantic overlap: A parallel monolingual treebank for Dutch. In S. Verberne, H. van Halteren & P. A. Coppen (Eds.), Computational linguistics in the Netherlands (CLIN): Selected papers (pp. 69–84), Rodopi, Amsterdam.
Marsi, E., & Krahmer, E. (2010). Automatic analysis of semantic similarity in comparable text through syntactic tree matching. In Proceedings of the 23rd international conference on computational linguistics (COLING) (pp. 752–760), Beijing, China.
Marsi, E., Krahmer, E., Hendrickx, I., & Daelemans, W. (2010). On the limits of sentence compression by deletion. In E. Krahmer & M. Theune (Eds.), Empirical methods in natural language generation (pp. 45–66). Berlin, Heidelberg: Springer.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33, 31–88.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the conference of the North American Chapter of the Association for Computational Linguistics on human language technology (NAACL-HLT) (pp. 181–188).
Pasca, M., & Dienes, P. (2005). Aligning needles in a haystack: Paraphrase acquisition across the web. In Proceedings of the 2nd international joint conference on natural language processing (IJCNLP) (pp. 119–130), Jeju Island, South Korea.
Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics: Posters (pp. 997–1005). Association for Computational Linguistics.
Punyakanok, V., Roth, D., & Yih, W. (2004). Mapping dependencies trees: An application to question answering. In Proceedings of the eighth international symposium on artificial intelligence and mathematics, Fort Lauderdale, FL.
Quirk, C., Brockett, C. C., & Dolan, W. (2004). Monolingual machine translation for paraphrase generation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 142–149), Barcelona, Spain.
Radev, D., & McKeown, K. (1998). Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3), 469–500.
Reynaert, M. (2007). Sentence-splitting and tokenization in D-Coi. Technical report 07-07. ILK Research Group.
Ross, G. W., Abbott, R. D., Petrovitch, H., Morens, D. M., Grandinetti, A., Tung, K. H., et al. (2000). Association of coffee and caffeine intake with the risk of Parkinson disease. The Journal of the American Medical Association (JAMA), 283, 2674–2679.
Samuelsson, Y., & Volk, M. (2006). Phrase alignment in parallel treebanks. In Proceedings of 5th workshop on treebanks and linguistic theories, Prague, Czech Republic.
Shen, S., Radev, D. R., Patel, A., & Erkan, G. (2006). Adding syntax to dynamic programming for aligning comparable texts for the generation of paraphrases. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics (COLING-ACL) (pp. 747–754), Sydney, Australia.
Shinyama, Y., Sekine, S., Sudo, K., & Grishman, R. (2002). Automatic paraphrase acquisition from news articles. In Proceedings of the human language technology conference (HLT 2002) (pp. 313–318), San Diego, USA.
Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2–3), 117–127. doi:10.1007/s10590-009-9062-9.
Tiedemann, J., & Kotzé, G. (2009). Building a large machine-aligned parallel treebank. In Eighth international workshop on treebanks and linguistic theories (TLT) (pp. 197–208).
van den Bosch, A., & Bouma, G. (2011). Interactive multi-modal question-answering. Berlin, Heidelberg: Springer.
van Noord, G. (2006). At last parsing is now operational. In P. Mertens, C. Fairon, A. Dister & P. Watrin (Eds.), TALN06. Verbum Ex Machina. Actes de la 13e conference sur le traitement automatique des langues naturelles (pp. 20–42).
van Rijsbergen, C. (1979). Information retrieval (2nd ed.). London, Boston: Butterworth.
van der Wouden, T., Hoekstra, H., Moortgat, M., Renmans, B., & Schuurman, I. (2002). Syntactic analysis in the Spoken Dutch Corpus. In Proceedings of the 3rd international conference on language resources and evaluation (LREC) (pp. 768–773), Las Palmas, Canary Islands, Spain.
Vossen, P., Maks, I., Segers, R., & van der Vliet, H. (2008). Integrating lexical units, synsets and ontology in the Cornetto database. In Proceedings of the 6th international conference on language resources and evaluation (LREC), Marrakech, Morocco.
Wan, S., Dale, R., Dras, M., & Paris, C. (2007). Global revision in summarisation: Generating novel sentences with Prim’s algorithm. In Proceedings of the 10th conference of the Pacific Association for Computational Linguistics (pp. 19–21).
Wubben, S., van den Bosch, A., Krahmer, E., & Marsi, E. (2009). Clustering and matching headlines for automatic paraphrase acquisition. In The 12th European workshop on natural language generation (ENLG) (pp. 122–125), Athens.
Zhao, S., Wang, H., Liu, T., & Li, S. (2008). Pivot approach for extracting paraphrase patterns from bilingual corpora. In Proceedings of the 46th annual meeting of the Association for Computational Linguistics: Human language technologies (ACL-HLT) (pp. 780–788), Columbus, OH.
Zhechev, V., & Way, A. (2008). Automatic generation of parallel treebanks. In Proceedings of the 22nd international conference on computational linguistics (COLING) (pp. 1105–1112).
Zhou, L., Lin, C. Y., & Hovy, E. (2006). Re-evaluating machine translation results with paraphrase support. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 77–84). Association for Computational Linguistics.
Zhu, Z., Bernhard, D., & Gurevych, I. (2010). A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd international conference on computational linguistics (pp. 1353–1361). Association for Computational Linguistics.
Acknowledgments
We would like to thank Nienke Eckhardt, Koen van Lierop, Vera Nijveld, Paul van Pelt, Hanneke Schoormans and Jurry de Vos for all their annotation work, Erik Tjong Kim Sang and colleagues for the autocue–subtitle material from the ATRANOS project, Gosse Bouma and others from the IMIX project for the QA reference corpus, Wauter Bosma for mining the headlines from Google News, and Sander Wubben for automatic subclustering of headlines. We would also like to express our gratitude to the publishers and press agencies for providing the raw text material. We thank the anonymous reviewers for their constructive remarks. This work was conducted within the DAESO project (2006–2010) funded by the Stevin program (De Nederlandse Taalunie; The Dutch Language Union).
Cite this article
Marsi, E., Krahmer, E. Construction of an aligned monolingual treebank for studying semantic similarity. Lang Resources & Evaluation 48, 279–306 (2014). https://doi.org/10.1007/s10579-013-9252-1