Construction of an aligned monolingual treebank for studying semantic similarity

Marsi, Erwin; Krahmer, Emiel

doi:10.1007/s10579-013-9252-1

Original Paper
Published: 04 October 2013

Volume 48, pages 279–306, (2014)
Language Resources and Evaluation

Erwin Marsi¹ &
Emiel Krahmer²

Abstract

Modern paraphrase research would benefit from large corpora with detailed annotations. However, currently these corpora are still thin on the ground. In this paper, we describe the development of such a corpus for Dutch, which takes the form of a parallel monolingual treebank consisting of over 2 million tokens and covering various text genres, including both parallel and comparable text. This publicly available corpus is richly annotated with alignments between syntactic nodes, which are also classified using five different semantic similarity relations. A quarter of the corpus is manually annotated, and this informs the development of an automatic tree aligner used to annotate the remainder of the corpus. We argue that this corpus is the first of this size and kind, and offers great potential for paraphrasing research.

Notes

Note that the sentences in Example (1) were constructed for illustrative purposes, even though, for example, caffeine intake has been shown to reduce the risk of Parkinson disease (Ross et al. 2000).
Note that this still allows one-to-many alignments if $v_t$ and $v_{t}^{\prime}$ are equally similar to $v_s$. In practice, we found such cases to be rare in our data collection, so we opted for excluding one-to-many node alignments. Formally our tree alignment is therefore a tree matching, which is a restricted form of tree alignment in which each node is aligned to at most one other node. Note also that there is no requirement for the alignment to be exhaustive, hence it may also be called a partial tree alignment.
http://daeso.uvt.nl.
For instance, the two translations of “On the Origin of Species” are based on different editions and show significant differences, largely due to Darwin’s own revisions. These range from long sentences in one translation being split into multiple sentences in the other to substantial pieces of added or removed text (the 6th edition even has a whole new chapter).
Hitaext is implemented in wxPython, runs on Mac OS X, Linux and Windows, and is freely available as open source software from http://daeso.uvt.nl/hitaext.
Algraeph is a rewritten and extended version of our earlier tool called Gadget. It is implemented in wxPython, runs on Mac OS X, Linux and Windows, and is available as open source software from http://daeso.uvt.nl/algraeph.
This may in part be due, however, to the larger number of non-uniquely aligned trees in the comparable text segments, because each count of aligned nodes only concerns one particular pair of source and target trees.
Since precision($A_n$, $A_m$) equals recall($A_m$, $A_n$), the mean over all pairwise precision scores equals the mean over all pairwise recall scores.
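
The tree-matching constraint and the precision/recall symmetry described in the notes above can be illustrated with a minimal sketch. It rests on assumptions that are not taken from the article: an alignment is modelled as a set of (source node, target node) pairs, the node identifiers and annotator data are made up, and precision and recall are given their usual set-overlap definitions with the second argument treated as the reference.

```python
from itertools import permutations

# Assumption: an alignment is a set of (source_node, target_node) pairs.
# The node identifiers below are hypothetical, not the corpus's actual IDs.

def is_matching(alignment):
    """True if every node is aligned to at most one other node."""
    sources = [s for s, _ in alignment]
    targets = [t for _, t in alignment]
    return len(sources) == len(set(sources)) and len(targets) == len(set(targets))

def precision(a_n, a_m):
    """Share of the pairs in a_n that also occur in a_m (a_m is the reference)."""
    return len(a_n & a_m) / len(a_n) if a_n else 0.0

def recall(a_n, a_m):
    """Share of the pairs in the reference a_m that are also found in a_n."""
    return len(a_n & a_m) / len(a_m) if a_m else 0.0

# Made-up alignments of the same tree pair by three annotators.
annotators = [
    {("s1", "t1"), ("s2", "t2"), ("s3", "t4")},
    {("s1", "t1"), ("s2", "t2")},
    {("s1", "t1"), ("s2", "t3"), ("s3", "t4"), ("s5", "t5")},
]

# All ordered pairs of distinct annotators, so each (n, m) also appears as (m, n).
ordered_pairs = list(permutations(annotators, 2))
mean_precision = sum(precision(a, b) for a, b in ordered_pairs) / len(ordered_pairs)
mean_recall = sum(recall(a, b) for a, b in ordered_pairs) / len(ordered_pairs)

print(all(is_matching(a) for a in annotators))
print(abs(mean_precision - mean_recall) < 1e-12)
```

Because every ordered pair (n, m) is accompanied by its mirror image (m, n), the two means are sums over the same values, which is why a single averaged score suffices in this setting.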

References

Agirre, E., Diab, M., Cer, D., & Gonzalez-Agirre, A. (2012). Semeval-2012 task 6: A pilot on semantic textual similarity. In Proceedings of the first joint conference on lexical and computational semantics, Vol. 1: Proceedings of the main conference and the shared task, and Vol. 2: Proceedings of the sixth international workshop on semantic evaluation (pp. 385–393). Association for Computational Linguistics.
Androutsopoulos, I., & Malakasiotis, P. (2010). A survey of paraphrasing and textual entailment methods. Journal of Artificial Intelligence Research, 38, 135–187.
Bannard, C., & Callison-Burch, C. (2005). Paraphrasing with bilingual parallel corpora. In Proceedings of the 43rd annual meeting of the Association for Computational Linguistics (ACL) (pp. 597–604), Ann Arbor.
Barzilay, R., & Lee, L. (2003). Learning to paraphrase: An unsupervised approach using multiple-sequence alignment. In Proceedings of the conference of the North American chapter of the Association for Computational Linguistics on Human Language Technology (NAACL-HLT) (pp. 16–23), Morristown, NJ, USA.
Barzilay, R., & McKeown, K. (2001). Extracting paraphrases from a parallel corpus. In Proceedings of the 39th meeting of the Association for Computational Linguistics (ACL) (pp. 50–57), Toulouse, France.
Barzilay, R., & McKeown, K. (2005). Sentence fusion for multidocument news summarization. Computational Linguistics, 31(3), 297–328.
Bos, J., & Markert, K. (2005). Recognising textual entailment with logical inference. In Proceedings of the conference on human language technology and empirical methods in natural language processing (HLT-EMNLP) (pp. 628–635).
Bouma, G., van Noord, G., & Malouf, R. (2001). Alpino: Wide-coverage computational analysis of Dutch. In W. Daelemans, K. Sima’an, J. Veenstra & J. Zavrel (Eds.), Computational linguistics in the Netherlands 2000: Selected papers (pp. 45–59). Amsterdam, New York: Rodopi.
Bouma, G., Mur, J., van Noord, G., van der Plas, L., & Tiedemann, J. (2005). Question answering for Dutch using dependency relations. In Proceedings of the CLEF 2005 workshop.
Burnard, L., & Sperberg-McQueen, C. M. (2006). TEI lite: Encoding for interchange: An introduction to the TEI Revised for TEI P5 release. Technical report, Text Encoding Initiative.
Burrows, S., Potthast, M., & Stein, B. (2012). Paraphrase acquisition via crowdsourcing and machine learning. ACM TIST.
Callison-Burch, C., Koehn, P., & Osborne, M. (2006). Improved statistical machine translation using paraphrases. In Proceedings of the human language technology conference of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL) (pp. 17–24), New York City, USA.
Cohn, T., & Lapata, M. (2009). Sentence compression as tree transduction. Journal of Artificial Intelligence Research, 34(1), 637–674.
Cohn, T., Callison-Burch, C., & Lapata, M. (2008). Constructing corpora for the development and evaluation of paraphrase systems. Computational Linguistics, 34(4), 597–614.
Cui, H., Sun, R., Li, K., Kan, M., & Chua, T. (2005). Question answering passage retrieval using dependency relations. In Proceedings of the 28th annual international ACM SIGIR conference on research and development in information retrieval (pp. 400–407).
Daelemans, W., & van den Bosch, A. (2005). Memory-based language processing. Cambridge: Cambridge University Press.
Daelemans, W., Höthker, A., & Tjong Kim Sang, E. (2004). Automatic sentence simplification for subtitling in Dutch and English. In Proceedings of the 4th international conference on language resources and evaluation (LREC) (pp. 1045–1048).
Dagan, I., Glickman, O., & Magnini, B. (2005). The PASCAL recognising textual entailment challenge. In Proceedings of the PASCAL challenges workshop on recognising textual entailment, Southampton, UK.
Dagan, I., Glickman, O., & Magnini, B. (2006). The PASCAL recognising textual entailment challenge. In J. Quiñonero Candela, I. Dagan, B. Magnini & F. d’Alché Buc (Eds.), Machine learning challenges (pp. 177–190). Berlin, Heidelberg: Springer.
Darwin, C. R. (2001). Het ontstaan van de soorten: door natuurlijke selectie ofwel het bewaard blijven van de rassen die in voordeel zijn in de strijd om het bestaan: de definitieve editie (6th ed.). Amsterdam: Atlas.
Darwin, C. R. (2002). Over het ontstaan van soorten: Door middel van natuurlijke selectie, of het behoud van bevoordeelde rassen in de strijd om het leven. Amsterdam: Nieuwezijds.
Daume, H., & Marcu, D. (2005). Induction of word and phrase alignments for automatic document summarization. Computational Linguistics, 31(4), 505–530.
de Montaigne, M. (2001). Essays. Amsterdam: Boom.
de Montaigne, M. (2004). De essays. Amsterdam: Atheneum, Polak and Van Gennip.
de Saint-Exupéry, A. (1960). De kleine prins. Rotterdam: Donker.
de Saint-Exupéry, A. (2000). De kleine prins. Rotterdam: Donker.
Dolan, B., Quirk, C., & Brockett, C. (2004). Unsupervised construction of large paraphrase corpora: Exploiting massively parallel news sources. In Proceedings of the 20th international conference on computational linguistics (COLING) (pp. 350–356), Morristown, NJ, USA.
Filippova, K., & Strube, M. (2008). Sentence fusion via dependency graph compression. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 177–185), Morristown, NJ, USA.
Gale, W. A., & Church, K. W. (1993). A program for aligning sentences in bilingual corpora. Computational Linguistics, 19(1), 75–102.
Knight, K., & Marcu, D. (2002). Summarization beyond sentence extraction: A probabilistic approach to sentence compression. Artificial Intelligence, 139(1), 91–107.
Krahmer, E., Marsi, E., & van Pelt, P. (2008). Query-based sentence fusion is better defined and leads to more preferred results than generic sentence fusion. In Proceedings of the 46th annual meeting of the Association for Computational Linguistics: Human language technologies (ACL) (pp. 193–196), Columbus, OH, USA.
Lin, D., & Pantel, P. (2001). Discovery of inference rules for question answering. Natural Language Engineering, 7(4), 343–360.
MacCartney, B., & Manning, C. (2008). Modeling semantic containment and exclusion in natural language inference. In Proceedings of the 22nd international conference on computational linguistics (Vol. 1, pp. 521–528).
MacCartney, B., & Manning, C. (2009). An extended model of natural logic. In The eighth international conference on computational semantics (IWCS), Tilburg, The Netherlands.
Madnani, N., & Dorr, B. J. (2010). Generating phrasal and sentential paraphrases: A survey of data-driven methods. Computational Linguistics, 36(3), 341–387.
Marsi, E., & Krahmer, E. (2005a). Classification of semantic relations by humans and machines. In Proceedings of the ACL 2005 workshop on empirical modeling of semantic equivalence and entailment (pp. 1–6), Ann Arbor, MI.
Marsi, E., & Krahmer, E. (2005b). Explorations in sentence fusion. In Proceedings of the 10th European workshop on natural language generation (ENLG), Aberdeen, UK.
Marsi, E., & Krahmer, E. (2008). Detecting semantic overlap: A parallel monolingual treebank for Dutch. In S. Verberne, H. van Halteren & P. A. Coppen (Eds.), Computational linguistics in the Netherlands (CLIN): Selected papers (pp. 69–84), Rodopi, Amsterdam.
Marsi, E., & Krahmer, E. (2010). Automatic analysis of semantic similarity in comparable text through syntactic tree matching. In Proceedings of the 23rd international conference on computational linguistics (COLING) (pp. 752–760), Beijing, China.
Marsi, E., Krahmer, E., Hendrickx, I., & Daelemans, W. (2010). On the limits of sentence compression by deletion. In E. Krahmer & M. Theune (Eds.), Empirical methods in natural language generation (pp. 45–66). Berlin, Heidelberg: Springer.
Navarro, G. (2001). A guided tour to approximate string matching. ACM Computing Surveys, 33, 31–88.
Och, F. J., & Ney, H. (2003). A systematic comparison of various statistical alignment models. Computational Linguistics, 29(1), 19–51.
Pang, B., Knight, K., & Marcu, D. (2003). Syntax-based alignment of multiple translations: Extracting paraphrases and generating new sentences. In Proceedings of the conference of the North American Chapter of the Association for Computational Linguistics on human language technology (NAACL-HLT) (pp. 181–188).
Pasca, M., & Dienes, P. (2005). Aligning needles in a haystack: Paraphrase acquisition across the web. In Proceedings of the 2nd international joint conference on natural language processing (IJCNLP) (pp. 119–130), Jeju Island, South Korea.
Potthast, M., Stein, B., Barrón-Cedeño, A., & Rosso, P. (2010). An evaluation framework for plagiarism detection. In Proceedings of the 23rd international conference on computational linguistics: Posters (pp. 997–1005). Association for Computational Linguistics.
Punyakanok, V., Roth, D., & Yih, W. (2004). Mapping dependencies trees: An application to question answering. In Proceedings of the eighth international symposium on artificial intelligence and mathematics, Fort Lauderdale, FL.
Quirk, C., Brockett, C. C., & Dolan, W. (2004). Monolingual machine translation for paraphrase generation. In Proceedings of the conference on empirical methods in natural language processing (EMNLP) (pp. 142–149), Barcelona, Spain.
Radev, D., & McKeown, K. (1998). Generating natural language summaries from multiple on-line sources. Computational Linguistics, 24(3), 469–500.
Reynaert, M. (2007). Sentence-splitting and tokenization in d-coi. Technical report 07-07. ILK Research Group.
Ross, G. W., Abbott, R. D., Petrovitch, H., Morens, D. M., Grandinetti, A., Tung, K. H., et al. (2000). Association of coffee and caffeine intake with the risk of Parkinson disease. The Journal of the American Medical Association (JAMA), 283, 2674–2679.
Samuelsson, Y., & Volk, M. (2006). Phrase alignment in parallel treebanks. In Proceedings of 5th workshop on treebanks and linguistic theories, Prague, Czech Republic.
Shen, S., Radev, D. R., Patel, A., & Erkan, G. (2006). Adding syntax to dynamic programming for aligning comparable texts for the generation of paraphrases. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the Association for Computational Linguistics (COLING-ACL) (pp. 747–754), Sydney, Australia.
Shinyama, Y., Sekine, S., Sudo, K., & Grishman, R. (2002). Automatic paraphrase acquisition from news articles. In Proceedings of the human language technology conference (HLT 2002) (pp. 313–318), San Diego, USA.
Snover, M., Madnani, N., Dorr, B., & Schwartz, R. (2009). TER-Plus: Paraphrase, semantic, and alignment enhancements to translation edit rate. Machine Translation, 23(2–3), 117–127. doi:10.1007/s10590-009-9062-9.
Tiedemann, J., & Kotzé, G. (2009). Building a large machine-aligned parallel treebank. In Eighth international workshop on treebanks and linguistic theories (TLT) (pp. 197–208).
van den Bosch, A., & Bouma, G. (2011). Interactive multi-modal question-answering. Berlin, Heidelberg: Springer.
van Noord, G. (2006). At last parsing is now operational. In P. Mertens, C. Fairon, A. Dister & P. Watrin (Eds.), TALN06. Verbum Ex Machina. Actes de la 13e conference sur le traitement automatique des langues naturelles (pp. 20–42).
van Rijsbergen, C. (1979). Information retrieval (2nd ed.). London, Boston: Butterworth.
van der Wouden, T., Hoekstra, H., Moortgat, M., Renmans, B., & Schuurman, I. (2002). Syntactic analysis in the Spoken Dutch Corpus. In Proceedings of the 3rd international conference on language resources and evaluation (LREC) (pp. 768–773), Las Palmas, Canary Islands, Spain.
Vossen, P., Maks, I., Segers, R., & van der Vliet, H. (2008). Integrating lexical units, synsets and ontology in the Cornetto database. In Proceedings of the 6th international conference on language resources and evaluation (LREC), Marrakech, Morocco.
Wan, S., Dale, R., Dras, M., & Paris, C. (2007). Global revision in summarisation: Generating novel sentences with Prim’s algorithm. In Proceedings of the 10th conference of the Pacific Association for Computational Linguistics (pp. 19–21).
Wubben, S., van den Bosch, A., Krahmer, E., & Marsi, E. (2009). Clustering and matching headlines for automatic paraphrase acquisition. In The 12th European workshop on natural language generation (ENLG) (pp. 122–125), Athens.
Zhao, S., Wang, H., Liu, T., & Li, S. (2008). Pivot approach for extracting paraphrase patterns from bilingual corpora. In Proceedings of the 46th annual meeting of the Association for Computational Linguistics: Human language technologies (ACL-HLT) (pp. 780–788), Columbus, OH.
Zhechev, V., & Way, A. (2008). Automatic generation of parallel treebanks. In Proceedings of the 22nd international conference on computational linguistics (COLING) (pp. 1105–1112).
Zhou, L., Lin, C. Y., & Hovy, E. (2006). Re-evaluating machine translation results with paraphrase support. In Proceedings of the 2006 conference on empirical methods in natural language processing (pp. 77–84). Association for Computational Linguistics.
Zhu, Z., Bernhard, D., & Gurevych, I. (2010). A monolingual tree-based translation model for sentence simplification. In Proceedings of the 23rd international conference on computational linguistics (pp. 1353–1361). Association for Computational Linguistics.

Acknowledgments

We would like to thank Nienke Eckhardt, Koen van Lierop, Vera Nijveld, Paul van Pelt, Hanneke Schoormans and Jurry de Vos for all their annotation work, Erik Tjong Kim Sang and colleagues for the autocue–subtitle material from the ATRANOS project, Gosse Bouma and others from the IMIX project for the QA reference corpus, Wauter Bosma for mining the headlines from Google News, and Sander Wubben for automatic subclustering of headlines. We would also like to express our gratitude to the publishers and press agencies for providing the raw text material. We thank the anonymous reviewers for their constructive remarks. This work was conducted within the DAESO project (2006–2010) funded by the Stevin program (De Nederlandse Taalunie; The Dutch Language Union).

Author information

Authors and Affiliations

Department of Computer and Information Science, Norwegian University of Science and Technology, Sem Sælands vei 7-9, 7491, Trondheim, Norway
Erwin Marsi
Tilburg Center for Cognition and Communication (TiCC), Tilburg University, P.O. Box 90153, 5000 LE, Tilburg, The Netherlands
Emiel Krahmer

Corresponding author

Correspondence to Emiel Krahmer.

About this article

Cite this article

Marsi, E., Krahmer, E. Construction of an aligned monolingual treebank for studying semantic similarity. Lang Resources & Evaluation 48, 279–306 (2014). https://doi.org/10.1007/s10579-013-9252-1

Published: 04 October 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10579-013-9252-1
