Abstract
Morphological and syntactic annotation of large text corpora for use as empirical data in corpus linguistic research is typically a labour- and expertise-intensive, multi-year process. We outline an ongoing project, FIN-CLARIN FinnTreeBank, that uses outsourcing to enable high-quality annotation according to specification on a large scale (tens of millions of words). We describe the main stages of the project: task specification, subcontractor selection, and collaboration with the subcontractor to enable successful delivery evaluation.
Acknowledgements
We gratefully acknowledge the ongoing software and programming support of the Helsinki HFST Team, in particular the help of Tommi Pirinen and Sam Hardwick with Finnish morphological analysis and various corpus processing tasks. We also thank Nick Ostler and Wanjiku Nganga for constructive comments on an earlier draft. The project has been funded via CLARIN, FIN-CLARIN, FIN-CLARIN-CONTENT and META-NORD by the EU, the University of Helsinki and the Academy of Finland.
Copyright information
© 2012 Springer-Verlag Berlin Heidelberg
About this chapter
Cite this chapter
Voutilainen, A., Purtonen, T., Muhonen, K. (2012). Outsourcing Parsebanking: The FinnTreeBank Project. In: Santos, D., Lindén, K., Ng’ang’a, W. (eds) Shall We Play the Festschrift Game?. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30773-7_9
DOI: https://doi.org/10.1007/978-3-642-30773-7_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-30772-0
Online ISBN: 978-3-642-30773-7
eBook Packages: Computer Science, Computer Science (R0)