Outsourcing Parsebanking: The FinnTreeBank Project

  • Chapter
Shall We Play the Festschrift Game?

Abstract

Morphological and syntactic annotation of large text corpora for use as empirical data in corpus-linguistic research is typically a labour- and expertise-intensive process that takes several years. We outline an ongoing project, FIN-CLARIN FinnTreeBank, that uses outsourcing to enable high-quality annotation according to specification on a large scale (tens of millions of words). We describe the main stages of the project: task specification, subcontractor selection, and collaboration with the subcontractor to enable successful delivery evaluation.

Notes

  1. http://www.helsinki.fi/fin-clarin.

  2. http://www.kotus.fi.

Acknowledgements

We gratefully acknowledge the ongoing software and programming support of the Helsinki HFST Team, in particular the help of Tommi Pirinen and Sam Hardwick with Finnish morphological analysis and various corpus processing tasks. We also thank Nick Ostler and Wanjiku Nganga for constructive comments on an earlier draft. The project has been funded via CLARIN, FIN-CLARIN, FIN-CLARIN-CONTENT and META-NORD by the EU, the University of Helsinki and the Academy of Finland.

Author information

Corresponding author

Correspondence to Atro Voutilainen.

Copyright information

© 2012 Springer-Verlag Berlin Heidelberg

About this chapter

Cite this chapter

Voutilainen, A., Purtonen, T., Muhonen, K. (2012). Outsourcing Parsebanking: The FinnTreeBank Project. In: Santos, D., Lindén, K., Ng’ang’a, W. (eds) Shall We Play the Festschrift Game?. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-30773-7_9

  • DOI: https://doi.org/10.1007/978-3-642-30773-7_9

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-30772-0

  • Online ISBN: 978-3-642-30773-7

  • eBook Packages: Computer Science, Computer Science (R0)
