Efficient corpus development for lexicography: building the New Corpus for Ireland

  • Original Paper
  • Language Resources and Evaluation

Abstract

In a 12-month project we have developed a new, register-diverse, 55-million-word bilingual corpus—the New Corpus for Ireland (NCI)—to support the creation of a new English-to-Irish dictionary. The paper describes the strategies we employed, and the solutions to problems encountered. We believe we have a good model for corpus creation for lexicography, and others may find it useful as a blueprint. The corpus has two parts, one Irish, the other Hiberno-English (English as spoken in Ireland). We describe its design, collection and encoding.

Notes

  1. The project is under the direction of Foras na Gaeilge, the government-funded body responsible for the promotion of the Irish language throughout the island of Ireland, whose statutory functions include the development of new dictionaries (http://www.forasnagaeilge.ie). Full details of the NEID project can be found at http://www.focloir.ie. The main contractor for setting up the project, including corpus preparation, is Lexicography MasterClass Ltd (http://www.lexmasterclass.com/).

  2. Figures from the 2002 Census.

  3. Irish is taught throughout the school system, and about 30,000 students are educated in Irish-medium schools, ‘Gaelscoileanna’.

  4. See http://natcorp.ox.ac.uk

  5. See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05

  6. While this is clearly also true of English worldwide, it is a lesser consideration for English produced in Ireland, where English is the mother tongue of an overwhelming majority of the population.

  7. See http://www.ul.ie/~lcie/

  8. Since the work was done, the shingling algorithm (Broder, Glassman, Manasse, & Zweig, 1997) has become widely known as the leading tool for de-duplication.

  9. Constraint Grammar vislcg downloadable at http://www.sourceforge.net

  10. For alternative work on Irish grammar checking see: http://borel.slu.edu/gramadoir/
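
As an illustration of the shingling idea mentioned in note 8, the following is a minimal sketch, not the method used to build the NCI (the note says shingling became widely known only after the work was done). It computes the exact resemblance of two texts as the overlap of their sets of w-word shingles. The function names, the shingle width of 4 and the whitespace tokenisation are assumptions made for this example. Broder et al. (1997) additionally estimate this value from small fixed-size sketches so that it scales to very large collections, which is not shown here.

```python
def shingles(text, w=4):
    """Return the set of contiguous w-word sequences (shingles) in text.

    Tokenisation is plain lowercased whitespace splitting (an assumption).
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < w:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + w]) for i in range(len(tokens) - w + 1)}


def resemblance(a, b, w=4):
    """Jaccard overlap of the shingle sets of a and b, between 0 and 1."""
    sa, sb = shingles(a, w), shingles(b, w)
    if not sa and not sb:
        return 1.0  # two empty texts are treated as identical
    return len(sa & sb) / len(sa | sb)


doc1 = "the quick brown fox jumps over the lazy dog"
doc2 = "the quick brown fox jumps over the lazy cat"
print(round(resemblance(doc1, doc2), 3))
```

Pairs of documents whose resemblance exceeds a chosen threshold could then be treated as near-duplicates, with one copy kept. The threshold is a tuning decision that the source does not specify.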

References

  • An Roinn Oideachais. (1986). Foclóir Póca English-Irish/Irish-English Dictionary. Baile Átha Cliath: An Gúm.

  • Atkins, B. T. S. (2002). Then and now: Competence and performance in 35 years of lexicography. In Braasch & Povlsen (Eds.), Proceedings of the Tenth Euralex Congress (pp. 1–28). Denmark: University of Copenhagen.

  • Atkins, B. T. S., Clear, J. H., & Ostler, N. (1992). Corpus design criteria. Literary and Linguistic Computing, 7(1), 1–16.

  • Beesley, K. & Karttunen, L. (2003). Finite state morphology. California: CSLI Publications.

  • Broder, A., Glassman, S., Manasse, M., & Zweig, G. (1997). Syntactic clustering on the Web. In Proceedings of the 6th International World-Wide Web Conference.

  • Census of Ireland. (2002). Volume 11 Irish language. Tables 7A and 31A. http://www.cso.ie/

  • Clough, P., Gaizauskas, R., Piao, S. & Wilks, Y. (2002). MeTeR, Measuring Text Reuse. Proc. 40th Anniversary Meeting for the Association for Computational Linguistics (ACL-02) (pp. 152–159). 7–12 July, University of Pennsylvania, Philadelphia, USA.

  • Christian Brothers. (1980). New Irish grammar. Dublin: Fallons.

  • de Bhaldraithe, T. (1959). English–Irish dictionary. Baile Átha Cliath: An Gúm.

  • Grefenstette, G., & Nioche, J. (2000). Estimation of English and non-English Language Use on the WWW. Proc. RIAO (Recherche d’Informations Assistee par Ordinateur), Paris.

  • Janes, A. (2004). Bilingual comparable corpora for bilingual lexicography. MSc Dissertation, University of Brighton.

  • Johnson, S. (1747). The plan of an English dictionary.

  • Jones, R. & Ghani, R. (2000). Automatically building a corpus for a minority language from the web. 38th Meeting of the ACL, Proceedings of the Student Research Workshop (pp. 29–36). Hong Kong.

  • Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (Eds.) (1995). Constraint grammar: A language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.

  • Karttunen, L. & Beesley, K. (1992). Two-level rule compiler. Technical report, Xerox PARC.

  • Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. Proceedings of the Eleventh Euralex Congress (pp. 105–116). France: UBS Lorient.

  • Kilgarriff, A., & Grefenstette, G. (2003). Web as Corpus: Introduction to the special issue. Computational Linguistics, 29(3), 333–347.

  • Schulze, B. & Christ, O. (1994). The IMS Corpus Workbench. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.

  • Tapanainen, P. (1996). The Constraint Grammar Parser CG-2. Publication No. 27, University of Helsinki.

  • Trench, R. C. (1857). On some deficiencies in our English dictionaries. London: The Philological Society. (reprinted at http://www.oed.com/archive/paper-deficiencies/).

  • Uí Dhonnchadha, E. (2002). An analyser and generator for Irish inflectional morphology using finite state transducers. Unpublished MSc Thesis: Dublin, DCU.

  • Uí Dhonnchadha, E., Nic Pháidín, C., & Van Genabith, J. (2003). Design, implementation and evaluation of an inflectional morphology finite-state transducer for Irish. In MT Journal - Special issue on finite state language resources and language processing. Kluwer.

  • Uí Dhonnchadha, E., & Van Genabith, J. (2005). Scaling an Irish FST morphology engine for use on unrestricted text. In Proceedings of FSMNLP 2005, Helsinki, September 2005.

Acknowledgements

In addition to the authors, the main corpus-development team comprised Steve Finch, Eamon Keegan, Eoghan Mac Aogáin, Mark McLauchlan, Lisa Nic Shea, Jo O’Donoghue, Paul Atkins, Pavel Rychly and Dan Xu, all of whom deserve our heartfelt gratitude. We would also like to thank Seosamh Ó Murchú, Foras na Gaeilge’s Project Manager for the NEID, for his supportive role; Josef van Genabith of Dublin City University, for arranging the student internships; Dónall Ó Riagáin for helpful advice at the corpus design stage; John Kirk of the Queen’s University, Belfast, for permission to use NICTS; and Anne O’Keefe and Fiona Farr of the University of Limerick, for permission to use the Limerick Corpus of Irish English.

Author information

Corresponding author

Correspondence to Adam Kilgarriff.

About this article

Cite this article

Kilgarriff, A., Rundell, M. & Uí Dhonnchadha, E. Efficient corpus development for lexicography: building the New Corpus for Ireland. Lang Resources & Evaluation 40, 127–152 (2006). https://doi.org/10.1007/s10579-006-9011-7

  • DOI: https://doi.org/10.1007/s10579-006-9011-7
