Efficient corpus development for lexicography: building the New Corpus for Ireland

Kilgarriff, Adam; Rundell, Michael; Uí Dhonnchadha, Elaine

doi:10.1007/s10579-006-9011-7

Original Paper
Published: 23 December 2006

Volume 40, pages 127–152, (2006)

Language Resources and Evaluation

Adam Kilgarriff¹,
Michael Rundell¹ &
Elaine Uí Dhonnchadha²

Abstract

In a 12-month project we have developed a new, register-diverse, 55-million-word bilingual corpus—the New Corpus for Ireland (NCI)—to support the creation of a new English-to-Irish dictionary. The paper describes the strategies we employed, and the solutions to problems encountered. We believe we have a good model for corpus creation for lexicography, and others may find it useful as a blueprint. The corpus has two parts, one Irish, the other Hiberno-English (English as spoken in Ireland). We describe its design, collection and encoding.

Notes

The project is under the direction of Foras na Gaeilge, the government-funded body responsible for the promotion of the Irish language throughout the island of Ireland, whose statutory functions include the development of new dictionaries (http://www.forasnagaeilge.ie). Full details of the New English-Irish Dictionary (NEID) project can be found at http://www.focloir.ie. The main contractor for setting up the project, including corpus preparation, is Lexicography MasterClass Ltd (http://www.lexmasterclass.com/).
Figures from the 2002 Census.
Irish is taught throughout the school system, and about 30,000 students are educated in Irish-medium schools, ‘Gaelscoileanna’.
See http://natcorp.ox.ac.uk
See http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T05
While this is clearly also true of English worldwide, it is a lesser consideration for English produced in Ireland, where English is the mother tongue of an overwhelming majority of the population.
See http://www.ul.ie/~lcie/
Since the work was done, the shingling algorithm (Broder, Glassman, Manasse, & Zweig, 1997) has become widely known as the leading tool for de-duplication.
The Constraint Grammar tool vislcg is downloadable at http://www.sourceforge.net
For alternative work on Irish grammar checking see: http://borel.slu.edu/gramadoir/
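
The note above on de-duplication mentions shingling (Broder et al., 1997). As a rough illustration only, and not the method used to build the NCI, the idea can be sketched in Python: each document is reduced to its set of overlapping word k-grams ("shingles"), and two documents are treated as near-duplicates when the Jaccard resemblance of those sets is high. The function names, the shingle width k=4 and the 0.5 threshold are assumptions chosen for this example. The sketch compares full shingle sets pairwise and does not implement the fingerprint sampling that lets the technique scale to large collections.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Texts shorter than k words yield a single shingle (or none if empty).
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Jaccard resemblance of the shingle sets of two documents."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are treated as identical
    return len(sa & sb) / len(sa | sb)


def deduplicate(docs, threshold=0.5, k=4):
    """Keep each document only if it is not too similar to one already kept.

    Quadratic in the number of documents, so suitable for small samples only.
    The threshold value is an assumption, not a figure from the paper.
    """
    kept = []
    for doc in docs:
        if all(resemblance(doc, other, k) < threshold for other in kept):
            kept.append(doc)
    return kept


if __name__ == "__main__":
    sample = [
        "the quick brown fox jumps over the lazy dog",
        "the quick brown fox jumps over the lazy dog today",
        "an entirely different sentence about corpus design",
    ]
    for doc in deduplicate(sample):
        print(doc)
```

In this sample the second text shares all but one of its shingles with the first, so only the first and third texts are kept.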

References

An Roinn Oideachais. (1986). Foclóir Póca English-Irish/Irish-English Dictionary. Baile Átha Cliath: An Gúm.
Atkins, B. T. S. (2002). Then and now: Competence and performance in 35 years of lexicography. In Braasch & Povlsen (Eds.), Proceedings of the Tenth Euralex Congress (pp. 1–28). Denmark: University of Copenhagen.
Atkins, B. T. S., Clear, J. H., & Ostler, N. (1992). Corpus design criteria. Journal of Literary and Linguistic Computing, 1–16.
Beesley, K. & Karttunen, L. (2003). Finite state morphology. California: CSLI Publications.
Broder, A., Glassman, S., Manasse, M., & Zweig, G. (1997). Syntactic clustering on the Web. In Proceedings of the 6th International World-Wide Web Conference.
Census of Ireland. (2002). Volume 11: Irish language. Tables 7A and 31A. http://www.cso.ie/
Clough, P., Gaizauskas, R., Piao, S. & Wilks, Y. (2002). MeTeR, Measuring Text Reuse. Proc. 40th Anniversary Meeting for the Association for Computational Linguistics (ACL-02) (pp. 152–159). 7–12 July, University of Pennsylvania, Philadelphia, USA.
Christian Brothers. (1980). New Irish grammar. Dublin: Fallons.
de Bhaldraithe, T. (1959). English–Irish dictionary. Baile Átha Cliath: An Gúm.
Grefenstette, G., & Nioche, J. (2000). Estimation of English and non-English Language Use on the WWW. Proc. RIAO (Recherche d’Informations Assistee par Ordinateur), Paris.
Janes, A. (2004). Bilingual comparable corpora for bilingual lexicography. MSc Dissertation, University of Brighton.
Johnson, S. (1747). The plan of an English dictionary.
Jones, R. & Ghani, R. (2000). Automatically building a corpus for a minority language from the web. 38th Meeting of the ACL, Proceedings of the Student Research Workshop (pp. 29–36). Hong Kong.
Karlsson, F., Voutilainen, A., Heikkilä, J., & Anttila, A. (Eds.) (1995). Constraint grammar: A language-independent system for parsing unrestricted text. Mouton de Gruyter, Berlin and New York.
Karttunen, L. & Beesley, K. (1992). Two-level rule compiler. Technical report, Xerox PARC.
Kilgarriff, A., Rychly, P., Smrz, P., & Tugwell, D. (2004). The Sketch Engine. Proceedings of the Eleventh Euralex Congress (pp. 105–116). France: UBS Lorient.
Kilgarriff, A., & Grefenstette, G. (2003). Web as Corpus: Introduction to the special issue. Computational Linguistics, 29(3), 333–347.
Schulze, B. & Christ, O. (1994). The IMS Corpus Workbench. Institut für maschinelle Sprachverarbeitung, Universität Stuttgart.
Tapanainen, P. (1996). The Constraint Grammar Parser CG-2. Publication No. 27, University of Helsinki.
Trench, R. C. (1857). On some deficiencies in our English dictionaries. London: The Philological Society. (reprinted at http://www.oed.com/archive/paper-deficiencies/).
Uí Dhonnchadha, E. (2002). An analyser and generator for Irish inflectional morphology using finite state transducers. Unpublished MSc Thesis: Dublin, DCU.
Uí Dhonnchadha, E., Nic Pháidín, C., & Van Genabith, J. (2003). Design, implementation and evaluation of an inflectional morphology finite-state transducer for Irish. In MT Journal - Special issue on finite state language resources and language processing. Kluwer.
Uí Dhonnchadha, E., & Van Genabith, J. (2005). Scaling an Irish FST morphology engine for use on unrestricted text. In Proceedings of FSMNLP 2005, Helsinki, September 2005.

Acknowledgements

In addition to the authors, the main corpus-development team comprised Steve Finch, Eamon Keegan, Eoghan Mac Aogáin, Mark McLauchlan, Lisa Nic Shea, Jo O’Donoghue, Paul Atkins, Pavel Rychly and Dan Xu, all of whom deserve our heartfelt gratitude. We would also like to thank Seosamh Ó Murchú, Foras na Gaeilge’s Project Manager for the NEID, for his supportive role; Josef van Genabith of Dublin City University, for arranging the student internships; Dónall Ó Riagáin for helpful advice at the corpus design stage; John Kirk of the Queen’s University, Belfast, for permission to use NICTS; and Anne O’Keefe and Fiona Farr of the University of Limerick, for permission to use the Limerick Corpus of Irish English.

Author information

Authors and Affiliations

Lexicography MasterClass Ltd, Brighton, UK
Adam Kilgarriff & Michael Rundell
Trinity College, Dublin, Ireland
Elaine Uí Dhonnchadha

Corresponding author

Correspondence to Adam Kilgarriff.

About this article

Cite this article

Kilgarriff, A., Rundell, M. & Uí Dhonnchadha, E. Efficient corpus development for lexicography: building the New Corpus for Ireland. Lang Resources & Evaluation 40, 127–152 (2006). https://doi.org/10.1007/s10579-006-9011-7

Received: 21 July 2005
Accepted: 16 October 2006
Published: 23 December 2006
Issue Date: May 2006
DOI: https://doi.org/10.1007/s10579-006-9011-7