Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers

  • Original Paper
Machine Translation

Abstract

This research begins by distinguishing a small number of “central” languages from the “noncentral languages”, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted, and we establish a number of important differences which have far-reaching consequences for NCLPs. To overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools and argue that NCLPs, in their own interests, should embrace an open-source approach for the resources they develop and pool these resources together with other similar open-source resources. The expected advantages of this approach are so significant that we suggest that funding organizations make it a sine qua non condition of project contracts.

Author information

Corresponding author

Correspondence to Oliver Streiter.

Additional information

All trademarks are hereby acknowledged.

About this article

Cite this article

Streiter, O., Scannell, K.P. & Stuflesser, M. Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers. Machine Translation 20, 267–289 (2006). https://doi.org/10.1007/s10590-007-9026-x

  • DOI: https://doi.org/10.1007/s10590-007-9026-x
