Abstract
This research begins by distinguishing a small number of “central” languages from the “noncentral languages”, where centrality is measured by the extent to which a given language is supported by natural language processing tools and research. We analyse the conditions under which noncentral language projects (NCLPs) and central language projects are conducted, and establish a number of important differences that have far-reaching consequences for NCLPs. To overcome the difficulties inherent in NCLPs, traditional research strategies have to be reconsidered. Successful styles of scientific cooperation, such as those found in open-source software development or in the development of Wikipedia, provide alternative views of how NCLPs might be designed. We elaborate the concepts of free software and software pools and argue that NCLPs, in their own interest, should embrace an open-source approach for the resources they develop and pool these resources with other, similar open-source resources. The expected advantages of this approach are so important that we suggest that funding organizations make it a sine qua non condition of project contracts.
Additional information
All trademarks are hereby acknowledged.
Cite this article
Streiter, O., Scannell, K.P. & Stuflesser, M. Implementing NLP projects for noncentral languages: instructions for funding bodies, strategies for developers. Machine Translation 20, 267–289 (2006). https://doi.org/10.1007/s10590-007-9026-x
DOI: https://doi.org/10.1007/s10590-007-9026-x