Creating Multilingual Parallel Corpora in Indian Languages

  • Conference paper
Human Language Technology Challenges for Computer Science and Linguistics (LTC 2011)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8387)

Abstract

This paper describes the parallel corpora being created simultaneously in 12 major Indian languages, including English, under a nationally funded project named the Indian Languages Corpora Initiative (ILCI), run through a consortium of institutions across India. The project runs in two phases. The first phase has two distinct goals: creating a parallel sentence-aligned corpus, and part-of-speech (POS) annotation of the corpora as per the recently evolved national standard under the Bureau of Indian Standards (BIS). This phase finishes in April 2012, and the next phase, with newer domains and more national languages, is likely to take off in May 2012. The goal of the current phase is to create parallel aligned POS-tagged corpora in 12 major Indian languages (including English), with Hindi as the source language, in the health and tourism domains. Additional languages and domains will be added in the next phase. With a goal of 25 thousand sentences in each domain, we find that the total number of words in each domain has reached up to 400 thousand, the largest for a parallel corpus in any pair of Indian languages. A careful attempt has been made to capture various types of texts. After analysing the domains, we divided each of the two domains into sub-domains and then looked for source text in those particular sub-domains. With a preferable structure of the corpora in mind, we also present our experiences in selecting the source text and recount problems such as judging how well each sub-domain is represented in the corpora. The POS annotation framework used for creating these corpora has also seen new changes in the POS tagsets, and we give a brief account of the framework being applied in this endeavor.

Notes

  1. http://www.ethnologue.com/show_country.asp?name=in (accessed: 4 September, 2011)

  2. As per Census of India, 2001: http://censusindia.gov.in/Census_Data_2001/Census_Data_Online/Language/Statement5.html

  3. No standard published reference can be given for this tagset as yet. We refer to the document circulated in the consortia meetings. This document was referred to as “Linguistic Resource Standards: Standards for POS Tagsets for Indian Languages”, ver. 005, August, 2010.

  4. http://www.sil.org/iso639-3/default.asp

Author information

Correspondence to Narayan Choudhary.

Appendix I: Super Set of POS Tags for Indian Languages

Sl. No.  Category (Category. Type. Subtype)  Label  Annotation convention
1        1 Noun                              N      N
2        1.1 Common                          NN     N_NN
3        1.2 Proper                          NNP    N_NNP
4        1.3 Verbal                          NNV    N_NNV
5        1.4 Nloc                            NST    N_NST
6        2 Pronoun                           PR     PR
7        2.1 Personal                        PRP    PR_PRP
8        2.2 Reflexive                       PRF    PR_PRF
9        2.3 Relative                        PRL    PR_PRL
10       2.4 Reciprocal                      PRC    PR_PRC
11       2.5 Wh-word                         PRQ    PR_PRQ
12       3 Demonstrative                     DM     DM
13       3.1 Deictic                         DMD    DM_DMD
14       3.2 Relative                        DMR    DM_DMR
15       3.3 Wh-word                         DMQ    DM_DMQ
16       4 Verb                              V      V
17       4.1 Main                            VM     V_VM
18       4.1.1 Finite                        VF     V_VM_VF
19       4.1.2 Non-finite                    VNF    V_VM_VNF
20       4.1.3 Infinitive                    VINF   V_VM_VINF
21       4.1.4 Gerund                        VNG    V_VM_VNG
22       4.2 Auxiliary                       VAUX   V_VAUX
23       5 Adjective                         JJ
24       6 Adverb                            RB
25       7 Postposition                      PSP
26       8 Conjunction                       CC     CC
27       8.1 Co-ordinator                    CCD    CC_CCD
28       8.2 Subordinator                    CCS    CC_CCS
29       8.2.1 Quotative                     UT     CC_CCS_UT
30       9 Particles                         RP     RP
31       9.1 Default                         RPD    RP_RPD
32       9.2 Classifier                      CL     RP_CL
33       9.3 Interjection                    INJ    RP_INJ
34       9.4 Intensifier                     INTF   RP_INTF
35       9.5 Negation                        NEG    RP_NEG
36       10 Quantifiers                      QT     QT
37       10.1 General                        QTF    QT_QTF
38       10.2 Cardinals                      QTC    QT_QTC
39       10.3 Ordinals                       QTO    QT_QTO
40       11 Residuals                        RD     RD
41       11.1 Foreign word                   RDF    RD_RDF
42       11.2 Symbol                         SYM    RD_SYM
43       11.3 Punctuation                    PUNC   RD_PUNC
44       11.4 Unknown                        UNK    RD_UNK
45       11.5 Echo-words                     ECH    RD_ECH
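The annotation conventions in Appendix I appear to follow one pattern: each tag is the label of the category joined, with underscores, to the labels of its ancestors in the hierarchy, top level first (for example V, VM and VF give V_VM_VF). The Python sketch below illustrates that pattern for a subset of the tagset. The PARENT dictionary and the function names annotation_convention and split_tag are assumptions introduced for illustration only; they are not part of the BIS standard or of any ILCI tool.

```python
# Illustrative subset of the Appendix I tagset, stored as label -> parent label.
# None marks a top-level category. This layout is an assumption made for the
# sketch, not a format defined by the paper or by the BIS standard.
PARENT = {
    "N": None, "NN": "N", "NNP": "N", "NNV": "N", "NST": "N",
    "V": None, "VM": "V", "VF": "VM", "VNF": "VM", "VINF": "VM",
    "VNG": "VM", "VAUX": "V",
    "CC": None, "CCD": "CC", "CCS": "CC", "UT": "CCS",
    "RP": None, "RPD": "RP", "CL": "RP", "INJ": "RP", "INTF": "RP", "NEG": "RP",
}


def annotation_convention(label):
    """Return the label prefixed by its ancestors, top level first, joined by '_'."""
    path = []
    current = label
    while current is not None:
        path.append(current)
        current = PARENT[current]  # raises KeyError for labels outside the subset
    return "_".join(reversed(path))


def split_tag(tag):
    """Split an annotation-convention tag back into its hierarchy of labels."""
    return tag.split("_")


if __name__ == "__main__":
    for label in ("N", "NN", "VF", "VAUX", "UT", "CL"):
        print(label, "->", annotation_convention(label))
```

Storing only the parent of each label keeps the hierarchy in one place, so the full tag can be derived rather than typed by hand. Labels outside the subset, including JJ, RB and PSP, for which the table lists no annotation convention, are deliberately left out.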

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Choudhary, N., Jha, G.N. (2014). Creating Multilingual Parallel Corpora in Indian Languages. In: Vetulani, Z., Mariani, J. (eds) Human Language Technology Challenges for Computer Science and Linguistics. LTC 2011. Lecture Notes in Computer Science, vol 8387. Springer, Cham. https://doi.org/10.1007/978-3-319-08958-4_43

  • DOI: https://doi.org/10.1007/978-3-319-08958-4_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-08957-7

  • Online ISBN: 978-3-319-08958-4

  • eBook Packages: Computer Science, Computer Science (R0)
