Name Entity Recognition for Malay Texts Using Cross-Lingual Annotation Projection Approach

Zamin, Norshuhani; Bakar, Zainab Abu

doi:10.1007/978-3-319-21404-7_18

Norshuhani Zamin &
Zainab Abu Bakar

Conference paper
First Online: 01 January 2015

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 9155)

Abstract

Cross-lingual annotation projection methods can exploit resource-rich languages to improve the performance of Natural Language Processing (NLP) tasks in less-resourced languages. In this research, Malay serves as the less-resourced language and English as the resource-rich language. The research aims to ease the deadlock in Malay computational linguistics research, caused by the shortage of Malay tools and annotated corpora, by exploiting state-of-the-art English tools. This paper proposes an alignment method known as MEWA (Malay-English Word Aligner) that integrates the Dice Coefficient and a bigram string similarity measure, with little supervision, to automatically recognize three common named entities: person (PER), organization (ORG) and location (LOC). First, a test collection of Malay journalistic articles on Indonesian terrorism is established in three volumes of 646, 5413 and 10002 words. Second, a comparative study of selected state-of-the-art tools is conducted to evaluate their performance against the test collection. Third, MEWA is used to automatically induce annotations from the test collection and the identified English tool. An accuracy of 93% is achieved in a series of NE annotation projection experiments.
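
The abstract names the ingredients of MEWA but not its procedure, so the following is only a minimal sketch of how a Dice co-occurrence score and a character-bigram string similarity could be combined to project English NE tags onto Malay tokens. The function names, the equal weighting of the two scores, the 0.5 threshold and the toy sentences are all assumptions made for illustration; none of them come from the paper.

```python
from collections import Counter
from itertools import product


def char_bigrams(word):
    """Return the lowercase character bigrams of a word."""
    w = word.lower()
    return [w[i:i + 2] for i in range(len(w) - 1)]


def bigram_similarity(a, b):
    """Dice score over character-bigram multisets: 2*|A and B| / (|A| + |B|)."""
    ba, bb = Counter(char_bigrams(a)), Counter(char_bigrams(b))
    total = sum(ba.values()) + sum(bb.values())
    if total == 0:
        return 0.0
    overlap = sum((ba & bb).values())  # multiset intersection
    return 2.0 * overlap / total


def cooccurrence_dice(sentence_pairs):
    """Dice association of word pairs over sentence-aligned parallel text."""
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src_tokens, tgt_tokens in sentence_pairs:
        src_set, tgt_set = set(src_tokens), set(tgt_tokens)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        joint.update(product(src_set, tgt_set))
    return {
        (s, t): 2.0 * c / (src_count[s] + tgt_count[t])
        for (s, t), c in joint.items()
    }


def project_tags(english_tagged, malay_tokens, dice, threshold=0.5):
    """Copy each English NE tag to the best-scoring Malay token.

    The combined score is an assumed equal-weight average of the two
    measures; a token must score strictly above the threshold.
    """
    projected = {}
    for en_word, tag in english_tagged:
        if tag == "O":  # skip tokens that are not named entities
            continue
        best, best_score = None, threshold
        for my_word in malay_tokens:
            score = 0.5 * dice.get((en_word, my_word), 0.0) \
                + 0.5 * bigram_similarity(en_word, my_word)
            if score > best_score:
                best, best_score = my_word, score
        if best is not None:
            projected[best] = tag
    return projected


# Toy parallel data (hypothetical, not from the paper's test collection).
pairs = [
    (["police", "arrested", "Noordin"], ["polis", "menahan", "Noordin"]),
    (["police", "in", "Jakarta"], ["polis", "di", "Jakarta"]),
]
dice = cooccurrence_dice(pairs)
tagged = [("police", "O"), ("arrested", "O"), ("Noordin", "PER")]
print(project_tags(tagged, ["polis", "menahan", "Noordin"], dice))
# prints {'Noordin': 'PER'}
```

Averaging the two scores lets string similarity break ties that co-occurrence alone cannot resolve on small corpora: in the toy data, "Noordin" co-occurs equally often with "menahan" and "Noordin", and only the bigram overlap separates them.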

Author information

Authors and Affiliations

Faculty of Science and Information Technology, Universiti Teknologi PETRONAS, 32610, Bandar Seri Iskandar, Perak, Malaysia
Norshuhani Zamin
Faculty of Computer and Mathematical Sciences, Universiti Teknologi MARA, 40000, Shah Alam, Selangor, Malaysia
Zainab Abu Bakar

Corresponding author

Correspondence to Norshuhani Zamin.

Editor information

Editors and Affiliations

University of Perugia, Perugia, Italy
Osvaldo Gervasi
University of Basilicata, Potenza, Italy
Beniamino Murgante
Covenant University, Canaanland, Nigeria
Sanjay Misra
University of Calgary, Calgary, Alberta, Canada
Marina L. Gavrilova
University of Minho, Braga, Portugal
Ana Maria Alves Coutinho Rocha
Polytechnic University, Bari, Italy
Carmelo Torre
Monash University, Clayton, Victoria, Australia
David Taniar
Kyushu Sangyo University, Fukuoka, Japan
Bernady O. Apduhan

About this paper

Cite this paper

Zamin, N., Bakar, Z.A. (2015). Name Entity Recognition for Malay Texts Using Cross-Lingual Annotation Projection Approach. In: Gervasi, O., et al. Computational Science and Its Applications -- ICCSA 2015. ICCSA 2015. Lecture Notes in Computer Science, vol 9155. Springer, Cham. https://doi.org/10.1007/978-3-319-21404-7_18

DOI: https://doi.org/10.1007/978-3-319-21404-7_18
Published: 19 June 2015
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-21403-0
Online ISBN: 978-3-319-21404-7
eBook Packages: Computer Science, Computer Science (R0)
