Abstract
A statistical approach to word alignment that automatically projects part-of-speech (POS) tags is presented. The approach is referred to as the “lazy man’s way” because it improves POS assignment for a resource-poor language by exploiting its similarity to a resource-rich one. This unsupervised learning method combines the N-gram and Dice coefficient similarity functions to align English texts with Malay texts, thereby projecting the POS tags from English to Malay. It is a quick method that avoids the laborious effort of annotating the Malay dataset. A case study, an experiment on 25 terrorism news articles written in Malay, shows that leveraging pre-existing resources from a resource-rich language (English) to supplement a resource-poor language (Malay) is feasible and avoids building new text-processing tools from scratch. The system was tested on the Malay corpus, which consists of 5413 word tokens, and achieved 86.87% precision, 72.56% recall and an F1-score of 79.07%. This shows that the “lazy man’s way”, in which a resource-poor language exploits the rich linguistic information available in English, increases bitext projection accuracy significantly.
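The abstract does not spell out how the N-gram and Dice coefficient functions are combined, so the following is only a minimal sketch under stated assumptions, not the authors' algorithm. It assumes that similarity is a Dice coefficient over character bigrams, that a small bilingual lexicon supplies Malay candidates for each English word, and that a tag is projected only when the best score reaches a threshold. The function names, the toy lexicon, the example sentence and the 0.5 threshold are all hypothetical. The last function checks that the reported F1-score follows from the reported precision and recall.

```python
def char_ngrams(word, n=2):
    """Return the character n-grams of a lowercased word."""
    w = word.lower()
    return [w[i:i + n] for i in range(len(w) - n + 1)]


def dice(a, b, n=2):
    """Dice coefficient over character n-gram sets: 2|X & Y| / (|X| + |Y|)."""
    x, y = set(char_ngrams(a, n)), set(char_ngrams(b, n))
    if not x and not y:
        return 0.0
    return 2 * len(x & y) / (len(x) + len(y))


def project_tags(tagged_english, malay_tokens, lexicon, threshold=0.5):
    """Copy the tag of the best-matching English word to each Malay token.

    lexicon maps an English word to candidate Malay translations
    (hypothetical toy data here). Tokens with no match at or above
    the threshold (an assumed value) are left untagged (None).
    """
    projected = []
    for malay in malay_tokens:
        best_tag, best_score = None, 0.0
        for eng_word, eng_tag in tagged_english:
            for candidate in lexicon.get(eng_word.lower(), []):
                score = dice(candidate, malay)
                if score > best_score:
                    best_tag, best_score = eng_tag, score
        projected.append((malay, best_tag if best_score >= threshold else None))
    return projected


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Hypothetical tagged English sentence and its Malay counterpart.
    english = [("police", "NNS"), ("arrested", "VBD"), ("suspects", "NNS")]
    malay = ["polis", "menahan", "suspek", "itu"]
    toy_lexicon = {
        "police": ["polis"],
        "arrested": ["menahan", "ditahan"],
        "suspects": ["suspek"],
    }
    print(project_tags(english, malay, toy_lexicon))
    # Consistency check on the figures reported in the abstract.
    print(round(f1_score(86.87, 72.56), 2))
```

In this sketch a Malay token with no sufficiently similar lexicon entry, such as "itu" above, stays untagged rather than receiving a guessed tag. That behaviour is consistent with a method whose precision is higher than its recall, although the abstract does not state that this is the cause.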
© 2012 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Zamin, N., Oxley, A., Abu Bakar, Z., Farhan, S.A. (2012). A Lazy Man’s Way to Part-of-Speech Tagging. In: Richards, D., Kang, B.H. (eds) Knowledge Management and Acquisition for Intelligent Systems. PKAW 2012. Lecture Notes in Computer Science, vol 7457. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-32541-0_9
DOI: https://doi.org/10.1007/978-3-642-32541-0_9
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-32540-3
Online ISBN: 978-3-642-32541-0
eBook Packages: Computer Science, Computer Science (R0)