ABSTRACT
Part-of-speech (POS) tagging is the process of assigning each word in a text to an appropriate word class based on the word's definition and its relationships to other words. Several POS tagging approaches have been applied to Bahasa Indonesia, namely rule-based, stochastic, and neural approaches. An ontology-based approach has also been applied to POS tagging in English, but not yet to Bahasa Indonesia, so this study implements an ontology for POS tagging in Bahasa Indonesia. The ontology is constructed using the Word2Vec model and the DBSCAN clustering method: Word2Vec represents each word as a vector based on its context, and DBSCAN groups those word vectors into word classes. POS tagging with the ontology proceeds in several stages: data collection by web scraping online news articles from Kompas.com and Detik.com, text preprocessing, Word2Vec feature building, clustering with DBSCAN, ontology construction, and evaluation. The experiments sought the optimal DBSCAN parameter values for forming the word clusters used in ontology construction. Overall, the ontology built with Word2Vec and DBSCAN performs POS tagging with a highest accuracy of 0.62, a highest precision of 0.79, a highest recall of 0.62, and a highest F1-score of 0.67.
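The clustering and tagging stages can be illustrated with a minimal sketch. It is not the authors' implementation. The 2-D vectors are hand-made stand-ins for Word2Vec output, and the `eps` and `min_samples` values, the seed tags, and the majority-vote naming of clusters are all assumptions chosen for illustration. DBSCAN is written out in plain Python so the example has no dependencies.

```python
import math
from collections import Counter

# Hypothetical 2-D stand-ins for Word2Vec vectors (real embeddings are much larger).
WORD_VECTORS = {
    "makan": (0.10, 0.90), "minum": (0.15, 0.85), "tidur": (0.12, 0.95),  # verb-like
    "rumah": (0.90, 0.10), "mobil": (0.85, 0.15), "buku": (0.95, 0.12),   # noun-like
    "dan": (0.50, 0.50),                                                  # isolated
}
# Hypothetical hand-labelled seed words used to name each cluster.
SEED_TAGS = {"makan": "VB", "rumah": "NN", "mobil": "NN"}
NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within eps of point i (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]


def dbscan(points, eps, min_samples):
    """Plain DBSCAN: returns one cluster id per point, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbours if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins but does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region_query(points, j, eps)
            if len(reachable) >= min_samples:  # core point: keep expanding
                queue.extend(reachable)
    return labels


def build_lexicon(words, labels, seed_tags, unknown="X"):
    """Name each cluster by majority vote of its seed words, then tag every word."""
    votes = {}
    for word, label in zip(words, labels):
        if label != NOISE and word in seed_tags:
            votes.setdefault(label, Counter())[seed_tags[word]] += 1
    cluster_tag = {c: counts.most_common(1)[0][0] for c, counts in votes.items()}
    return {w: cluster_tag.get(l, unknown) for w, l in zip(words, labels)}


words = list(WORD_VECTORS)
labels = dbscan([WORD_VECTORS[w] for w in words], eps=0.15, min_samples=3)
lexicon = build_lexicon(words, labels, SEED_TAGS)
print(lexicon)
```

In this toy setting, words that DBSCAN marks as noise receive the placeholder tag `X`. Choosing `eps` and `min_samples` changes which words fall into clusters at all, which is why the parameter search described in the abstract matters for tagging quality.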
Index Terms
- Implementation of ontology-based on Word2Vec and DBSCAN for part-of-speech