DOI: 10.1145/3427423.3427431
research-article

Implementation of ontology-based on Word2Vec and DBSCAN for part-of-speech

Published: 28 December 2020

ABSTRACT

Part-of-speech (POS) tagging assigns each word in a text to an appropriate word class based on the word's definition and its relationships to other words. Several POS tagging approaches have been applied to Bahasa Indonesia, namely rule-based, stochastic, and neural approaches. An ontology-based approach has been applied to English but not yet to Bahasa Indonesia, so this study implements an ontology for POS tagging in Bahasa Indonesia. The ontology is constructed using Word2Vec and the DBSCAN clustering method: the Word2Vec model represents each word as a vector based on its context, and DBSCAN groups those word vectors into word classes. POS tagging with the ontology proceeds in several stages: data collection by web scraping online news articles from Kompas.com and Detik.com, text preprocessing, Word2Vec feature building, clustering with DBSCAN, ontology construction, and evaluation. The experiments select the optimal DBSCAN parameter values for forming the word clusters used in ontology construction. Overall, the ontology built with Word2Vec and DBSCAN performs POS tagging with a highest accuracy of 0.62, a highest precision of 0.79, a highest recall of 0.62, and a highest F1-score of 0.67.
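The clustering and parameter-selection steps described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the 2-D vectors are hand-made stand-ins for Word2Vec embeddings, the words and the NN/VB/CC tags are a toy gold standard, the DBSCAN is a small pure-Python version rather than the authors' implementation, and the majority-vote mapping from clusters to tags is a hypothetical choice, since the abstract does not say how clusters are turned into word classes.

```python
import math
from collections import Counter

NOISE = -1  # label for points that belong to no cluster

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_samples):
    """Return one cluster label per point; NOISE marks outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE  # may later be claimed as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:  # core point: keep expanding
                queue.extend(j_neighbors)
    return labels

def tag_accuracy(labels, gold):
    """Tag each cluster with its majority gold tag (an assumed scheme);
    noise points stay untagged and count as errors."""
    by_cluster = {}
    for lab, tag in zip(labels, gold):
        if lab != NOISE:
            by_cluster.setdefault(lab, Counter())[tag] += 1
    majority = {lab: c.most_common(1)[0][0] for lab, c in by_cluster.items()}
    hits = sum(1 for lab, tag in zip(labels, gold) if majority.get(lab) == tag)
    return hits / len(gold)

# Hypothetical toy data: hand-made 2-D vectors, not real Word2Vec output.
WORDS = ["rumah", "buku", "kota", "makan", "pergi", "baca", "dan"]
VECTORS = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
           (1.0, 1.0), (1.1, 1.0), (1.0, 1.1), (3.0, 0.0)]
GOLD = ["NN", "NN", "NN", "VB", "VB", "VB", "CC"]

# Sweep eps (min_samples fixed at 2) and keep the most accurate setting.
scores = {eps: tag_accuracy(dbscan(VECTORS, eps, 2), GOLD)
          for eps in (0.05, 0.2, 2.0)}
best_eps = max(scores, key=scores.get)
print(best_eps, round(scores[best_eps], 2))
```

In this toy setting a very small `eps` leaves every word as noise and a very large one merges the noun and verb groups, which is why a parameter search of the kind the abstract mentions matters. With real embeddings, a library implementation of DBSCAN and a distance suited to high-dimensional vectors would likely be preferable.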

    Published in

      ACM Other conferences
      SIET '20: Proceedings of the 5th International Conference on Sustainable Information Engineering and Technology
      November 2020
      277 pages
      ISBN: 9781450376051
      DOI: 10.1145/3427423

      Copyright © 2020 ACM

      Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Acceptance Rates

      SIET '20 paper acceptance rate: 45 of 57 submissions, 79%. Overall acceptance rate: 45 of 57 submissions, 79%.